alamb opened a new pull request, #4885:
URL: https://github.com/apache/arrow-datafusion/pull/4885

   # Which issue does this PR close?
   
   Fixes https://github.com/apache/arrow-datafusion/issues/4883
   
   The fix is one line. The rest of this PR is documentation and tests.
   
   # Rationale for this change
   
   The repartition optimizer pass destroys a pre-existing sort order by 
repartitioning the data. The plan still produces correct answers (which is 
good), but only by re-sorting the data :(
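
To illustrate why repartitioning can lose an existing order, here is a toy sketch in plain Rust. It does not use DataFusion: integer vectors stand in for record batches, and `round_robin` and `is_sorted` are hypothetical helper names invented for this example. It only shows that a stream sorted as a single partition is no longer globally sorted once its batches are dealt out round-robin across several partitions.

```rust
/// Toy stand-in for round-robin repartitioning: batch `i` is sent to
/// output partition `i % n`. Not a DataFusion API.
fn round_robin(batches: Vec<Vec<i32>>, n: usize) -> Vec<Vec<i32>> {
    let mut partitions = vec![Vec::new(); n];
    for (i, batch) in batches.into_iter().enumerate() {
        partitions[i % n].extend(batch);
    }
    partitions
}

/// Returns true if `values` is in non-decreasing order.
fn is_sorted(values: &[i32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

fn main() {
    // One input partition whose batches are globally sorted.
    let batches = vec![vec![1, 2], vec![3, 4], vec![5, 6]];
    let flat: Vec<i32> = batches.concat();
    assert!(is_sorted(&flat));

    // Partition 0 receives batches 0 and 2, partition 1 receives batch 1.
    let partitions = round_robin(batches, 2);

    // Reading the partitions one after another no longer yields sorted
    // data, so a consumer that needs the order has to sort again.
    let concatenated: Vec<i32> = partitions.concat();
    println!("{:?}", concatenated);
    assert!(!is_sorted(&concatenated));
}
```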
   
   There is much more backstory in 
https://github.com/apache/arrow-datafusion/issues/4883
   
   # What changes are included in this PR?
   
   
   
   # Are these changes tested?
   Yes, there are new tests.
   
   
   Without this change, the new tests fail as shown below (the optimizer adds 
a `RepartitionExec` above the sorted `ParquetExec`):
   
   ```
   ---- physical_optimizer::repartition::tests::repartition_does_not_destroy_sort_more_complex stdout ----
   thread 'physical_optimizer::repartition::tests::repartition_does_not_destroy_sort_more_complex' panicked at 'assertion failed: `(left == right)`
     left: `["UnionExec", "SortRequiredExec", "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]", "FilterExec: c1@0", "RepartitionExec: partitioning=RoundRobinBatch(10)", "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]"]`,
    right: `["UnionExec", "SortRequiredExec", "RepartitionExec: partitioning=RoundRobinBatch(10)", "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]", "FilterExec: c1@0", "RepartitionExec: partitioning=RoundRobinBatch(10)", "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]"]`:
   
   expected:
   
   [
       "UnionExec",
       "SortRequiredExec",
       "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]",
       "FilterExec: c1@0",
       "RepartitionExec: partitioning=RoundRobinBatch(10)",
       "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]",
   ]
   actual:
   
   [
       "UnionExec",
       "SortRequiredExec",
       "RepartitionExec: partitioning=RoundRobinBatch(10)",
       "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]",
       "FilterExec: c1@0",
       "RepartitionExec: partitioning=RoundRobinBatch(10)",
       "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]",
   ]
   ```
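
The behaviour the test above checks can be sketched with a toy plan tree. This is a simplified model under stated assumptions, not DataFusion's optimizer code: the `Plan` enum, the `optimize` function, and the `can_reorder` flag are all hypothetical names made up for illustration. The idea it models is that a repartition is added under a branch only if no ancestor in that branch relies on the input order.

```rust
/// Hypothetical, simplified plan tree (not DataFusion's `ExecutionPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { sorted: bool },
    Filter(Box<Plan>),
    SortRequired(Box<Plan>),
    Repartition(Box<Plan>),
    Union(Vec<Plan>),
}

/// Toy repartition pass: wrap each scan in a `Repartition` unless an
/// ancestor requires the scan's order (`can_reorder == false`).
fn optimize(plan: Plan, can_reorder: bool) -> Plan {
    match plan {
        Plan::Scan { sorted } => {
            let scan = Plan::Scan { sorted };
            if can_reorder {
                Plan::Repartition(Box::new(scan))
            } else {
                scan
            }
        }
        Plan::Filter(child) => Plan::Filter(Box::new(optimize(*child, can_reorder))),
        // This node relies on its input order, so nothing below it may be reordered.
        Plan::SortRequired(child) => Plan::SortRequired(Box::new(optimize(*child, false))),
        // Already repartitioned: leave the subtree untouched.
        already @ Plan::Repartition(_) => already,
        Plan::Union(children) => Plan::Union(
            children
                .into_iter()
                .map(|child| optimize(child, can_reorder))
                .collect(),
        ),
    }
}

fn main() {
    // Same shape as the plan in the test: one sorted branch, one unsorted branch.
    let plan = Plan::Union(vec![
        Plan::SortRequired(Box::new(Plan::Scan { sorted: true })),
        Plan::Filter(Box::new(Plan::Scan { sorted: false })),
    ]);
    println!("{:#?}", optimize(plan, true));
}
```

In this model only the unsorted branch gains a `Repartition`, which mirrors the "expected" plan in the test output.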
   
   # Are there any user-facing changes?
   
   I am not sure whether other users can hit this bug. We hit it in IOx, and 
I think UnboundedWindowExec and MergeJoin are susceptible to the same problem, 
but I am not sure how widely they are used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
