alamb opened a new pull request, #4885: URL: https://github.com/apache/arrow-datafusion/pull/4885
# Which issue does this PR close?

Fixes https://github.com/apache/arrow-datafusion/issues/4883

The fix is one line. The rest of this PR is documentation and tests.

# Rationale for this change

The repartition optimizer pass is destroying a pre-existing sort order by repartitioning the data. The plan is actually producing correct answers (which is good), but it was doing so by re-sorting the data :(

There is much more backstory on https://github.com/apache/arrow-datafusion/issues/4883

# What changes are included in this PR?

# Are these changes tested?

Yes, new tests.

Without this change the tests fail like this (a `RepartitionExec` is added above the `ParquetExec`):

```
---- physical_optimizer::repartition::tests::repartition_does_not_destroy_sort_more_complex stdout ----
thread 'physical_optimizer::repartition::tests::repartition_does_not_destroy_sort_more_complex' panicked at 'assertion failed: `(left == right)`
  left: `["UnionExec", "SortRequiredExec", "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]", "FilterExec: c1@0", "RepartitionExec: partitioning=RoundRobinBatch(10)", "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]"]`,
 right: `["UnionExec", "SortRequiredExec", "RepartitionExec: partitioning=RoundRobinBatch(10)", "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]", "FilterExec: c1@0", "RepartitionExec: partitioning=RoundRobinBatch(10)", "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]"]`:

expected:

[
    "UnionExec",
    "SortRequiredExec",
    "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]",
    "FilterExec: c1@0",
    "RepartitionExec: partitioning=RoundRobinBatch(10)",
    "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]",
]
actual:

[
    "UnionExec",
    "SortRequiredExec",
    "RepartitionExec: partitioning=RoundRobinBatch(10)",
    "ParquetExec: limit=None, partitions={1 group: [[x]]}, output_ordering=[c1@0 ASC], projection=[c1]",
    "FilterExec: c1@0",
    "RepartitionExec: partitioning=RoundRobinBatch(10)",
    "ParquetExec: limit=None, partitions={1 group: [[x]]}, projection=[c1]",
]
```

# Are there any user-facing changes?

I am not sure if other users can hit this bug. We hit it in IOx, and I think UnboundedWindowExec and MergeJoin are susceptible to the same problem, but I am not sure how widely used they are.
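
For readers unfamiliar with the problem, here is a toy illustration of why a round-robin repartition placed above an already sorted scan loses the single sorted stream that a downstream operator relied on. This is only a sketch using plain vectors, not DataFusion code: `round_robin` and `is_sorted` are hypothetical helpers, and the batch contents are made up for the example.

```rust
/// Toy model (not DataFusion code): hand out batches to `n` output
/// partitions in round-robin order, batch by batch.
fn round_robin(batches: &[Vec<i32>], n: usize) -> Vec<Vec<i32>> {
    let mut partitions = vec![Vec::new(); n];
    for (i, batch) in batches.iter().enumerate() {
        partitions[i % n].extend_from_slice(batch);
    }
    partitions
}

/// Returns true if `values` is in non-decreasing order.
fn is_sorted(values: &[i32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

fn main() {
    // One sorted input stream, delivered as four batches
    // (standing in for a scan that reports an output ordering).
    let batches = vec![vec![1, 2], vec![3, 4], vec![5, 6], vec![7, 8]];

    // Spread the batches over two output partitions.
    let partitions = round_robin(&batches, 2);

    // Reading the partitions back one after another no longer gives a
    // single sorted stream, so a consumer that needs sorted input has
    // to sort again (or merge the partitions in an order-aware way).
    let combined: Vec<i32> = partitions.concat();
    println!("combined = {:?}, sorted = {}", combined, is_sorted(&combined));
}
```

In this toy model each individual partition happens to stay sorted, but the single sorted stream is gone, which is consistent with the plan described above having to re-sort to stay correct.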
