alamb opened a new issue, #4331: URL: https://github.com/apache/arrow-datafusion/issues/4331
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

After https://github.com/apache/arrow-datafusion/pull/4122 some of our plans look like this:

```text
+ "+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+",
+ "| plan_type     | plan |",
+ "+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+",
+ "| logical_plan  | Sort: cpu.host ASC NULLS LAST, cpu.load ASC NULLS LAST, cpu.time ASC NULLS LAST |",
+ "|               |   Projection: cpu.host, cpu.load, cpu.time |",
+ "|               |     TableScan: cpu projection=[host, load, time] |",
+ "| physical_plan | SortExec: [host@0 ASC NULLS LAST,load@1 ASC NULLS LAST,time@2 ASC NULLS LAST] |",
+ "|               |   ProjectionExec: expr=[host@0 as host, load@1 as load, time@2 as time] |",
+ "|               |     DeduplicateExec: [host@0 ASC,time@2 ASC] |",
+ "|               |       SortPreservingMergeExec: [host@0 ASC,time@2 ASC] |",
+ "|               |         SortExec: [host@0 ASC,time@2 ASC] |",
+ "|               |           UnionExec |",
+ "|               |             CoalesceBatchesExec: target_batch_size=4096 |",
+ "|               |               FilterExec: time@2 < -9223372036854775808 OR time@2 > -3600000000000 |",
+ "|               |                 ParquetExec: limit=None, partitions=[1/1/1/1/<uuid>.parquet], output_ordering=[host@0 ASC, time@2 ASC], projection=[host, load, time] |",
+ "|               |             CoalesceBatchesExec: target_batch_size=4096 |",
+ "|               |               FilterExec: time@2 < -9223372036854775808 OR time@2 > -3600000000000 |",
+ "|               |                 ParquetExec: limit=None, partitions=[1/1/1/1/<uuid>.parquet], output_ordering=[host@0 ASC, time@2 ASC], projection=[host, load, time] |",
+ "|               |  |",
+ "+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+",
```

The `CoalesceBatchesExec` has been added after the `FilterExec` to make sure the RecordBatches are reasonably sized. However, the `BasicEnforcement` rule has added an unnecessary `SortExec`:

```
+ "|               |         SortExec: [host@0 ASC,time@2 ASC] |",
+ "|               |           UnionExec |",
+ "|               |             CoalesceBatchesExec: target_batch_size=4096 |",
+ "|               |               FilterExec: time@2 < -9223372036854775808 OR time@2 > -3600000000000 |",
+ "|               |                 ParquetExec: limit=None, partitions=[1/1/1/1/<uuid>.parquet], output_ordering=[host@0 ASC, time@2 ASC], projection=[host, load, time] |",
```

It is unnecessary because the data coming out of the `ParquetExec` is already ordered (by host and time).

**Describe the solution you'd like**

The `SortExec` after `CoalesceBatchesExec` should not be present.

I believe the issue is that `CoalesceBatchesExec` says its output is not sorted:

https://github.com/apache/arrow-datafusion/blob/7c07e4d77aa2da60a83cc3558643eeac01fa98ce/datafusion/core/src/physical_plan/coalesce_batches.rs#L99-L101

In fact, `CoalesceBatchesExec` preserves the ordering of its input, and thus should report the same output ordering as its child.

**Describe alternatives you've considered**

A clear and concise description of any alternative solutions or features you've considered.

**Additional context**

https://github.com/influxdata/influxdb_iox/pull/6160, where it took us a while to work through the ramifications of https://github.com/apache/arrow-datafusion/pull/4122
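
To illustrate the behaviour proposed under "Describe the solution you'd like", here is a minimal, self-contained sketch. It does not use DataFusion's real types: `SortKey`, `Plan`, `SortedScan`, `CoalesceBatches`, `needs_sort` and `host_time_ordering` are all hypothetical stand-ins, and the `FilterExec` between the scan and the coalesce step is left out for brevity. The only point is that a node which does not reorder rows can forward its input's ordering, so an enforcement check no longer sees a reason to add a sort.

```rust
use std::sync::Arc;

/// Simplified stand-in for a sort key such as `host@0 ASC` (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: String,
    pub ascending: bool,
}

/// Minimal stand-in for a physical plan node (not DataFusion's real trait).
pub trait Plan {
    /// Ordering of the rows this node emits, or `None` if it is unknown.
    fn output_ordering(&self) -> Option<&[SortKey]>;
}

/// A scan over already-sorted files, like the `ParquetExec` in the plan above.
pub struct SortedScan {
    pub ordering: Vec<SortKey>,
}

impl Plan for SortedScan {
    fn output_ordering(&self) -> Option<&[SortKey]> {
        Some(&self.ordering)
    }
}

/// Concatenates small batches into larger ones without reordering rows.
pub struct CoalesceBatches {
    pub input: Arc<dyn Plan>,
}

impl Plan for CoalesceBatches {
    fn output_ordering(&self) -> Option<&[SortKey]> {
        // Proposed behaviour: forward the input's ordering instead of `None`.
        self.input.output_ordering()
    }
}

/// Simplified enforcement check: is an extra sort needed above `plan`?
pub fn needs_sort(plan: &dyn Plan, required: &[SortKey]) -> bool {
    match plan.output_ordering() {
        // The requirement is met if it is a prefix of the existing ordering.
        Some(existing) => !existing.starts_with(required),
        None => true,
    }
}

/// Builds the `[host ASC, time ASC]` ordering used in the plan above.
pub fn host_time_ordering() -> Vec<SortKey> {
    vec![
        SortKey { column: "host".to_string(), ascending: true },
        SortKey { column: "time".to_string(), ascending: true },
    ]
}
```

With this model, wrapping a `SortedScan` ordered by `host_time_ordering()` in `CoalesceBatches` and calling `needs_sort` with the same required ordering returns `false`. If `output_ordering` returned `None` instead, the same call would return `true`, which corresponds to the extra `SortExec` in the plan.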
