alamb opened a new issue, #4331:
URL: https://github.com/apache/arrow-datafusion/issues/4331

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   After https://github.com/apache/arrow-datafusion/pull/4122 some of our plans 
look like this:
   
   ```text
   +                "+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+",
   +                "| plan_type     | plan |",
   +                "+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+",
   +                "| logical_plan  | Sort: cpu.host ASC NULLS LAST, cpu.load ASC NULLS LAST, cpu.time ASC NULLS LAST |",
   +                "|               |   Projection: cpu.host, cpu.load, cpu.time |",
   +                "|               |     TableScan: cpu projection=[host, load, time] |",
   +                "| physical_plan | SortExec: [host@0 ASC NULLS LAST,load@1 ASC NULLS LAST,time@2 ASC NULLS LAST] |",
   +                "|               |   ProjectionExec: expr=[host@0 as host, load@1 as load, time@2 as time] |",
   +                "|               |     DeduplicateExec: [host@0 ASC,time@2 ASC] |",
   +                "|               |       SortPreservingMergeExec: [host@0 ASC,time@2 ASC] |",
   +                "|               |         SortExec: [host@0 ASC,time@2 ASC] |",
   +                "|               |           UnionExec |",
   +                "|               |             CoalesceBatchesExec: target_batch_size=4096 |",
   +                "|               |               FilterExec: time@2 < -9223372036854775808 OR time@2 > -3600000000000 |",
   +                "|               |                 ParquetExec: limit=None, partitions=[1/1/1/1/<uuid>.parquet], output_ordering=[host@0 ASC, time@2 ASC], projection=[host, load, time] |",
   +                "|               |             CoalesceBatchesExec: target_batch_size=4096 |",
   +                "|               |               FilterExec: time@2 < -9223372036854775808 OR time@2 > -3600000000000 |",
   +                "|               |                 ParquetExec: limit=None, partitions=[1/1/1/1/<uuid>.parquet], output_ordering=[host@0 ASC, time@2 ASC], projection=[host, load, time] |",
   +                "|               | |",
   +                "+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+",
   ```
   
   The `CoalesceBatchesExec` has been added after the `FilterExec` to make sure 
the output `RecordBatch`es are reasonably sized.
   
   However, the `BasicEnforcement` rule has added an unnecessary `SortExec`:
   
   ```
   +                "|               |         SortExec: [host@0 ASC,time@2 ASC] |",
   +                "|               |           UnionExec |",
   +                "|               |             CoalesceBatchesExec: target_batch_size=4096 |",
   +                "|               |               FilterExec: time@2 < -9223372036854775808 OR time@2 > -3600000000000 |",
   +                "|               |                 ParquetExec: limit=None, partitions=[1/1/1/1/<uuid>.parquet], output_ordering=[host@0 ASC, time@2 ASC], projection=[host, load, time] |",
   ```
   
   It is unnecessary because the data coming out of the `ParquetExec` is already 
ordered by `host` and `time`, as its `output_ordering` shows.
   
   
   **Describe the solution you'd like**
   The `SortExec` that runs after `CoalesceBatchesExec` should not be present.
   
   I believe the issue is that `CoalesceBatchesExec` reports that its output is 
not sorted:
   
   
https://github.com/apache/arrow-datafusion/blob/7c07e4d77aa2da60a83cc3558643eeac01fa98ce/datafusion/core/src/physical_plan/coalesce_batches.rs#L99-L101
   
   In practice, `CoalesceBatchesExec` preserves the ordering of its input, and 
thus should report the same output ordering as its child.
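
The proposed behavior can be sketched with a simplified model. This is not the DataFusion API: `SortKey`, `PlanNode`, `SortedScan`, `CoalesceBatches`, and `needs_sort` are hypothetical names used only to illustrate how forwarding the input's ordering lets a planner skip the extra sort.

```rust
// Simplified model, NOT the DataFusion API: every name here is hypothetical.

/// One sort key: a column name plus a direction.
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: &'static str,
    pub ascending: bool,
}

/// Minimal stand-in for a physical plan node.
pub trait PlanNode {
    /// The ordering this node guarantees on its output, if any.
    fn output_ordering(&self) -> Option<&[SortKey]>;
}

/// Leaf node whose data is already sorted (like a scan of a sorted file).
pub struct SortedScan {
    pub ordering: Vec<SortKey>,
}

impl PlanNode for SortedScan {
    fn output_ordering(&self) -> Option<&[SortKey]> {
        Some(&self.ordering)
    }
}

/// Node that only concatenates small batches into larger ones.
/// It never reorders rows, so it can forward its input's ordering.
pub struct CoalesceBatches {
    pub input: Box<dyn PlanNode>,
}

impl PlanNode for CoalesceBatches {
    fn output_ordering(&self) -> Option<&[SortKey]> {
        // Returning `None` here is what would force a planner to add a sort.
        self.input.output_ordering()
    }
}

/// A sort is only needed when the required keys are not a prefix of the
/// ordering the node already provides.
pub fn needs_sort(node: &dyn PlanNode, required: &[SortKey]) -> bool {
    match node.output_ordering() {
        Some(actual) => !actual.starts_with(required),
        None => true,
    }
}
```

With this model, wrapping a `SortedScan` ordered by `host, time` in `CoalesceBatches` still reports `host, time`, so `needs_sort` returns `false` for that requirement. Any other order-preserving node between the scan and the sort would need to forward its input ordering in the same way.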
   
   
   **Additional context**
   
   See https://github.com/influxdata/influxdb_iox/pull/6160, where it took us a 
while to work through the ramifications of 
https://github.com/apache/arrow-datafusion/pull/4122


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
