mustafasrepo commented on PR #4439:
URL: 
https://github.com/apache/arrow-datafusion/pull/4439#issuecomment-1333331141

   > #4332
   @mingmwang Thanks for your feedback. I saw this problem while working on a 
PR that removes unnecessary `SortExec` before `WindowAggExec` to improve 
efficiency. Consider the query below 
   
   ```sql
   SELECT
         c9,
         SUM(c9) OVER(ORDER BY c9) as sum1,
         SUM(c9) OVER(ORDER BY c9, c8) as sum2
         FROM aggregate_test_100
   ```
   
   Its physical plan is as follows 
   
   ```sql
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                         |
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: aggregate_test_100.c9, 
SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST] AS 
sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS 
LAST, aggregate_test_100.c8 ASC NULLS LAST] AS sum2     |
   |               |   WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) 
ORDER BY [aggregate_test_100.c9 ASC NULLS LAST]]]                               
                                                                                
                              |
   |               |     WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) 
ORDER BY [aggregate_test_100.c9 ASC NULLS LAST, aggregate_test_100.c8 ASC NULLS 
LAST]]]                                                                         
                            |
   |               |       TableScan: aggregate_test_100 projection=[c8, c9]    
                                                                                
                                                                                
                         |
   | physical_plan | ProjectionExec: expr=[c9@3 as c9, 
SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST]@0 as 
sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS 
LAST, aggregate_test_100.c8 ASC NULLS LAST]@1 as sum2] |
   |               |   WindowAggExec: wdw=[SUM(aggregate_test_100.c9): Ok(Field 
{ name: "SUM(aggregate_test_100.c9)", data_type: UInt64, nullable: true, 
dict_id: 0, dict_is_ordered: false, metadata: None })]                          
                                |
   |               |     SortExec: [c9@2 ASC NULLS LAST]                        
                                                                                
                                                                                
                         |
   |               |       WindowAggExec: wdw=[SUM(aggregate_test_100.c9): 
Ok(Field { name: "SUM(aggregate_test_100.c9)", data_type: UInt64, nullable: 
true, dict_id: 0, dict_is_ordered: false, metadata: None })]                    
                                  |
   |               |         SortExec: [c9@1 ASC NULLS LAST,c8@0 ASC NULLS 
LAST]                                                                           
                                                                                
                              |
   |               |           CsvExec: 
files=[Users/akurmustafa/projects/synnada/arrow-datafusion-synnada/testing/data/csv/aggregate_test_100.csv],
 has_header=true, limit=None, projection=[c8, c9]                               
                                     |
   |               |                                                            
                                                                                
                                                                                
                         |
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   
   Second (from the bottom) `SortExec` can be removed from the physical plan 
safely. To remove unnecessary `SortExec` from the physical plan, I compare the 
required sorting of `WindowAggExec` with its `input_exec.output_ordering()` at 
[here](https://github.com/apache/arrow-datafusion/blob/fdc83e8524df30ac5d0ae097572b7c48dc686ba9/datafusion/core/src/physical_plan/planner.rs#L518).
 Second `WindowAggExec` requires input_ordering `c9@2 ASC NULLS LAST`. Its 
input `WindowAggExec` should have `output_ordering` , `c9@1 ASC NULLS LAST,c8@0 
ASC NULLS LAST`. However, `output_ordering` of the first `WindowAggExec` 
returns `None`. The reason is that since we did not put `SortExec` before it 
yet (It is is done at optimization pass), its input is `CsvExec` and its 
`ouput_ordering` is `None`. 
   In summary, unless we pass from `BasicEnforcement` optimization pass, the 
output ordering information of `WindowAggExec` would give output ordering of 
Executor before `SortExec` in the final physical plan.
   
   
   > @mustafasrepo Could you please explain more specifically why you can not 
rely on the current `output_ordering` result? Do you have your own physical 
optimizer rules?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to