mustafasrepo commented on PR #4439: URL: https://github.com/apache/arrow-datafusion/pull/4439#issuecomment-1333331141
> #4332 @mingmwang Thanks for your feedback. I saw this problem while working on a PR that removes unnecessary `SortExec` before `WindowAggExec` to improve efficiency. Consider the query below ```sql SELECT c9, SUM(c9) OVER(ORDER BY c9) as sum1, SUM(c9) OVER(ORDER BY c9, c8) as sum2 FROM aggregate_test_100 ``` Its physical plan is as follows ```sql +---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: aggregate_test_100.c9, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST] AS sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST, aggregate_test_100.c8 ASC NULLS LAST] AS sum2 | | | WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST]]] | | | WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST, aggregate_test_100.c8 ASC NULLS LAST]]] | | | TableScan: aggregate_test_100 projection=[c8, c9] | | physical_plan | ProjectionExec: expr=[c9@3 as c9, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST]@0 as sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST, aggregate_test_100.c8 ASC NULLS LAST]@1 as sum2] | | | WindowAggExec: wdw=[SUM(aggregate_test_100.c9): Ok(Field { name: "SUM(aggregate_test_100.c9)", data_type: UInt64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })] | | | SortExec: [c9@2 ASC NULLS LAST] | | | WindowAggExec: wdw=[SUM(aggregate_test_100.c9): Ok(Field { name: "SUM(aggregate_test_100.c9)", data_type: UInt64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })] | | | SortExec: [c9@1 ASC NULLS LAST,c8@0 ASC NULLS LAST] | | | CsvExec: files=[Users/akurmustafa/projects/synnada/arrow-datafusion-synnada/testing/data/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c8, c9] | | | | +---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` Second (from the bottom) `SortExec` can be removed from the physical plan safely. To remove unnecessary `SortExec` from the physical plan, I compare the required sorting of `WindowAggExec` with its `input_exec.output_ordering()` at [here](https://github.com/apache/arrow-datafusion/blob/fdc83e8524df30ac5d0ae097572b7c48dc686ba9/datafusion/core/src/physical_plan/planner.rs#L518). Second `WindowAggExec` requires input_ordering `c9@2 ASC NULLS LAST`. Its input `WindowAggExec` should have `output_ordering` , `c9@1 ASC NULLS LAST,c8@0 ASC NULLS LAST`. However, `output_ordering` of the first `WindowAggExec` returns `None`. The reason is that since we did not put `SortExec` before it yet (It is is done at optimization pass), its input is `CsvExec` and its `ouput_ordering` is `None`. In summary, unless we pass from `BasicEnforcement` optimization pass, the output ordering information of `WindowAggExec` would give output ordering of Executor before `SortExec` in the final physical plan. > @mustafasrepo Could you please explain more specifically why you can not rely on the current `output_ordering` result? Do you have your own physical optimizer rules? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org