mustafasrepo commented on PR #4439:
URL:
https://github.com/apache/arrow-datafusion/pull/4439#issuecomment-1333331141
> #4332
@mingmwang Thanks for your feedback. I saw this problem while working on a
PR that removes unnecessary `SortExec` before `WindowAggExec` to improve
efficiency. Consider the query below
```sql
SELECT
c9,
SUM(c9) OVER(ORDER BY c9) as sum1,
SUM(c9) OVER(ORDER BY c9, c8) as sum2
FROM aggregate_test_100
```
Its physical plan is as follows
```sql
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: aggregate_test_100.c9,
SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST] AS
sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS
LAST, aggregate_test_100.c8 ASC NULLS LAST] AS sum2 |
| | WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9)
ORDER BY [aggregate_test_100.c9 ASC NULLS LAST]]]
|
| | WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9)
ORDER BY [aggregate_test_100.c9 ASC NULLS LAST, aggregate_test_100.c8 ASC NULLS
LAST]]]
|
| | TableScan: aggregate_test_100 projection=[c8, c9]
|
| physical_plan | ProjectionExec: expr=[c9@3 as c9,
SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST]@0 as
sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c9 ASC NULLS
LAST, aggregate_test_100.c8 ASC NULLS LAST]@1 as sum2] |
| | WindowAggExec: wdw=[SUM(aggregate_test_100.c9): Ok(Field
{ name: "SUM(aggregate_test_100.c9)", data_type: UInt64, nullable: true,
dict_id: 0, dict_is_ordered: false, metadata: None })]
|
| | SortExec: [c9@2 ASC NULLS LAST]
|
| | WindowAggExec: wdw=[SUM(aggregate_test_100.c9):
Ok(Field { name: "SUM(aggregate_test_100.c9)", data_type: UInt64, nullable:
true, dict_id: 0, dict_is_ordered: false, metadata: None })]
|
| | SortExec: [c9@1 ASC NULLS LAST,c8@0 ASC NULLS
LAST]
|
| | CsvExec:
files=[Users/akurmustafa/projects/synnada/arrow-datafusion-synnada/testing/data/csv/aggregate_test_100.csv],
has_header=true, limit=None, projection=[c8, c9]
|
| |
|
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
Second (from the bottom) `SortExec` can be removed from the physical plan
safely. To remove unnecessary `SortExec` from the physical plan, I compare the
required sorting of `WindowAggExec` with its `input_exec.output_ordering()` at
[here](https://github.com/apache/arrow-datafusion/blob/fdc83e8524df30ac5d0ae097572b7c48dc686ba9/datafusion/core/src/physical_plan/planner.rs#L518).
Second `WindowAggExec` requires input_ordering `c9@2 ASC NULLS LAST`. Its
input `WindowAggExec` should have `output_ordering` , `c9@1 ASC NULLS LAST,c8@0
ASC NULLS LAST`. However, `output_ordering` of the first `WindowAggExec`
returns `None`. The reason is that since we did not put `SortExec` before it
yet (It is is done at optimization pass), its input is `CsvExec` and its
`ouput_ordering` is `None`.
In summary, unless we pass from `BasicEnforcement` optimization pass, the
output ordering information of `WindowAggExec` would give output ordering of
Executor before `SortExec` in the final physical plan.
> @mustafasrepo Could you please explain more specifically why you can not
rely on the current `output_ordering` result? Do you have your own physical
optimizer rules?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]