jonahgao opened a new issue, #8296:
URL: https://github.com/apache/arrow-datafusion/issues/8296
### Describe the bug
Rule `push_down_projection` negates the optimization results of
`common_sub_expression_eliminate`, but leaves the useless aliases introduced by
`common_sub_expression_eliminate`. Those added aliases change the signature of
the logical plan, also causing the optimizer to never reach the fixed point.
In summary, there are two problems:
1. The optimization of `common_sub_expression_elimination` did not work.
2. Ineffective optimization leads to the optimizer not being able to exit
early until it reaches the limit of `datafusion.optimizer.max_passes`.
### To Reproduce
Run `explain verbose select a/2, a/2 + 1 from t` in CLI
```sh
DataFusion CLI v33.0.0
❯ create table t(a bigint);
0 rows in set. Query took 0.006 seconds.
❯ explain verbose select a/2, a/2 + 1 from t;
| logical_plan after common_sub_expression_eliminate | Projection:
t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a /
Int64(2) + Int64(1)
|
| | Projection:
t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a, t.a
|
| |
TableScan: t
| ...
|
| logical_plan after push_down_projection | Projection:
t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2) + Int64(1)
|
| | TableScan:
t projection=[a]
❯ explain select a/2, a/2 + 1 from t;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2)
AS t.a / Int64(2) + Int64(1) |
| | TableScan: t projection=[a]
|
| physical_plan | ProjectionExec: expr=[a@0 / 2 as t.a / Int64(2), a@0 / 2 +
1 as t.a / Int64(2) + Int64(1)]
|
| | MemoryExec: partitions=1, partition_sizes=[0]
|
| |
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
The final projection is:
- `a/2`: t.a / Int64(2) **AS** t.a / Int64(2) **AS** t.a / Int64(2) **AS**
t.a / Int64(2)
- `a/2+1`: t.a / Int64(2) **AS** t.a / Int64(2) **AS** t.a / Int64(2) **AS**
t.a / Int64(2) + Int64(1)
Three duplicate aliases appeared, corresponding to
`datafusion.optimizer.max_passes=3`.
If I set `datafusion.optimizer.max_passes=10`
```sh
❯ set datafusion.optimizer.max_passes=10;
0 rows in set. Query took 0.002 seconds.
❯ explain select a/2, a/2 + 1 from t;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a /
Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2)
AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a
/ Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1) |
| | TableScan: t projection=[a]
|
| physical_plan | ProjectionExec: expr=[a@0 / 2 as t.a / Int64(2), a@0 / 2 +
1 as t.a / Int64(2) + Int64(1)]
|
| | MemoryExec: partitions=1, partition_sizes=[0]
|
| |
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.025 seconds.
```
One of the expr in the final projection will be `t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a /
Int64(2) AS t.a / Int64(2)`.
There are too many unnecessary aliases.
### Expected behavior
The logic `merge_projection` inside the rule `push_down_projection` should
not undo `common_sub_expression_eliminate`.
### Additional context
I plan to work on this in the next few days.
My initial thought is to fix it in the `push_down_projection` side.
Do not execute `merge_projection` if an expression in the child projection
satisfies the following conditions:
1. It has been referenced by the parent projection two or more times.
2. Its evaluation is non-trivial. Its type is not `Column` or `Literal`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]