peter-toth opened a new pull request, #11683: URL: https://github.com/apache/datafusion/pull/11683
## Which issue does this PR close? Part of https://github.com/apache/datafusion/issues/11194. ## Rationale for this change This PR contains 2 ideas: 1. Unfortunately https://github.com/apache/datafusion/pull/10473 can cause preformance regression in some cases. This is because after that change `to_arrays()` return a boolean flag if it make sense to execute the 2nd rewriting traversal, that does the actual common expression extraction. E.g. if `found_common` is false `rewrite_expr()` is not executed: https://github.com/apache/datafusion/blob/204e1bcbaa5fcd3cf3cbe045f9832ee2f669e92f/datafusion/optimizer/src/common_subexpr_eliminate.rs#L609-L622 The problem with this approach is that `CommonSubexprEliminate` is not a `ApplyOrder::TopDown` type optimizer rule, but it organizes its traversal via calling `self.rewrite()` in `rewrite_expr()`. So if the top level node doesn't contain any common expressions the nodes below will not be considered for CSE. 2. The current rule is not optimal as extracted common expressions are not considered for sub-expression elimination in the current rule exection. This is because the rule recurses into the child plan nodes with `self.rewrite()` and then adds the new projection from the extracted common expressions: https://github.com/apache/datafusion/blob/204e1bcbaa5fcd3cf3cbe045f9832ee2f669e92f/datafusion/optimizer/src/common_subexpr_eliminate.rs#L278-L285 The issue with this approach is that even the new projection can contains sub-expressions to eliminate. E.g. a plan like ``` Projection: (test.a + test.b) * (test.a + test.b) AS c1, (test.a + test.b) * (test.a + test.b) AS c2 ... ``` can be rewritten to: ``` Projection: __common_expr_1 AS c1, __common_expr_1 AS c2 Projection: __common_expr_2 * __common_expr_2 AS __common_expr_1, test.a, test.b, test.c Projection: test.a + test.b AS __common_expr_2, test.a, test.b, test.c\ ... ``` but the current rule requires 2 rule executions (optimizer cycles) to reach the final plan. This can be improved by swapping the order of adding the new project and calling `self.rewrite()`. (I.e. we can make the rule top-down like.) ## What changes are included in this PR? This PR: - Changes `rewrite_expr()` into `find_common_exprs()` to extract common sub-expressions and rewrite an expression list. The step of recursing into child plan nodes is moved out from this method. This way `find_common_exprs()` can safely leverage the boolean of `to_arrays()` to skip the 2nd traversal. - Refactors `try_unary_plan()`, `try_optimize_aggregate()` and `try_optimize_window()`. ## Are these changes tested? Yes, added new UTs. ## Are there any user-facing changes? Yes, it fixes a possible performance regression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org