peter-toth opened a new pull request, #11683:
URL: https://github.com/apache/datafusion/pull/11683

   ## Which issue does this PR close?
   
   Part of https://github.com/apache/datafusion/issues/11194.
   
   ## Rationale for this change
   
   This PR contains 2 ideas:
   1. Unfortunately https://github.com/apache/datafusion/pull/10473 can cause 
preformance regression in some cases.
   This is because after that change `to_arrays()` return a boolean flag if it 
make sense to execute the 2nd rewriting traversal, that does the actual common 
expression extraction. E.g. if `found_common` is false `rewrite_expr()` is not 
executed:
   
https://github.com/apache/datafusion/blob/204e1bcbaa5fcd3cf3cbe045f9832ee2f669e92f/datafusion/optimizer/src/common_subexpr_eliminate.rs#L609-L622
   The problem with this approach is that `CommonSubexprEliminate` is not a 
`ApplyOrder::TopDown` type optimizer rule, but it organizes its traversal via 
calling `self.rewrite()` in `rewrite_expr()`. So if the top level node doesn't 
contain any common expressions the nodes below will not be considered for CSE.
   2. The current rule is not optimal as extracted common expressions are not 
considered for sub-expression elimination in the current rule exection. This is 
because the rule recurses into the child plan nodes with `self.rewrite()` and 
then adds the new projection from the extracted common expressions:
   
https://github.com/apache/datafusion/blob/204e1bcbaa5fcd3cf3cbe045f9832ee2f669e92f/datafusion/optimizer/src/common_subexpr_eliminate.rs#L278-L285
   The issue with this approach is that even the new projection can contains 
sub-expressions to eliminate.
   E.g. a plan like 
      ``` 
      Projection: (test.a + test.b) * (test.a + test.b) AS c1, (test.a + 
test.b) * (test.a + test.b) AS c2
          ...
      ```
      can be rewritten to:
      ```
      Projection: __common_expr_1 AS c1, __common_expr_1 AS c2
          Projection: __common_expr_2 * __common_expr_2 AS __common_expr_1, 
test.a, test.b, test.c
              Projection: test.a + test.b AS __common_expr_2, test.a, test.b, 
test.c\
                  ...
      ```
      but the current rule requires 2 rule executions (optimizer cycles) to 
reach the final plan.
      This can be improved by swapping the order of adding the new project and 
calling `self.rewrite()`. (I.e. we can make the rule top-down like.)
   
   ## What changes are included in this PR?
   This PR:
   - Changes `rewrite_expr()` into `find_common_exprs()` to extract common 
sub-expressions and rewrite an expression list. The step of recursing into 
child plan nodes is moved out from this method. This way `find_common_exprs()` 
can safely leverage the boolean of `to_arrays()` to skip the 2nd traversal.
   - Refactors `try_unary_plan()`, `try_optimize_aggregate()` and 
`try_optimize_window()`.
   
   ## Are these changes tested?
   
   Yes, added new UTs.
   
   ## Are there any user-facing changes?
   
   Yes, it fixes a possible performance regression.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to