peter-toth opened a new pull request, #41677: URL: https://github.com/apache/spark/pull/41677
### What changes were proposed in this pull request?
This PR proposes a new way to do subexpression elimination in `EquivalentExpressions`. The main change is that `ExpressionStats` now stores the expected evaluation count of a subexpression split into two parts: `evalCount`, which records sure evaluations, and `conditionalEvalCount`, which records conditional evaluations. For the sake of simplicity, all branches are modelled with `0.5` probability in this PR.

Here are a few example expressions and the `ExpressionStats` of a non-leaf `c` expression in the equivalence maps built from them:

| Expression | `ExpressionStats` of `c` |
|---|---|
| `c` | `c -> (1 + 0.0)` |
| `c + c` | `c -> (2 + 0.0)` |
| `If(_, c, _)` | `c -> (0 + 0.5)` |
| `If(_, c + c, _)` | `c -> (0 + 1.0)` |
| `If(_, c, c)` | `c -> (1 + 0.0)` |
| `If(c, c, _)` | `c -> (1 + 0.5)` |

This PR:
- Fixes the handling of subexpressions that are surely evaluated only once but have a certain probability of being evaluated more often. These subexpressions are now considered common based on the newly introduced `spark.sql.subexpressionElimination.minExpectedConditionalEvaluationCount` config.
- Fixes the issue of branching groups in `CaseWhen` and `Coalesce` expressions. Branching groups were used for calculating common subexpressions in conditional branches, based on the idea that subexpressions that appear in all elements of a group are surely evaluated once. Take `CaseWhen(w1, t1, w2, t2, w3, t3, e)` as an example: the previously defined (`t1`, `t2`, `t3`, `e`) group made sense, but for some reason the (`w1`, `w2`, `w3`) group was also defined, which didn't make sense because `w1` was also considered always evaluated. Also, some other groups that would have made sense, such as (`t1`, `w2`) or (`t1`, `t2`, `w3`), were not defined. This PR completely removes branching groups and uses a new way to calculate `ExpressionStats`.

### Why are the changes needed?
Improve subexpression elimination.
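
For readers who want to check the `ExpressionStats` table in the description by hand, here is a toy Python model. It is only a sketch and not the Scala code in this PR: the tuple-based expression encoding, the helper names (`opaque`, `add`, `if_`, `stats`, `is_common`) and the rule for merging the two branches of an `If` are all assumptions chosen so that the six table rows come out as listed. In particular, the `is_common` comparison is a guess at how the new config might be applied; the description does not spell out the exact rule.

```python
BRANCH_PROBABILITY = 0.5  # the PR description models every branch with 0.5


def opaque(name):
    """An expression with no modelled children (hypothetical encoding)."""
    return ("opaque", name)


def add(left, right):
    return ("add", left, right)


def if_(predicate, then_branch, else_branch):
    return ("if", predicate, then_branch, else_branch)


def _merge(into, sub, sure, cond):
    old_sure, old_cond = into.get(sub, (0, 0.0))
    into[sub] = (old_sure + sure, old_cond + cond)


def stats(expr):
    """Map each subexpression to (eval_count, conditional_eval_count)."""
    result = {expr: (1, 0.0)}  # evaluating expr evaluates expr itself once
    kind = expr[0]
    if kind == "opaque":
        return result
    if kind == "add":
        # Both operands are always evaluated, so their counts simply add up.
        for child in expr[1:]:
            for sub, (sure, cond) in stats(child).items():
                _merge(result, sub, sure, cond)
        return result
    # kind == "if": the predicate is always evaluated.
    _, predicate, then_branch, else_branch = expr
    for sub, (sure, cond) in stats(predicate).items():
        _merge(result, sub, sure, cond)
    then_stats, else_stats = stats(then_branch), stats(else_branch)
    for sub in then_stats.keys() | else_stats.keys():
        then_sure, then_cond = then_stats.get(sub, (0, 0.0))
        else_sure, else_cond = else_stats.get(sub, (0, 0.0))
        # Assumed merge rule: evaluations present in both branches are sure,
        # everything else is weighted by the branch probability.
        sure = min(then_sure, else_sure)
        cond = (BRANCH_PROBABILITY * (then_sure - sure + then_cond)
                + BRANCH_PROBABILITY * (else_sure - sure + else_cond))
        _merge(result, sub, sure, cond)
    return result


def is_common(eval_count, conditional_eval_count, min_conditional):
    """Assumed rule: common if surely evaluated more than once, or surely
    evaluated once with enough expected conditional evaluations."""
    return eval_count > 1 or (
        eval_count == 1 and conditional_eval_count >= min_conditional
    )


c, p, x = opaque("c"), opaque("p"), opaque("x")
print(stats(if_(c, c, x))[c])  # matches the `If(c, c, _)` row: (1, 0.5)
```

Here `p` and `x` stand in for the `_` placeholders of the table. The model ignores `CaseWhen` and `Coalesce`; extending it to them would require further assumptions about how later conditions and values are weighted.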
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing and new UTs.

_Please note that this PR is still WIP; I will add more conditional tests..._

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
