peter-toth opened a new pull request, #52149: URL: https://github.com/apache/spark/pull/52149
### What changes were proposed in this pull request? Latest improvements to `CollapseProject` rule (like https://github.com/apache/spark/pull/33958) prevented duplicating expressive expressons, which brings considerable performance improvement in many cases, but there is one particular case when it can introduce significant perfoamance degradation. Consider a query where the adjacent project nodes don't get collapsed because they contain expensive, multiple referenced expressions, but the nodes also contain Python UDF expressions that otherwise wouldn't prevent project node collapsion. E.g.: ``` Project a + a as a_plus_a, PythonUDF(...) as udf2, udf1 Project <expensive calculation> as a, PythonUDF(...) as udf1 ... ``` In the above example the `CollapseProject` don't modify the 2 project nodes, which then causes 2 `BatchEvalPython` nodes to appear in the plan when `ExtractPythonUDFs` extracts them: ``` Project a + a as a_plus_a, udf2, udf1 BatchEvalPython PythonUDF(...) -> udf2 Project <expensive calculation> as a, udf1 BatchEvalPython PythonUDF(...) -> udf1 ... ``` The 2 `BatchEvalPython` nodes can cause significant serialization/deserialization overhead compared to the case when the original project nodes were collapsed and we had only 1 `BatchEvalPython` node. The old behaviour can be restored with setting `spark.sql.optimizer.avoidCollapseUDFWithExpensiveExpr=true`, but it is still not ideal as we lose the perforamnce improvement in other cases. This PR improves to the `CollapseProject` rule to not just collapse or don't collapse adjacent nodes, but be able to decide on individual expressions if it makes sense to merge them with expressions from the other node. After this PR the new `CollapseProject` rule modifies the above plan as: ``` Project a + a as a_plus_a, PythonUDF(...) as udf2, PythonUDF(...) as udf1 Project <expensive calculation> as a ... ``` that is then transformed to: ``` Project a + a as a_plus_a, udf2, udf1 BatchEvalPython PythonUDF(...) -> udf2, PythonUDF(...) -> udf1 Project <expensive calculation> as a ... ``` ### Why are the changes needed? To fix performance regression caused by `CollapseProject`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New and existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org