[PR] [WIP][SPARK-53399][SQL] Improve CollapseProject [spark]

via GitHub Wed, 27 Aug 2025 13:14:26 -0700


peter-toth opened a new pull request, #52149:
URL: https://github.com/apache/spark/pull/52149


   ### What changes were proposed in this pull request?
   
   Latest improvements to `CollapseProject` rule (like 
https://github.com/apache/spark/pull/33958) prevented duplicating expressive 
expressons, which brings considerable performance improvement in many cases, 
but there is one particular case when it can introduce significant perfoamance 
degradation. Consider a query where the adjacent project nodes don't get 
collapsed because they contain expensive, multiple referenced expressions, but 
the nodes also contain Python UDF expressions that otherwise wouldn't prevent 
project node collapsion. E.g.:
   ```
   Project a + a as a_plus_a, PythonUDF(...) as udf2, udf1
     Project <expensive calculation> as a, PythonUDF(...) as udf1
       ...
   ```
   In the above example the `CollapseProject` don't modify the 2 project nodes, 
which then causes 2 `BatchEvalPython` nodes to appear in the plan when 
`ExtractPythonUDFs` extracts them:
   ```
   Project a + a as a_plus_a, udf2, udf1
     BatchEvalPython PythonUDF(...) -> udf2
       Project <expensive calculation> as a, udf1
         BatchEvalPython PythonUDF(...) -> udf1
           ...
   ```
   The 2 `BatchEvalPython` nodes can cause significant 
serialization/deserialization overhead compared to the case when the original 
project nodes were collapsed and we had only 1 `BatchEvalPython` node.
   
   The old behaviour can be restored with setting 
`spark.sql.optimizer.avoidCollapseUDFWithExpensiveExpr=true`, but it is still 
not ideal as we lose the perforamnce improvement in other cases.
   
   This PR improves to the `CollapseProject` rule to not just collapse or don't 
collapse adjacent nodes, but be able to decide on individual expressions if it 
makes sense to merge them with expressions from the other node.
   
   After this PR the new `CollapseProject` rule modifies the above plan as:
   ```
   Project a + a as a_plus_a, PythonUDF(...) as udf2, PythonUDF(...) as udf1
     Project <expensive calculation> as a
       ...
   ```
   that is then transformed to:
   ```
   Project a + a as a_plus_a, udf2, udf1
     BatchEvalPython PythonUDF(...) -> udf2, PythonUDF(...) -> udf1
       Project <expensive calculation> as a
         ...
   ```
   
   ### Why are the changes needed?
   
   To fix performance regression caused by `CollapseProject`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New and existing UTs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [WIP][SPARK-53399][SQL] Improve CollapseProject [spark]

Reply via email to