[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

L. C. Hsieh (Jira) Sun, 06 Jun 2021 12:50:04 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358209#comment-17358209
 ]


L. C. Hsieh edited comment on SPARK-35564 at 6/6/21, 7:49 PM:
--------------------------------------------------------------

For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), 
coalesce($"id" + 1, myUdf($"id"))).show()}}, pulling out a subexpression that 
might not be executed for a given row looks like a possible performance issue, 
but not a bug. Unlike the else value in when, coalesce is not a conditional 
expression; it assumes all of its arguments can be evaluated without problems.
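
To make the cost concrete, here is a minimal plain-Python sketch of the case above. It is not Spark code: `coalesce`, `my_udf`, and the `calls` counter are hypothetical stand-ins, and the sketch assumes coalesce evaluates its arguments lazily, left to right, stopping at the first non-null value.

```python
# Plain-Python model (not Spark): arguments are passed as thunks so that
# evaluation can be lazy, which is the assumption this sketch relies on.
calls = {"my_udf": 0}

def my_udf(x):
    """Hypothetical stand-in for an expensive UDF; counts its invocations."""
    calls["my_udf"] += 1
    return None if x is None else x * 10

def coalesce(*thunks):
    """Return the first non-None value, evaluating thunks left to right."""
    for thunk in thunks:
        value = thunk()
        if value is not None:
            return value
    return None

def row_without_elimination(id_):
    # Each coalesce reaches my_udf only if its first argument is None.
    a = coalesce(lambda: id_, lambda: my_udf(id_))
    b = coalesce(lambda: None if id_ is None else id_ + 1, lambda: my_udf(id_))
    return a, b

def row_with_elimination(id_):
    # The shared my_udf(id_) is hoisted and computed once up front, even for
    # rows where neither coalesce would have reached it.
    shared = my_udf(id_)
    a = coalesce(lambda: id_, lambda: shared)
    b = coalesce(lambda: None if id_ is None else id_ + 1, lambda: shared)
    return a, b
```

For the two non-null rows of range(2), both versions return the same values, so the result is unchanged; the hoisted version simply calls my_udf once per row where the lazy version never calls it.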


was (Author: viirya):
For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), 
coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be 
performance issue by pulling a subexpr that might not be executed for a row but 
not a bug. But different to elsevalue in when, coalesce is not a condition 
expression, it supposes all arguments can be executed without problem.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-35564
>                 URL: https://issues.apache.org/jira/browse/SPARK-35564
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
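> {code:python}
> from pyspark.sql.functions import length, regexp_replace, when
> # Assumes an active SparkSession named `spark`, as in the pyspark shell.
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> # Keep the digit-stripped value only when it is non-empty, else null.
> df = df.withColumn('numbers_removed', when(length(col) > 0, col))
> {code}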
> We only want to keep the value if it is non-empty after the numbers are 
> removed; otherwise we want it to be null. 
> Because there is no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to treat it as a 
> subexpression, since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially one extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.
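
The difference between the two rules can be sketched on a toy expression tree. This is a plain-Python illustration under stated assumptions, not Spark's actual implementation: the tuple encoding and the names `eval_counts`, `old_rule`, and `proposed_rule` are hypothetical, and only a `when` with no otherwise value is modeled.

```python
from collections import Counter

# Expressions are nested tuples: ("col", name), ("lit", value),
# ("call", fn_name, *args), or ("when", cond, value) with no otherwise.

def eval_counts(expr):
    """Return (guaranteed, possible) per-row evaluation counts per subexpression."""
    guaranteed, possible = Counter(), Counter()
    kind = expr[0]
    if kind in ("col", "lit"):
        return guaranteed, possible  # leaves are not worth caching
    if kind == "call":
        children = [(arg, True) for arg in expr[2:]]  # all args always run
    elif kind == "when":
        # The condition always runs; the branch value runs only if it holds.
        children = [(expr[1], True), (expr[2], False)]
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    guaranteed[expr] += 1
    possible[expr] += 1
    for child, always in children:
        g, p = eval_counts(child)
        if always:
            guaranteed += g
        possible += p
    return guaranteed, possible

def old_rule(expr):
    """Subexpressions that are always evaluated at least twice."""
    g, _ = eval_counts(expr)
    return {e for e, n in g.items() if n >= 2}

def proposed_rule(expr):
    """Always evaluated at least once AND possibly evaluated at least twice."""
    g, p = eval_counts(expr)
    return {e for e in p if g[e] >= 1 and p[e] >= 2}

# The data validation example from the description.
col = ("call", "regexp_replace", ("col", "_1"), ("lit", r"\d"), ("lit", ""))
expr = ("when", ("call", ">", ("call", "length", col), ("lit", 0)), col)
```

On this example, `old_rule(expr)` selects nothing because `col` is guaranteed only once (in the condition), while `proposed_rule(expr)` selects `col` because it may be evaluated a second time in the branch value.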



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

