[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

Adam Binford (Jira) Sun, 06 Jun 2021 15:51:06 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358275#comment-17358275
 ]


Adam Binford edited comment on SPARK-35564 at 6/6/21, 10:50 PM:
----------------------------------------------------------------

No the values are fine, it's the tail conditions that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), 
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never 
should be evaluated.


was (Author: kimahriman):
No the values are fine, it's the condition that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), 
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never 
should be evaluated.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-35564
>                 URL: https://issues.apache.org/jira/browse/SPARK-35564
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

Reply via email to