[
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358275#comment-17358275
]
Adam Binford edited comment on SPARK-35564 at 6/6/21, 10:50 PM:
----------------------------------------------------------------
No the values are fine, it's the tail conditions that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)),
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never
should be evaluated.
was (Author: kimahriman):
No the values are fine, it's the condition that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)),
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never
should be evaluated.
> Support subexpression elimination for non-common branches of conditional
> expressions
> ------------------------------------------------------------------------------------
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Adam Binford
> Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling
> subexpressions out of branches of conditional expressions for expressions
> present in all branches. We should be able to take this a step further and
> pull out subexpressions for any branch, as long as that expression will
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed,
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for
> subexpression elimination (you can see two regular expression replacements in
> the codegen). But whenever the length is greater than 0, we will have to
> execute the regular expression replacement twice. Since we know we will
> always calculate `col` at least once, it makes sense to consider that as a
> subexpression since we might need it again in the branch value. So we can
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split
> subexpressions) if the second evaluation doesn't happen, but this seems like
> it would be worth it for when it is evaluated the second time.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]