[
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358209#comment-17358209
]
L. C. Hsieh edited comment on SPARK-35564 at 6/6/21, 7:49 PM:
--------------------------------------------------------------
For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")),
coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be
performance issue by pulling a subexpr that might not be executed for a row but
not a bug. Different to elsevalue in when, coalesce is not a condition
expression, it supposes all arguments can be executed without problem.
was (Author: viirya):
For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")),
coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be
performance issue by pulling a subexpr that might not be executed for a row but
not a bug. But different to elsevalue in when, coalesce is not a condition
expression, it supposes all arguments can be executed without problem.
> Support subexpression elimination for non-common branches of conditional
> expressions
> ------------------------------------------------------------------------------------
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Adam Binford
> Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling
> subexpressions out of branches of conditional expressions for expressions
> present in all branches. We should be able to take this a step further and
> pull out subexpressions for any branch, as long as that expression will
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed,
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for
> subexpression elimination (you can see two regular expression replacements in
> the codegen). But whenever the length is greater than 0, we will have to
> execute the regular expression replacement twice. Since we know we will
> always calculate `col` at least once, it makes sense to consider that as a
> subexpression since we might need it again in the branch value. So we can
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split
> subexpressions) if the second evaluation doesn't happen, but this seems like
> it would be worth it for when it is evaluated the second time.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]