[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354590#comment-17354590
 ] 

L. C. Hsieh commented on SPARK-35564:
-------------------------------------

> I don't really think this is much of a corner case, but a common case of 
> using a when expression for data validation. Most of our ETL process comes 
> down to normalizing, cleaning, and validating strings, which at the end of 
> the day usually looks like:

This is a corner case because it simplifies other possible cases, although you 
might actually use this pattern in your ETL process.

For example, when we treat an always-evaluate-at-least-once and 
optionally-evaluate-at-least-once expression as subexpression, there are many 
expressions qualified for this. A child expression of the first predicate of 
when, if it is also part of any conditional predicate/value, might also be 
treated as subexpression. Finally we might end with tons of subexpressions like 
that to flood generated code.

On the other hand, how much gain we can get from this case? In the example, for 
the worst case we evaluate it twice, not 5 or 10 times. It may be just small 
piece of the entire ETL process. I feel it's not worth because we might pay a 
lot cost including making the code more complicated and creating tons of 
subexpressions, but in the end we only get a little bit from it and it is also 
only for a worst case.

> though currently higher order functions are always semantically different so 
> they don't get subexpressions regardless I think. That's something I plan to 
> look into as a follow up.

Oh, this is another issue. I noticed it last time when I worked on another PR 
recently, but don't have time to look at it yet.



> Support subexpression elimination for non-common branches of conditional 
> expressions
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-35564
>                 URL: https://issues.apache.org/jira/browse/SPARK-35564
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to