[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354447#comment-17354447
 ] 

Adam Binford commented on SPARK-35564:
--------------------------------------

>Do you mean "Create a subexpression if an expression will always be evaluated 
>at least once AND will be evaluated at least once in conditional expression"?

Yeah you can think of it that way in terms of adding to existing functionality. 
I was trying to word it in a way that encompassed existing functionality as 
well.

>And this looks like a corner case, so I'm not sure if it is worth to do this.

I don't really think this is much of a corner case, but a common case of using 
a when expression for data validation. Most of our ETL process comes down to 
normalizing, cleaning, and validating strings, which at the end of the day 
usually looks like:
{code:java}
column = normalize_value(col('my_raw_value'))
result = when(column != '', column){code}
where "normalize_value" usually involves some combination of regexp_repace's, 
lower/upper, and trim.

And things get worse when you are dealing with arrays of strings and want to 
minimize your data:
{code:java}
column = filter(transform(col('my_raw_array_value'), lambda x: 
normalize_value(x)), lambda x: x != '')
result = when(size(column) > 0, column){code}
though currently higher order functions are always semantically different so 
they don't get subexpressions regardless I think. That's something I plan to 
look into as a follow up.

It's natural for users to think that these expressions only get evaluated once, 
and not that they are doubling their runtime trying to clean their data. To me 
the edge case is creating a subexpression in this case decreasing throughput. 
It would require a very large percentage of the rows to not pass the 
conditional check, since the additional calculation is much more expensive than 
the additional function call. I'm playing around with an implementation so 
we'll see how far I can get with it.

 

> Support subexpression elimination for non-common branches of conditional 
> expressions
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-35564
>                 URL: https://issues.apache.org/jira/browse/SPARK-35564
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to