[
https://issues.apache.org/jira/browse/SPARK-35410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun resolved SPARK-35410.
-----------------------------------
Fix Version/s: 3.2.0
Assignee: L. C. Hsieh
Resolution: Fixed
This is resolved via https://github.com/apache/spark/pull/32559
> Unused subexpressions leftover in WholeStageCodegen subexpression elimination
> -----------------------------------------------------------------------------
>
> Key: SPARK-35410
> URL: https://issues.apache.org/jira/browse/SPARK-35410
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Adam Binford
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: codegen.txt
>
>
> Trying to understand and debug the performance of some of our jobs, I started
> digging into what the whole stage codegen code was doing. We use a lot of
> case when statements, and I found that there were a lot of unused sub
> expressions that were left over after the subexpression elimination, and it
> gets worse the more expressions you have chained. The simple example:
> {code:java}
> import org.apache.spark.sql.functions._
> import spark.implicits._
> val myUdf = udf((s: String) => {
> println("In UDF")
> s.toUpperCase
> })
> spark.range(5).select(when(length(myUdf($"id")) > 0,
> length(myUdf($"id")))).show()
> {code}
> Running the code, you'll see "In UDF" printed out 10 times. And if you change
> both to log(length(myUdf($"id")), "In UDF" will print out 20 times (one more
> for a cast from int to double and one more for the actual log calculation I
> think).
> In the codegen for this (without the log), there are these initial
> subexpressions:
> {code:java}
> /* 076 */ UTF8String project_subExprValue_0 =
> project_subExpr_0(project_expr_0_0);
> /* 077 */ int project_subExprValue_1 =
> project_subExpr_1(project_expr_0_0);
> /* 078 */ UTF8String project_subExprValue_2 =
> project_subExpr_2(project_expr_0_0);
> {code}
> project_subExprValue_0 and project_subExprValue_2 are never actually used, so
> it's properly resolving the two expressions and sharing the result of
> project_subExprValue_1, but it's not removing the other sub expression calls
> it seems like.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]