David Vogelbacher created SPARK-24983:
-----------------------------------------

             Summary: Collapsing multiple project statements with dependent 
When-Otherwise statements on the same column can OOM the driver
                 Key: SPARK-24983
                 URL: https://issues.apache.org/jira/browse/SPARK-24983
             Project: Spark
          Issue Type: Bug
          Components: Optimizer
    Affects Versions: 2.3.1
            Reporter: David Vogelbacher


Hi,
I noticed that writing a spark job that includes many sequential when-otherwise 
statements on the same column can easily OOM the driver while generating the 
optimized plan because the project node will grow exponentially in size.

Example:
{noformat}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq("a", "b", "c", "1").toDF("text")
df: org.apache.spark.sql.DataFrame = [text: string]

scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: 
string]

scala> for( a <- 1 to 5) {
     |     dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === 
lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
     | }

scala> dfCaseWhen.queryExecution.analyzed
res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
+- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
   +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
      +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
         +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
            +- Filter NOT (text#3 = 0)
               +- Project [value#1 AS text#3]
                  +- LocalRelation [value#1]

scala> dfCaseWhen.queryExecution.optimizedPlan
res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE 
value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 
ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 
END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN 
(value#1 = 1) THEN r1 ELSE va...
{noformat}

As one can see the optimized plan grows exponentially in the number of 
{{when-otherwise}} statements here.

I can see that this comes from the {{CollapseProject}} optimizer rule.
Maybe we should put a limit on the resulting size of the project node after 
collapsing and only collapse if we stay under the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to