[ https://issues.apache.org/jira/browse/SPARK-27707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun reassigned SPARK-27707: ------------------------------------- Assignee: Liang-Chi Hsieh > Performance issue using explode > ------------------------------- > > Key: SPARK-27707 > URL: https://issues.apache.org/jira/browse/SPARK-27707 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Ohad Raviv > Assignee: Liang-Chi Hsieh > Priority: Major > > this is a corner case of SPARK-21657. > we have a case where we want to explode array inside a struct and also keep > some other columns of the struct. we again encounter a huge performance issue. > reconstruction code: > {code} > val df = spark.sparkContext.parallelize(Seq(("1", > Array.fill(M)({ > val i = math.random > (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString) > })))).toDF("col", "arr") > .selectExpr("col", "struct(col, arr) as st") > .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col") > df.write.mode("overwrite").save("/tmp/blah") > {code} > a workaround is projecting before the explode: > {code} > val df = spark.sparkContext.parallelize(Seq(("1", > Array.fill(M)({ > val i = math.random > (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString) > })))).toDF("col", "arr") > .selectExpr("col", "struct(col, arr) as st") > .withColumn("col1", $"st.col") > .selectExpr("col", "col1", "explode(st.arr) as arr_col") > df.write.mode("overwrite").save("/tmp/blah") > {code} > in this case the optimization done in SPARK-21657: > {code} > // prune unrequired references > case p @ Project(_, g: Generate) if p.references != g.outputSet => > val requiredAttrs = p.references -- g.producedAttributes ++ > g.generator.references > val newChild = prunedChild(g.child, requiredAttrs) > val unrequired = g.generator.references -- p.references > val unrequiredIndices = newChild.output.zipWithIndex.filter(t => > unrequired.contains(t._1)) > .map(_._2) > p.copy(child = g.copy(child = newChild, unrequiredChildIndex = > unrequiredIndices)) > {code} > doesn't work because `p.references` has whole the `st` struct as reference > and not just the projected field. > this causes the entire struct including the huge array field to get > duplicated as the number of array elements. > I know this is kind of a corner case but was really non trivial to > understand.. -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org