[ https://issues.apache.org/jira/browse/SPARK-22270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-22270:
---------------------------------
Labels: bulk-closed (was: )
> Renaming DF column breaks sparkPlan.outputOrdering
> --------------------------------------------------
>
> Key: SPARK-22270
> URL: https://issues.apache.org/jira/browse/SPARK-22270
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Yuri Bogomolov
> Priority: Major
> Labels: bulk-closed
>
> Renaming a column doesn't update the plan's ordering/distribution metadata. This can
> cause unnecessary data shuffles and significantly degrade performance.
> {code:java}
> val df = spark.sqlContext.range(0, 10)
> val sorted = df.sort("id")
> val renamed = sorted.withColumnRenamed("id", "id2")
> val sortedAgain = renamed.sort("id2")
> sortedAgain.explain(true)
> == Analyzed Logical Plan ==
> id2: bigint
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Sort [id2#6L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
>    +- *Project [id#0L AS id2#6L]
>       +- *Sort [id#0L ASC NULLS FIRST], true, 0
>          +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>             +- *Range (0, 10, step=1, splits=4)
> {code}
> The physical plan sorts (and range-partitions) the dataset twice, even though renaming the column does not change its order.
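
The behaviour described above can also be checked programmatically rather than by reading explain() output. The following is a minimal sketch, not part of the original report: it assumes a local SparkSession (the master setting, app name, and object name are illustrative) and uses queryExecution.sparkPlan.outputOrdering, the property named in the issue title, to compare the ordering reported before and after the rename. It deliberately prints the values instead of asserting on them, since the result depends on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SortExec

// Hypothetical driver object; the name is illustrative only.
object OutputOrderingCheck {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-22270-check")
      .getOrCreate()

    val sorted  = spark.range(0, 10).sort("id")
    val renamed = sorted.withColumnRenamed("id", "id2")

    // Ordering reported by the physical plan before and after the rename.
    // If the rename is tracked, the second ordering should refer to id2.
    println(s"sorted:  ${sorted.queryExecution.sparkPlan.outputOrdering}")
    println(s"renamed: ${renamed.queryExecution.sparkPlan.outputOrdering}")

    // Count Sort operators planned for the re-sorted dataset.
    val sortCount = renamed.sort("id2").queryExecution.sparkPlan.collect {
      case s: SortExec => s
    }.size
    println(s"SortExec nodes after re-sorting: $sortCount")

    spark.stop()
  }
}
```

If the second ordering still refers to the original id attribute, or the re-sorted plan contains more than one SortExec, the redundant sort from the report is present on that version.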