ulysses-you commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017369252
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##########
@@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec(

   override def output: Seq[Attribute] = inputPlan.output

+  // Try our best to give a stable output partitioning and ordering.

Review Comment:
   Unfortunately we hit this. In my experience, users often cache an arbitrary df and use the cached df to build another arbitrary df. So why can't we preserve the partitioning/ordering of the cached plan? If you really feel this is inconsistent in AdaptiveSparkPlanExec, we can probably move it to InMemoryRelationExec. My original idea was to do both, but that feels a little like overkill (requiredOrdering should be inferred separately, as in https://github.com/apache/spark/pull/35924):

   ```scala
   requiredDistribution.map(_.createPartitioning(conf.shufflePartitions)).getOrElse {
     if (isFinalPlan) {
       executedPlan.outputPartitioning
     } else {
       super.outputPartitioning
     }
   }
   ```

   A useful distribution before caching is rare in production, since `repartition(col)` will introduce skew.
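The fallback chain in the reviewer's snippet can be sketched with plain Scala stand-ins (the types and names below are illustrative models, not Spark's actual API): a required distribution, when present, always determines the partitioning; otherwise the partitioning of the executed final plan is used once AQE has finished, and a default is reported while the plan is still adapting.

```scala
// Minimal sketch of the fallback chain, with hypothetical stand-ins for
// Spark's Partitioning hierarchy -- not the real API.
sealed trait Partitioning
case object UnknownPartitioning extends Partitioning
final case class HashPartitioning(numPartitions: Int) extends Partitioning

// requiredDistribution is modeled as an optional factory that, like the
// snippet's Distribution.createPartitioning, builds a partitioning from
// the configured number of shuffle partitions.
def resolveOutputPartitioning(
    requiredDistribution: Option[Int => Partitioning],
    shufflePartitions: Int,
    isFinalPlan: Boolean,
    finalPlanPartitioning: Partitioning,
    defaultPartitioning: Partitioning): Partitioning =
  requiredDistribution.map(_.apply(shufflePartitions)).getOrElse {
    if (isFinalPlan) finalPlanPartitioning // stable: read off the executed plan
    else defaultPartitioning               // plan may still change under AQE
  }
```

Because the required distribution branch is checked first, the reported partitioning stays the same before and after AQE finalizes the plan, which is the stability property the comment argues for.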
