ulysses-you commented on code in PR #38558:
URL: https://github.com/apache/spark/pull/38558#discussion_r1017369252


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##########
@@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec(
 
   override def output: Seq[Attribute] = inputPlan.output
 
+  // Try our best to give a stable output partitioning and ordering.

Review Comment:
   Unfortunately we do hit this.. in my experience, users often cache an arbitrary df and then use the cached df to build another arbitrary df. So why can't we preserve the partitioning/ordering of the cached plan? If you really feel this is inconsistent in AdaptiveSparkPlanExec, we can probably move it to InMemoryRelationExec.
   
   My original idea was to do both, but that feels like a bit of an overkill (requiredOrdering should be inferred separately, as in https://github.com/apache/spark/pull/35924):
   ```scala
   
   requiredDistribution.map(_.createPartitioning(conf.shufflePartitions)).getOrElse {
     if (isFinalPlan) {
       executedPlan.outputPartitioning
     } else {
       super.outputPartitioning
     }
   }
   ```
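   For illustration, here is a minimal self-contained sketch of that fallback logic, with hypothetical stub types (`Partitioning`, `Distribution`, etc.) standing in for Spark's actual `org.apache.spark.sql.catalyst.plans.physical` classes:
   ```scala
   // Hypothetical stand-ins for Spark's physical-plan types, for illustration only.
   sealed trait Partitioning
   case class HashPartitioning(numPartitions: Int) extends Partitioning
   case object UnknownPartitioning extends Partitioning

   // A required distribution that knows how to create a concrete partitioning
   // for a given number of shuffle partitions.
   case class Distribution(create: Int => Partitioning)

   // Mirror of the proposed logic: prefer the partitioning derived from the
   // required distribution; otherwise fall back to the final plan's output
   // partitioning only once the plan is final.
   def outputPartitioning(
       requiredDistribution: Option[Distribution],
       shufflePartitions: Int,
       isFinalPlan: Boolean,
       finalPlanPartitioning: => Partitioning): Partitioning =
     requiredDistribution
       .map(_.create(shufflePartitions))
       .getOrElse(if (isFinalPlan) finalPlanPartitioning else UnknownPartitioning)
   ```
   The key point the sketch makes explicit is that before the final plan is materialized, we can only report an unknown partitioning, so any stable answer has to come from the required distribution.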
   A useful distribution before caching is rare in production, since `repartition(col)` will introduce skew.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

