[GitHub] [spark] c21 commented on a change in pull request #35574: [SPARK-38237][SQL][SS] Allow `HashPartitioning` to satisfy `ClusteredDistribution` only with full clustering keys

GitBox Tue, 22 Feb 2022 14:03:29 -0800


c21 commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r812395628




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], 
numPartitions: Int)
           expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-        case ClusteredDistribution(requiredClustering, _) =>
-          expressions.forall(x => 
requiredClustering.exists(_.semanticEquals(x)))
+        case c @ ClusteredDistribution(requiredClustering, _) =>
+          if (SQLConf.get.requireAllClusterKeysForHashPartition) {

Review comment:
       > Shuffle may not be something we should try hard to eliminate at all. 
We also need to think when the shuffle would be likely help. We can't leverage 
stats in physical execution, so my fall back goes to be heuristic, like the 
different view and resolution on this problem 
https://github.com/apache/spark/pull/35574#issuecomment-1047303366.
   
   > Instead, having a threshold (minimum) of the number of partitions doesn't 
sound crazy for me. The threshold could be heuristic one, or config - number or 
ratio compared to the default number of shuffle partitions, or default number 
of shuffle partitions if we wouldn't want to bring another config (but it may 
be too high to use for minimum).
   
   @HeartSaVioR - First I agree with you sometime shuffle is good to have, so I 
guess this PR is aiming for the same goal - add proper shuffle on full 
clustering keys based on config, right? Just for my understanding, are you 
proposing above to have a config to set the minimal threshold for number of 
partitions for all queries needed shuffle? Can you elaborate more how query and 
which part of query would be rewritten if violating the config? With 
cardinality stats from CBO and AQE (we don't have cardinality stats collected 
in AQE for now), we may potentially give some hint in query plan during logical 
planning. But this approach still sounds a little bit too high level for me, 
without elaborating details of algorithm.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] c21 commented on a change in pull request #35574: [SPARK-38237][SQL][SS] Allow `HashPartitioning` to satisfy `ClusteredDistribution` only with full clustering keys

Reply via email to