[GitHub] [spark] HeartSaVioR commented on a change in pull request #35574: [SPARK-38237][SQL][SS] Allow `HashPartitioning` to satisfy `ClusteredDistribution` only with full clustering keys

GitBox Sun, 20 Feb 2022 15:26:26 -0800


HeartSaVioR commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r810701781




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], 
numPartitions: Int)
           expressions.length == h.expressions.length && 
expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-        case ClusteredDistribution(requiredClustering, _) =>
-          expressions.forall(x => 
requiredClustering.exists(_.semanticEquals(x)))
+        case c @ ClusteredDistribution(requiredClustering, _) =>
+          if (SQLConf.get.requireAllClusterKeysForHashPartition) {

Review comment:
       > As a Spark developer, I was originally confused when seeing both 
HashClusteredDistribution and ClusteredDistribution and had to navigate the 
code base and reason about their behavior differences.
   
   The classdoc of two classes were very clear about differences. The confusion 
may come from the structure we have actual logic about requirement check for 
distribution in `partitioning` instead of distribution (I'm not an expert of 
this part, it might have to go with this way), but as long as the 
implementation matches up with classdoc, it is pretty clear.
   
   > Combined with the newly introduced config, a developer now has to remember 
parsing the value of the config and choose HashClusteredDistribution or 
ClusteredDistribution accordingly, which is some extra burden.
   
   Well, I'd say it is more extra burden if we have to expect two different 
requirements from ClusteredDistribution. Once we understand the difference 
between HashClusteredDistribution and ClusteredDistribution, it is obvious that 
we can easily infer the behavior from which class is used.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR commented on a change in pull request #35574: [SPARK-38237][SQL][SS] Allow `HashPartitioning` to satisfy `ClusteredDistribution` only with full clustering keys

Reply via email to