sunchao commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r811391237
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
         expressions.length == h.expressions.length &&
           expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-      case ClusteredDistribution(requiredClustering, _) =>
-        expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+      case c @ ClusteredDistribution(requiredClustering, _) =>
+        if (SQLConf.get.requireAllClusterKeysForHashPartition) {
Review comment:
Maybe it's just me, but IIRC the docs were not crystal clear. First of all, the `Hash` prefix doesn't clearly convey the distinction between the two. Yes, it suggests that `HashClusteredDistribution` can only be satisfied by `HashPartitioning`, but how does that relate to the fact that Spark requires all partition keys to match for `HashClusteredDistribution` while only requiring a partial match for `ClusteredDistribution`? Also, even though the doc says "This is a strictly stronger guarantee than `ClusteredDistribution`", the "strictly" here is a bit vague, and I had to navigate to `HashPartitioning.satisfies` to find out the differences. There was also no context on why `HashClusteredDistribution` is only used in joins while `ClusteredDistribution` is used in aggregates, etc.
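To make the difference concrete, here is a minimal, self-contained plain-Scala sketch of the two satisfaction rules. The names are illustrative only: `String` stands in for Spark's `Expression`, and `==` stands in for `semanticEquals`; this is not Spark's actual API.

```scala
object SatisfiesSketch extends App {
  // HashClusteredDistribution-style check: every required key must match,
  // one-to-one and in order.
  def satisfiesHashClustered(partKeys: Seq[String], required: Seq[String]): Boolean =
    partKeys.length == required.length &&
      partKeys.zip(required).forall { case (l, r) => l == r }

  // ClusteredDistribution-style check: the partitioning keys only need to be
  // a subset of the required clustering keys.
  def satisfiesClustered(partKeys: Seq[String], required: Seq[String]): Boolean =
    partKeys.forall(required.contains)

  val partKeys = Seq("a")      // data hash-partitioned on column a only
  val required = Seq("a", "b") // operator requires clustering on a and b

  println(satisfiesHashClustered(partKeys, required)) // false: full match needed
  println(satisfiesClustered(partKeys, required))     // true: partial match suffices
}
```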
> Well, I'd say it is more extra burden if we have to expect two different requirements from ClusteredDistribution.
Again, think from a Spark developer's point of view (for instance, someone who's creating a new operator). They first have to remember that these two configs exist, understand the differences between the two distributions, and then write code to choose between them accordingly, when in most cases this should just be transparent. This applies not only to developers on Spark itself but also to those developing Spark extensions that replace Spark operators with their own.
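As a hedged illustration of that burden (the operator and type names below are hypothetical stand-ins, not Spark's real `SparkPlan`/`Distribution` API), a new operator's author today effectively has to encode this choice themselves:

```scala
// Hypothetical sketch only: simplified stand-ins for Spark's Distribution types.
sealed trait Distribution
case class ClusteredDistribution(keys: Seq[String]) extends Distribution
case class HashClusteredDistribution(keys: Seq[String]) extends Distribution

// A hypothetical new physical operator: its author must know that join-like
// operators traditionally ask for the stricter distribution (full key match),
// while aggregate-like operators are fine with a partial match.
case class MyNewOperator(clusterKeys: Seq[String], joinLike: Boolean) {
  def requiredChildDistribution: Distribution =
    if (joinLike) HashClusteredDistribution(clusterKeys)
    else ClusteredDistribution(clusterKeys)
}
```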
By limiting the behavior to the config and `Partitioning` etc., the audience affected by this change is much smaller: `Partitioning` is essentially a sealed trait, so one can only add new subclasses within `partitioning.scala`. And yes, I'd certainly remember to check these for the `DataSourcePartitioning` I'm working on :)
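For reference, a toy sketch (not the real `partitioning.scala`) of what the sealed-trait point buys us:

```scala
// Toy sketch: in Scala, all subclasses of a sealed trait must be declared in
// the file that declares the trait.
sealed trait Partitioning
case class HashPartitioning(keys: Seq[String]) extends Partitioning
// A new case such as DataSourcePartitioning could only be added here, inside
// this same file, so its author necessarily sees the existing satisfies
// logic and the config that governs it.
```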
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]