[GitHub] [spark] sunchao commented on a change in pull request #35574: [SPARK-38237][SQL][SS] Allow `HashPartitioning` to satisfy `ClusteredDistribution` only with full clustering keys

GitBox Sun, 20 Feb 2022 10:13:19 -0800


sunchao commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r810661151




##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -271,6 +279,17 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
   override def createShuffleSpec(distribution: ClusteredDistribution): ShuffleSpec =
     HashShuffleSpec(this, distribution)
 
+  /**
+   * Checks if [[HashPartitioning]] is partitioned on exactly same full `clustering` keys of
+   * [[ClusteredDistribution]].
+   */
+  def isPartitionedOnFullKeys(distribution: ClusteredDistribution): Boolean = {
+    expressions.length == distribution.clustering.length &&

Review comment:
       I'm not sure ordering matters here: is it common for data skew to be 
introduced by changing the order of the hash keys? I'd be surprised if the 
murmur3 hash exhibited this kind of property.
   
   This also makes it harder for the optimization to kick in (imagine users 
having to carefully align join or aggregation keys to the same order as the 
bucket keys in the table). It is also a behavior change for bucket joins, since 
Spark is currently more relaxed and reorders the hash keys w.r.t. the join keys 
in `EnsureRequirements.reorderJoinPredicates`.
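
   To make the ordering concern concrete, here is a minimal sketch of the two 
possible checks. It is not the PR's implementation: plain strings stand in for 
Catalyst `Expression`s, and `KeyMatchSketch`, `matchesInOrder` and 
`matchesAnyOrder` are hypothetical names used only for illustration.

```scala
// Illustrative only: strings stand in for Catalyst expressions, and these
// helper names are assumptions, not part of Spark or of this PR.
object KeyMatchSketch {
  // Order-sensitive: the partitioning keys must equal the clustering keys
  // position by position.
  def matchesInOrder(partitionKeys: Seq[String], clusteringKeys: Seq[String]): Boolean =
    partitionKeys.length == clusteringKeys.length &&
      partitionKeys.zip(clusteringKeys).forall { case (p, c) => p == c }

  // Order-insensitive: the same keys must appear on both sides, in any order.
  // Sorting (rather than converting to a Set) keeps duplicate keys significant.
  def matchesAnyOrder(partitionKeys: Seq[String], clusteringKeys: Seq[String]): Boolean =
    partitionKeys.sorted == clusteringKeys.sorted
}
```

   With bucket keys `(a, b)` and join keys `(b, a)`, the first check rejects 
the partitioning while the second accepts it, which is the case where users 
would otherwise have to reorder their keys by hand.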




