sunchao commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r810661151
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -271,6 +279,17 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
override def createShuffleSpec(distribution: ClusteredDistribution): ShuffleSpec =
HashShuffleSpec(this, distribution)
+ /**
+ * Checks if [[HashPartitioning]] is partitioned on exactly same full `clustering` keys of
+ * [[ClusteredDistribution]].
+ */
+ def isPartitionedOnFullKeys(distribution: ClusteredDistribution): Boolean = {
+ expressions.length == distribution.clustering.length &&
Review comment:
I'm not sure ordering matters here: is it common for data skew to be
introduced by changing the order of the hash keys? I'd be surprised if the
murmur3 hash exhibited this kind of property.

This also makes it harder for the optimization to kick in (imagine users having
to carefully align join or aggregation keys to the same order as the bucket keys
in the table). It is also a behavior change for bucket join, since Spark
currently reorders the hash keys w.r.t. join keys in
`EnsureRequirements.reorderJoinPredicates`.
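
To make the concern concrete, here is a minimal sketch of the difference between an order-sensitive and an order-insensitive key check, plus a reordering step in the spirit of what the comment describes. It is not Spark's implementation: plain strings stand in for Catalyst expressions, and `KeyOrderSketch`, `sameKeysSameOrder`, `sameKeysAnyOrder`, and `reorderToMatch` are hypothetical names made up for illustration.

```scala
// Hypothetical stand-in: strings play the role of Catalyst expressions.
object KeyOrderSketch {

  // Order-sensitive: keys must match position by position.
  def sameKeysSameOrder(partitionKeys: Seq[String], clusteringKeys: Seq[String]): Boolean =
    partitionKeys.length == clusteringKeys.length &&
      partitionKeys.zip(clusteringKeys).forall { case (p, c) => p == c }

  // Order-insensitive: same keys (with multiplicity), in any order.
  def sameKeysAnyOrder(partitionKeys: Seq[String], clusteringKeys: Seq[String]): Boolean =
    partitionKeys.sorted == clusteringKeys.sorted

  // Reorder paired join keys so the left side follows the partitioning order.
  // Returns None when the left keys are not a permutation of the partition
  // keys, or when left keys repeat (the pairing would be ambiguous).
  def reorderToMatch(
      leftKeys: Seq[String],
      rightKeys: Seq[String],
      partitionKeys: Seq[String]): Option[(Seq[String], Seq[String])] = {
    val ambiguous = leftKeys.distinct.length != leftKeys.length
    if (leftKeys.length != rightKeys.length || ambiguous ||
        !sameKeysAnyOrder(leftKeys, partitionKeys)) {
      None
    } else {
      val rightByLeft = leftKeys.zip(rightKeys).toMap
      Some((partitionKeys, partitionKeys.map(rightByLeft)))
    }
  }
}
```

With a table bucketed on `(a, b)` and a join written on `(b, a)`, the order-sensitive check rejects the match while the order-insensitive one accepts it, and `reorderToMatch(Seq("b", "a"), Seq("y", "x"), Seq("a", "b"))` realigns both sides to the bucket order.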