HeartSaVioR commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r812324913
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
         expressions.length == h.expressions.length &&
           expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-      case ClusteredDistribution(requiredClustering, _) =>
-        expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+      case c @ ClusteredDistribution(requiredClustering, _) =>
+        if (SQLConf.get.requireAllClusterKeysForHashPartition) {
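
The `+` branch above is truncated in this excerpt. Going only by the flag name, a minimal self-contained sketch of the semantics it suggests (hypothetical; `Expr` and `satisfiesClustered` are invented stand-ins, not the PR's code) might be:

```scala
// Hypothetical sketch, NOT the PR's implementation: with the flag enabled,
// every required clustering key must be matched by a partitioning expression.
object AllClusterKeysSketch {
  // Stand-in for Catalyst's Expression/semanticEquals, to keep this runnable.
  final case class Expr(canonical: String) {
    def semanticEquals(other: Expr): Boolean = canonical == other.canonical
  }

  def satisfiesClustered(
      partitioningExprs: Seq[Expr],
      requiredClustering: Seq[Expr],
      requireAllClusterKeys: Boolean): Boolean = {
    if (requireAllClusterKeys) {
      // Strict mode: the partitioning keys must cover ALL clustering keys,
      // not merely be a subset of them.
      partitioningExprs.length == requiredClustering.length &&
        requiredClustering.forall(k => partitioningExprs.exists(_.semanticEquals(k)))
    } else {
      // Relaxed mode (the pre-existing behavior on the `-` lines): any
      // subset of the clustering keys is enough, which avoids a shuffle
      // but can concentrate data on a low-cardinality key.
      partitioningExprs.forall(x => requiredClustering.exists(_.semanticEquals(x)))
    }
  }

  def main(args: Array[String]): Unit = {
    val part = Seq(Expr("a"))
    val req  = Seq(Expr("a"), Expr("b"))
    println(satisfiesClustered(part, req, requireAllClusterKeys = false)) // true
    println(satisfiesClustered(part, req, requireAllClusterKeys = true))  // false
  }
}
```

The trade-off being weighed: strict matching forces a shuffle more often, while the relaxed subset check avoids shuffles at the risk of skew when the matched subset has low cardinality.
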
Review comment:
In a totally ideal world, the previous stage could calculate the cardinality
of all the grouping keys required by the next stage, and once the previous
stage finishes, the query executor could decide how to split the next stage
based on the difference between that cardinality and the desired number of
partitions (or even the number of values bound to each key). This is totally
idealized and I don't know whether it is even feasible. But if we agree that
this would be ideal, then we also agree that a shuffle is not always an evil.
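
That idea reads roughly like the sketch below (purely illustrative; `decide`, `Repartition`, and the comparison rule are invented here and are not Spark APIs): once the previous stage has run, compare the observed grouping-key cardinality against the current and maximum partition counts, and only shuffle when the data is wide enough to benefit.

```scala
// Illustrative toy only: none of these names exist in Spark, and the
// decision rule is made up for the example.
object AdaptiveShuffleSketch {
  sealed trait Decision
  case object KeepCurrentPartitioning extends Decision
  final case class Repartition(numPartitions: Int) extends Decision

  // Decide how the next stage should be partitioned, given the grouping-key
  // cardinality observed while the previous stage ran.
  def decide(
      observedKeyCardinality: Long,
      currentNumPartitions: Int,
      maxPartitions: Int): Decision = {
    if (observedKeyCardinality <= currentNumPartitions) {
      // Few distinct keys: extra partitions would mostly sit empty, so the
      // existing (possibly subset-keyed) partitioning is good enough.
      KeepCurrentPartitioning
    } else {
      // Many distinct keys: pay for a shuffle and spread them wider,
      // capped at the configured maximum.
      Repartition(math.min(observedKeyCardinality, maxPartitions.toLong).toInt)
    }
  }

  def main(args: Array[String]): Unit = {
    println(decide(observedKeyCardinality = 10L, currentNumPartitions = 200, maxPartitions = 400))
    println(decide(observedKeyCardinality = 100000L, currentNumPartitions = 200, maxPartitions = 400))
  }
}
```

Whether anything like this is feasible in practice is exactly the open question in the comment; the point is only that once key cardinality is known, choosing to shuffle can be the cheaper option.
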
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]