c21 commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r812395628
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression],
numPartitions: Int)
expressions.length == h.expressions.length &&
expressions.zip(h.expressions).forall {
case (l, r) => l.semanticEquals(r)
}
- case ClusteredDistribution(requiredClustering, _) =>
- expressions.forall(x =>
requiredClustering.exists(_.semanticEquals(x)))
+ case c @ ClusteredDistribution(requiredClustering, _) =>
+ if (SQLConf.get.requireAllClusterKeysForHashPartition) {
Review comment:
> Shuffle may not be something we should try hard to eliminate at all.
We also need to think when the shuffle would be likely help. We can't leverage
stats in physical execution, so my fall back goes to be heuristic, like the
different view and resolution on this problem
https://github.com/apache/spark/pull/35574#issuecomment-1047303366.
> Instead, having a threshold (minimum) of the number of partitions doesn't
sound crazy for me. The threshold could be heuristic one, or config - number or
ratio compared to the default number of shuffle partitions, or default number
of shuffle partitions if we wouldn't want to bring another config (but it may
be too high to use for minimum).
@HeartSaVioR - First I agree with you sometime shuffle is good to have, so I
guess this PR is aiming for the same goal - add proper shuffle on full
clustering keys based on config, right? Just for my understanding, are you
proposing above to have a config to set the minimal threshold for number of
partitions for all queries needed shuffle? Can you elaborate more how query and
which part of query would be rewritten if violating the config? With
cardinality stats from CBO and AQE (we don't have cardinality stats collected
in AQE for now), we may potentially give some hint in query plan during logical
planning. But this approach still sounds a little bit too high level for me,
without elaborating details of algorithm.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]