sunchao commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r811391237
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
         expressions.length == h.expressions.length &&
           expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-      case ClusteredDistribution(requiredClustering, _) =>
-        expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+      case c @ ClusteredDistribution(requiredClustering, _) =>
+        if (SQLConf.get.requireAllClusterKeysForHashPartition) {
Review comment:
Maybe it's just me, but IIRC the docs were not crystal clear. First of all, the `Hash` prefix doesn't clearly convey the distinction between the two. Yes, it suggests that `HashClusteredDistribution` can only be satisfied by `HashPartitioning`, but how does that relate to the fact that Spark requires all partition keys to match for `HashClusteredDistribution` while only requiring a partial match for `ClusteredDistribution`? Also, even though the doc says "This is a strictly stronger guarantee than `ClusteredDistribution`", the "strictly" here is a bit vague, and I had to navigate to `HashPartitioning.satisfies` to find out the differences. There was also no context on why `HashClusteredDistribution` is only used in joins while `ClusteredDistribution` is used in aggregates, etc.
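To make the difference concrete, here is a minimal, self-contained plain-Scala sketch of the two satisfaction rules. The names are illustrative only: `String` stands in for Spark's `Expression`, and `==` stands in for `semanticEquals`; this is not Spark's actual API.

```scala
object SatisfiesSketch extends App {
  // HashClusteredDistribution-style check: every required key must match,
  // one-to-one and in order.
  def satisfiesHashClustered(partKeys: Seq[String], required: Seq[String]): Boolean =
    partKeys.length == required.length &&
      partKeys.zip(required).forall { case (l, r) => l == r }

  // ClusteredDistribution-style check: the partitioning keys only need to be
  // a subset of the required clustering keys.
  def satisfiesClustered(partKeys: Seq[String], required: Seq[String]): Boolean =
    partKeys.forall(required.contains)

  val partKeys = Seq("a")      // data hash-partitioned on column a only
  val required = Seq("a", "b") // operator requires clustering on a and b

  println(satisfiesHashClustered(partKeys, required)) // false: full match needed
  println(satisfiesClustered(partKeys, required))     // true: partial match suffices
}
```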
> Well, I'd say it is more extra burden if we have to expect two different requirements from ClusteredDistribution.
Again, think from a Spark developer's point of view (for instance, someone who's creating a new operator). They first have to remember that these two configs exist, understand the differences between the two distributions, and then write code to choose between them accordingly, when in most cases this should just be transparent. This applies not only to developers on Spark itself but also to those developing Spark extensions that replace Spark operators with their own.
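As a hedged illustration of that burden (the operator and type names below are hypothetical stand-ins, not Spark's real `SparkPlan`/`Distribution` API), a new operator's author today effectively has to encode this choice themselves:

```scala
// Hypothetical sketch only: simplified stand-ins for Spark's Distribution types.
sealed trait Distribution
case class ClusteredDistribution(keys: Seq[String]) extends Distribution
case class HashClusteredDistribution(keys: Seq[String]) extends Distribution

// A hypothetical new physical operator: its author must know that join-like
// operators traditionally ask for the stricter distribution (full key match),
// while aggregate-like operators are fine with a partial match.
case class MyNewOperator(clusterKeys: Seq[String], joinLike: Boolean) {
  def requiredChildDistribution: Distribution =
    if (joinLike) HashClusteredDistribution(clusterKeys)
    else ClusteredDistribution(clusterKeys)
}
```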
By limiting the behavior to the config and `Partitioning` etc., the audience affected by this change is much smaller: `Partitioning` is essentially a sealed trait, so one can only add new subclasses within `partitioning.scala`. And yes, I'd certainly remember to check these for the `DataSourcePartitioning` I'm working on :)
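For reference, a toy sketch (not the real `partitioning.scala`) of what the sealed-trait point buys us:

```scala
// Toy sketch: in Scala, all subclasses of a sealed trait must be declared in
// the file that declares the trait.
sealed trait Partitioning
case class HashPartitioning(keys: Seq[String]) extends Partitioning
// A new case such as DataSourcePartitioning could only be added here, inside
// this same file, so its author necessarily sees the existing satisfies
// logic and the config that governs it.
```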
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]