c21 commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r812258739
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
           expressions.length == h.expressions.length &&
             expressions.zip(h.expressions).forall {
               case (l, r) => l.semanticEquals(r)
             }
-      case ClusteredDistribution(requiredClustering, _) =>
-        expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+      case c @ ClusteredDistribution(requiredClustering, _) =>
+        if (SQLConf.get.requireAllClusterKeysForHashPartition) {
Review comment:
> I'm concerned that requiring full keys is a valid requirement we shouldn't simply drop, at least from what I see in the inputs.
Just to make sure we are on the same page: operators such as aggregate and window have been using `ClusteredDistribution` for years, so data skew in those operators is a long-standing problem we are aiming to fix, not a regression coming from recent PRs.
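To make the current behavior concrete, here is a minimal sketch against the catalyst classes touched by this PR (the attribute names and `numPartitions = 200` are made up for illustration):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, HashPartitioning}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()

// e.g. an aggregate grouping on (a, b) only requires clustering on those keys
val required = ClusteredDistribution(Seq(a, b))

// a child that is already hash-partitioned on the single key `a`
val child = HashPartitioning(Seq(a), numPartitions = 200)

// true under the existing subset rule: every partitioning expression appears
// among the required clustering keys, so no extra shuffle is inserted
println(child.satisfies(required))
```

That subset match is exactly where the skew can come from: all rows sharing the same `a` land in one partition regardless of `b`.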
> We are technically changing `ClusteredDistribution`, and we need to make that clear. Otherwise this is also going to be an undocumented change. Instead of just documenting it, I prefer having the flag explicitly, so that any partitioning would know about the detailed requirement very clearly. You don't need to remember the config when you work on `DataSourcePartitioning`; `ClusteredDistribution` will provide it for you.
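(For concreteness, I read the proposal as something like the sketch below. These are plain Scala stand-in types, not the actual catalyst API, and the name `requireAllClusterKeys` is purely illustrative.)

```scala
// Illustrative stand-ins only, NOT the real catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr

case class ClusteredDistribution(
    clustering: Seq[Expr],
    requireAllClusterKeys: Boolean = false) // the proposed flag

// A partitioning would read the requirement from the distribution itself,
// instead of consulting a SQLConf:
def satisfiesClustered(partitionKeys: Seq[Expr], d: ClusteredDistribution): Boolean =
  if (d.requireAllClusterKeys) {
    // full-key match: same number of keys and each one present
    partitionKeys.length == d.clustering.length &&
      partitionKeys.forall(d.clustering.contains)
  } else {
    // existing subset rule
    partitionKeys.forall(d.clustering.contains)
  }

// subset rule -> true; full-key rule -> false:
// satisfiesClustered(Seq(Col("a")), ClusteredDistribution(Seq(Col("a"), Col("b"))))
```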
Hmm, maybe I am thinking too much from my own perspective, but I am still not very convinced this is a problem for `ClusteredDistribution`. Hashing on a subset of keys causing data skew seems to me a problem for `HashPartitioning` only (see the toy example below). Other `Partitioning` implementations such as `RangePartitioning` or `DataSourcePartitioning` can partition data very differently from `HashPartitioning`, or may not use hashing at all. So their data skew might have very different causes than a subset of keys (e.g. a suboptimal sampling algorithm in `RangePartitioning` leading to a bad choice of partition boundaries, or a suboptimal user-defined `DataSourcePartitioning`). I am worried that a flag such as `requiresFullKeysMatch` in `ClusteredDistribution` might be useful only for `HashPartitioning`, but not for the other `Partitioning` classes. And once we introduce the flag, it is hard to change or remove it, because other developers or users depending on `DataSourcePartitioning` might be broken if we change `ClusteredDistribution` again. So I just want to make sure we are very cautious about it.
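As a made-up toy example (plain Scala, no Spark) of why that skew is specific to hashing on a subset of the keys:

```scala
// Clustering keys are (country, userId); the data is fabricated so that
// one country value dominates.
val numPartitions = 4
val rows = Seq.tabulate(1000)(i => ("US", i)) ++ Seq(("CA", 1), ("DE", 2))

def part(key: Any): Int = Math.floorMod(key.hashCode, numPartitions)

// hash on the subset key `country`: the hot country collapses into one partition
val bySubsetKey = rows.groupBy { case (country, _) => part(country) }
// hash on all clustering keys: the same rows spread roughly evenly
val byAllKeys = rows.groupBy { case (country, userId) => part((country, userId)) }

println(bySubsetKey.map { case (p, rs) => p -> rs.size })  // e.g. one partition with ~1000 rows
println(byAllKeys.map { case (p, rs) => p -> rs.size })    // roughly even sizes
```

A range partitioning or a user-defined data source partitioning on the same subset key could distribute these rows completely differently, which is why a hash-specific fix attached to `ClusteredDistribution` worries me.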