c21 commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r812258739
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
           expressions.length == h.expressions.length &&
             expressions.zip(h.expressions).forall {
               case (l, r) => l.semanticEquals(r)
             }
-      case ClusteredDistribution(requiredClustering, _) =>
-        expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+      case c @ ClusteredDistribution(requiredClustering, _) =>
+        if (SQLConf.get.requireAllClusterKeysForHashPartition) {
Review comment:
> I'm concerned that requiring full keys is a valid requirement we shouldn't simply drop, at least from what I see in the inputs.
Just to make sure we are on the same page: operators such as aggregate and window have been using `ClusteredDistribution` for years, so data skew in those operators is a long-standing problem we are aiming to fix, not a regression coming from recent PRs.
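To make the current behavior concrete, here is a minimal sketch against the catalyst classes touched by this PR (the attribute names and `numPartitions = 200` are made up for illustration):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, HashPartitioning}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()

// e.g. an aggregate grouping on (a, b) only requires clustering on those keys
val required = ClusteredDistribution(Seq(a, b))

// a child that is already hash-partitioned on the single key `a`
val child = HashPartitioning(Seq(a), numPartitions = 200)

// true under the existing subset rule: every partitioning expression appears
// among the required clustering keys, so no extra shuffle is inserted
println(child.satisfies(required))
```

That subset match is exactly where the skew can come from: all rows sharing the same `a` land in one partition regardless of `b`.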
> We are technically changing `ClusteredDistribution`, and we need to make that clear. Otherwise this is also going to be an undocumented change. Instead of just documenting it, I prefer having the flag explicitly, so that any partitioning would know about the detailed requirement very clearly. You don't need to remember the config when you work on `DataSourcePartitioning`; `ClusteredDistribution` will provide it for you.
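(For concreteness, I read the proposal as something like the sketch below. These are plain Scala stand-in types, not the actual catalyst API, and the name `requireAllClusterKeys` is purely illustrative.)

```scala
// Illustrative stand-ins only, NOT the real catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr

case class ClusteredDistribution(
    clustering: Seq[Expr],
    requireAllClusterKeys: Boolean = false) // the proposed flag

// A partitioning would read the requirement from the distribution itself,
// instead of consulting a SQLConf:
def satisfiesClustered(partitionKeys: Seq[Expr], d: ClusteredDistribution): Boolean =
  if (d.requireAllClusterKeys) {
    // full-key match: same number of keys and each one present
    partitionKeys.length == d.clustering.length &&
      partitionKeys.forall(d.clustering.contains)
  } else {
    // existing subset rule
    partitionKeys.forall(d.clustering.contains)
  }

// subset rule -> true; full-key rule -> false:
// satisfiesClustered(Seq(Col("a")), ClusteredDistribution(Seq(Col("a"), Col("b"))))
```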
Hmm, maybe I am thinking too much from my own perspective, but I am still not very convinced this is a problem for `ClusteredDistribution`. Hashing on a subset of keys causing data skew seems to me a problem for `HashPartitioning` only (see the toy example below). Other `Partitioning` implementations such as `RangePartitioning` or `DataSourcePartitioning` can partition data very differently from `HashPartitioning`, or may not use hashing at all. So their data skew might have very different causes than a subset of keys (e.g. a suboptimal sampling algorithm in `RangePartitioning` leading to a bad choice of partition boundaries, or a suboptimal user-defined `DataSourcePartitioning`). I am worried that a flag such as `requiresFullKeysMatch` in `ClusteredDistribution` might be useful only for `HashPartitioning`, but not for the other `Partitioning` classes. And once we introduce the flag, it is hard to change or remove it, because other developers or users depending on `DataSourcePartitioning` might be broken if we change `ClusteredDistribution` again. So I just want to make sure we are very cautious about it.
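As a made-up toy example (plain Scala, no Spark) of why that skew is specific to hashing on a subset of the keys:

```scala
// Clustering keys are (country, userId); the data is fabricated so that
// one country value dominates.
val numPartitions = 4
val rows = Seq.tabulate(1000)(i => ("US", i)) ++ Seq(("CA", 1), ("DE", 2))

def part(key: Any): Int = Math.floorMod(key.hashCode, numPartitions)

// hash on the subset key `country`: the hot country collapses into one partition
val bySubsetKey = rows.groupBy { case (country, _) => part(country) }
// hash on all clustering keys: the same rows spread roughly evenly
val byAllKeys = rows.groupBy { case (country, userId) => part((country, userId)) }

println(bySubsetKey.map { case (p, rs) => p -> rs.size })  // e.g. one partition with ~1000 rows
println(byAllKeys.map { case (p, rs) => p -> rs.size })    // roughly even sizes
```

A range partitioning or a user-defined data source partitioning on the same subset key could distribute these rows completely differently, which is why a hash-specific fix attached to `ClusteredDistribution` worries me.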