HeartSaVioR commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r810708212



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
           expressions.length == h.expressions.length && expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-        case ClusteredDistribution(requiredClustering, _) =>
-          expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+        case c @ ClusteredDistribution(requiredClustering, _) =>
+          if (SQLConf.get.requireAllClusterKeysForHashPartition) {

Review comment:
       We already made a single exception (`spark.sql.requireAllClusterKeysForCoPartition`), and we are going to make a broader exception here (`spark.sql.requireAllClusterKeysForHashPartition`).
   (EDIT: shouldn't we want this for data source partitioning as well?)
   
   There have been valid cases supporting these exceptions, and they are not edge cases arising from odd data distributions. That said, this is effectively a valid requirement for a distribution. In other words, there are known cases where ClusteredDistribution alone does not solve the problem nicely: it tries hard to eliminate as much shuffle as possible (assuming that shuffle is evil), but it does nothing for the cases where shuffle actually helps.
   
   So I don't think adding a flag to ClusteredDistribution messes up the interface and structure; I think it is the opposite. We are making exceptions to the requirements of ClusteredDistribution - requiring the full set of cluster keys is not one of the requirements of ClusteredDistribution as of now (and even with this PR), right? We haven't documented it, and it currently depends entirely on the implementation of HashPartitioning. Someone who starts by looking into ClusteredDistribution is likely to miss that case, and it is also possible we miss the config when implementing DataSourcePartitioning against ClusteredDistribution. I said "trade-off" because the config pinpoints the issue with a small fix, which some may prefer, but my preference is making it clearer, as in the sketch below.
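   Just to illustrate (not proposing exact naming): a minimal, self-contained sketch of the alternative, with a simplified `Expr` stand-in rather than Spark's actual Catalyst classes, and a hypothetical `requireAllClusterKeys` flag that does not exist in this PR:

```scala
// Stand-in for Catalyst's Expression; only semantic equality matters here.
case class Expr(canonical: String) {
  def semanticEquals(other: Expr): Boolean = canonical == other.canonical
}

// Hypothetical: the distribution itself carries the "full keys" requirement,
// so every Partitioning implementation is forced to see it.
case class ClusteredDistribution(
    clustering: Seq[Expr],
    requireAllClusterKeys: Boolean = false)

case class HashPartitioning(expressions: Seq[Expr]) {
  def satisfies(required: ClusteredDistribution): Boolean =
    if (required.requireAllClusterKeys) {
      // exact match: same number of keys, pairwise semantically equal
      expressions.length == required.clustering.length &&
        expressions.zip(required.clustering).forall { case (l, r) => l.semanticEquals(r) }
    } else {
      // subset match: every partitioning expression is one of the cluster keys
      expressions.forall(e => required.clustering.exists(_.semanticEquals(e)))
    }
}

object Demo extends App {
  val p = HashPartitioning(Seq(Expr("a")))
  println(p.satisfies(ClusteredDistribution(Seq(Expr("a"), Expr("b")))))  // true: subset match
  println(p.satisfies(ClusteredDistribution(Seq(Expr("a"), Expr("b")), requireAllClusterKeys = true)))  // false
}
```

   With this shape, DataSourcePartitioning (or any other Partitioning) cannot accidentally skip the check, because the requirement is part of the distribution itself rather than a SQLConf lookup buried in one implementation.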
   
   If we are skeptical about addressing this in ClusteredDistribution because we don't want its requirements extended further, that is actually a good rationale for reviving HashClusteredDistribution, because its requirement fits 100% of what we are doing here. The only difference is that data source partitioning would have no way to satisfy that requirement, which may be considered a major downside given the roadmap - so yes, it makes data source partitioning second-class for some cases. If we are against that, please make sure ClusteredDistribution covers "by definition" everything HashClusteredDistribution could cover on its own. (See the sketch of HashClusteredDistribution's semantics below.)
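   For reference, a hedged sketch of the semantics HashClusteredDistribution had before its removal, as I recall them, again with the simplified `Expr` stand-in rather than the real Catalyst types:

```scala
// Stand-in for Catalyst's Expression, as in the previous sketch.
case class Expr(canonical: String) {
  def semanticEquals(other: Expr): Boolean = canonical == other.canonical
}

// Requires hash partitioning on exactly these expressions, in this order.
case class HashClusteredDistribution(expressions: Seq[Expr])

case class HashPartitioning(expressions: Seq[Expr]) {
  // Satisfied only by a full, pairwise semantic match: exactly the behavior
  // the new `requireAllClusterKeysForHashPartition` config opts into.
  def satisfies(h: HashClusteredDistribution): Boolean =
    expressions.length == h.expressions.length &&
      expressions.zip(h.expressions).forall { case (l, r) => l.semanticEquals(r) }
}
```

   A data-source-provided partitioning would never match this, since only Spark's own HashPartitioning over exactly these keys satisfies it - which is precisely the second-class concern mentioned above.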



