HeartSaVioR commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r810708212
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -261,8 +261,16 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
         expressions.length == h.expressions.length &&
           expressions.zip(h.expressions).forall {
             case (l, r) => l.semanticEquals(r)
           }
-      case ClusteredDistribution(requiredClustering, _) =>
-        expressions.forall(x => requiredClustering.exists(_.semanticEquals(x)))
+      case c @ ClusteredDistribution(requiredClustering, _) =>
+        if (SQLConf.get.requireAllClusterKeysForHashPartition) {
Review comment:
We already made a single exception
(`spark.sql.requireAllClusterKeysForCoPartition`), and we are going to make a
broader exception here (`spark.sql.requireAllClusterKeysForHashPartition`).
(EDIT: shouldn't we want this for data source partitioning as well?)
There have been valid cases motivating these exceptions, and they are not
edge cases arising from odd data distributions. That said, this is going to be
a valid requirement for a distribution. In other words, there are known cases
where ClusteredDistribution alone does not solve the problem nicely. It tries
hard to eliminate as many shuffles as possible (on the assumption that shuffle
is evil), but it does nothing for the cases where a shuffle actually helps.
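To make the trade-off concrete, here is a minimal standalone sketch (not Spark code: plain strings stand in for Catalyst expressions, and a boolean parameter stands in for `spark.sql.requireAllClusterKeysForHashPartition`; the strict branch is modeled as an exact in-order match, which follows the shape of the exact-match check visible in the diff context above) of how the flag changes what `HashPartitioning` accepts:

```scala
// Simplified, self-contained model of the semantics under discussion.
// Real Spark compares Expression trees via semanticEquals and reads SQLConf;
// here plain strings and a boolean stand in for those.
case class ClusteredDistribution(requiredClustering: Seq[String])

case class HashPartitioning(expressions: Seq[String]) {
  // requireAllClusterKeys models spark.sql.requireAllClusterKeysForHashPartition
  def satisfies(d: ClusteredDistribution, requireAllClusterKeys: Boolean): Boolean = {
    if (requireAllClusterKeys) {
      // Strict: partition keys must cover the full clustering key list
      // (modeled here as an exact, in-order match).
      expressions == d.requiredClustering
    } else {
      // Lenient (existing behavior): any subset of the clustering keys is
      // enough, which can skip a shuffle that would actually have helped.
      expressions.forall(d.requiredClustering.contains)
    }
  }
}

object Demo extends App {
  val dist = ClusteredDistribution(Seq("a", "b"))
  val partial = HashPartitioning(Seq("a"))
  println(partial.satisfies(dist, requireAllClusterKeys = false)) // subset accepted
  println(partial.satisfies(dist, requireAllClusterKeys = true))  // subset rejected
}
```

The point of the sketch: the lenient branch is what "eliminates shuffles" today, and the flag only carves out an exception inside one Partitioning implementation rather than expressing the requirement on the distribution itself.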
So I don't think adding a flag to ClusteredDistribution messes up the
interface and structure. I think it is the opposite. We are making exceptions
to the requirement of ClusteredDistribution: requiring the full set of keys is
not one of the requirements of ClusteredDistribution as of now (and even with
this PR), right? We haven't documented it, and it now depends entirely on the
implementation of HashPartitioning. If someone starts to look into
ClusteredDistribution, they are likely to miss this case. It is also possible
we miss the config when implementing DataSourcePartitioning against
ClusteredDistribution. I said "trade-off" because this PR pinpoints the issue
and brings a small, targeted fix, which some may prefer, but my preference is
to make it clearer.
If we are skeptical about addressing this in ClusteredDistribution because
we don't want its requirements to be extended further, that is actually a good
rationale for reviving HashClusteredDistribution, because its requirement is a
100% fit for what we are doing here. The only difference is that data source
partitioning would not be able to satisfy that requirement in any way, which
may be considered a major downside given the roadmap; so yes, it makes data
source partitioning second class for some cases. If we are against that,
please make sure ClusteredDistribution covers, by definition, everything
HashClusteredDistribution could cover by itself.
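To illustrate the second-class concern, here is a standalone sketch (again with plain strings standing in for Catalyst expressions; `DataSourcePartitioning` here is a simplified stand-in for a partitioning reported by a data source, not the real Spark class) contrasting what each distribution demands:

```scala
// Standalone model: two distributions with different strictness, and two
// partitionings trying to satisfy them. Not Spark code.
sealed trait Distribution
case class ClusteredDistribution(clustering: Seq[String]) extends Distribution
// Requirement modeled on the exact-match check in the diff context above:
// same number of keys, matched pairwise in order.
case class HashClusteredDistribution(keys: Seq[String]) extends Distribution

sealed trait Partitioning { def satisfies(d: Distribution): Boolean }

case class HashPartitioning(expressions: Seq[String]) extends Partitioning {
  def satisfies(d: Distribution): Boolean = d match {
    case ClusteredDistribution(clustering) =>
      expressions.forall(clustering.contains)  // any subset of the keys is enough
    case HashClusteredDistribution(keys) =>
      expressions == keys                      // full keys, exact order
  }
}

// A data-source-reported partitioning can honor a clustering, but (in this
// model) never a hash-function-specific requirement: the "second class" downside.
case class DataSourcePartitioning(clustering: Seq[String]) extends Partitioning {
  def satisfies(d: Distribution): Boolean = d match {
    case ClusteredDistribution(c)     => clustering.forall(c.contains)
    case _: HashClusteredDistribution => false
  }
}
```

In this model, requiring HashClusteredDistribution expresses "full keys required" precisely, but shuts out the data source partitioning entirely; that is the trade-off the comment asks to be decided deliberately rather than hidden inside HashPartitioning.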
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]