viirya commented on pull request #35552: URL: https://github.com/apache/spark/pull/35552#issuecomment-1043228323
> We figured out that HashClusteredDistribution is still desirable in some cases even without stateful operators; HashPartitioning with subset of grouping keys can satisfy ClusteredDistribution, which means the cardinality of the subset of grouping keys technically defines the max parallelism. Increasing the number of partitions does not always help to solve the skew of the partitions.

I think this is understandable. It'd be better if you could provide an example in the description.

But I'm a bit confused about how it links to this renaming effort. Do you mean that, because `StatefulOpClusteredDistribution` is not only for stateful operations, you propose to rename it back? Since it was removed and renamed before, is there any place that needs to use `HashClusteredDistribution` now?
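
For illustration only, here is a minimal, Spark-free sketch in Python of the point in the quoted paragraph. It is not code from the PR: `partition_id`, `non_empty_partitions`, and the sample rows are hypothetical, and Python's built-in `hash` merely stands in for a hash partitioner. It hashes rows on a subset of the grouping keys and counts the partitions that actually receive data; that count cannot exceed the number of distinct values of the subset, however many partitions are requested.

```python
def partition_id(key, num_partitions):
    # Stand-in for a hash partitioner (assumption): built-in hash modulo partition count.
    return hash(key) % num_partitions


def non_empty_partitions(rows, key_columns, num_partitions):
    """Count partitions that receive at least one row when hashing on key_columns."""
    used = {
        partition_id(tuple(row[c] for c in key_columns), num_partitions)
        for row in rows
    }
    return len(used)


# Hypothetical data with grouping keys (a, b): `a` has only 4 distinct values,
# while the pair (a, b) has 1000 distinct values.
rows = [{"a": i % 4, "b": i % 1000} for i in range(10_000)]

for n in (8, 64, 512):
    subset = non_empty_partitions(rows, ["a"], n)      # hash on the subset only
    full = non_empty_partitions(rows, ["a", "b"], n)   # hash on all grouping keys
    print(f"partitions={n}: subset keys fill {subset}, full keys fill {full}")
```

Hashing on `a` alone never fills more than 4 partitions in this sketch, so raising the partition count does nothing for parallelism or skew, whereas hashing on the full key set spreads the rows much further.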
