[GitHub] [spark] viirya commented on pull request #35552: [SPARK-38237][SQL][SS] Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

GitBox Thu, 17 Feb 2022 09:39:17 -0800


viirya commented on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1043228323



   > We figured out that HashClusteredDistribution is still desirable in some 
cases even without stateful operators; HashPartitioning with subset of grouping 
keys can satisfy ClusteredDistribution, which means the cardinality of the 
subset of grouping keys technically defines the max parallelism. Increasing the 
number of partitions does not always help to solve the skew of the partitions.
   
   I think this is understandable. It'd be better if you could provide an example 
in the description. But I'm a bit confused about how it links to this renaming 
effort. Do you mean that because `StatefulOpClusteredDistribution` is not only 
for stateful operations, you propose to rename it back? Since it was removed and 
renamed before, do we have any place that needs to use 
`HashClusteredDistribution` now?
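
   As a rough illustration of the quoted point, here is a toy sketch in plain 
Python. It is not Spark code and does not use Spark's hash function; the column 
names (`country`, `user_id`), the data, and the helpers `partition_rows` and 
`non_empty` are all hypothetical. It only shows that when rows are hashed on a 
subset of the grouping keys, the number of non-empty partitions cannot exceed 
the number of distinct values of that subset, no matter how many partitions 
are requested.

```python
import zlib

# Toy model only: plain Python, not Spark's partitioner or hash function.

def partition_rows(rows, key_columns, num_partitions):
    """Assign each row to a partition by hashing only the given key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        # crc32 is used here just to get a deterministic hash for the demo.
        bucket = zlib.crc32(repr(key).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions

def non_empty(partitions):
    """Count partitions that received at least one row."""
    return sum(1 for p in partitions if p)

# Hypothetical data: grouping keys are (country, user_id);
# country has only 3 distinct values, user_id has 1000.
rows = [
    {"country": c, "user_id": u}
    for c in ("US", "KR", "DE")
    for u in range(1000)
]

for n in (4, 64, 1024):
    by_subset = partition_rows(rows, ["country"], n)
    by_full = partition_rows(rows, ["country", "user_id"], n)
    # Hashing on the subset never fills more than 3 partitions.
    print(n, non_empty(by_subset), non_empty(by_full))
```

   In this sketch, raising `n` spreads the rows further only when all grouping 
keys are hashed. With the `country` subset, every row for a given country 
still lands in the same partition, so any skew remains.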


