HeartSaVioR edited a comment on pull request #35574: URL: https://github.com/apache/spark/pull/35574#issuecomment-1047303366
Thinking out loud, there could be more ideas to solve this.

One rough idea: The loose requirement of ClusteredDistribution aims to avoid shuffles as much as possible, even when the number of partitions is quite small. If we could inject the shuffle at runtime based on stats (or, even crazier, adaptively scale within a running stage), we could be very adaptive, but I wouldn't expect that to happen in the near future.

Instead, having a threshold (minimum) on the number of partitions doesn't sound crazy to me. The threshold could be a heuristic or a config: either an absolute number or a ratio relative to the default number of shuffle partitions. If we don't want to bring in another config, it could simply be the default number of shuffle partitions, though that may be too high to use as a minimum.

Rationale: ClusteredDistribution has a requirement for an exact number of partitions, but if I checked right, nothing uses it except AQE. (And it is only used for the physical shuffle node.) We simply consider the current partitioning ideal whenever it satisfies the distribution requirement. Adjusting the default number of shuffle partitions won't take effect since there is no shuffle, and AQE doesn't help either.

Having a threshold (minimum) on the number of partitions would introduce a shuffle in many of the cases where the number of partitions is insufficient. It still doesn't solve the case where the child is partitioned on sub-group keys and has plenty of partitions, but they are skewed. That is a really unusual case that we don't have a good way to pinpoint, so it is probably unavoidable to require end users to handle it manually.
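To make the threshold idea a bit more concrete, here is a rough sketch of the decision as a toy model. It is written in Python purely for illustration and does not touch any Spark API: `Partitioning`, `satisfies_clustered`, `min_partitions`, and `needs_shuffle` are all hypothetical names, and the subset check is a simplified stand-in for the loose ClusteredDistribution requirement.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Partitioning:
    """Toy stand-in for a child's output partitioning (not a Spark class)."""
    keys: Tuple[str, ...]
    num_partitions: int


def satisfies_clustered(child: Partitioning, clustering_keys: Tuple[str, ...]) -> bool:
    # Simplified "loose" requirement: being partitioned on any non-empty
    # subset of the clustering keys is treated as good enough.
    return bool(child.keys) and set(child.keys) <= set(clustering_keys)


def min_partitions(default_shuffle_partitions: int,
                   fixed: Optional[int] = None,
                   ratio: Optional[float] = None) -> int:
    # Hypothetical threshold: an absolute number, a ratio of the default
    # number of shuffle partitions, or the default itself as a fallback.
    if fixed is not None:
        return fixed
    if ratio is not None:
        return max(1, int(default_shuffle_partitions * ratio))
    return default_shuffle_partitions


def needs_shuffle(child: Partitioning,
                  clustering_keys: Tuple[str, ...],
                  threshold: int) -> bool:
    # Shuffle when the distribution is not satisfied at all, or when it is
    # satisfied but with fewer partitions than the minimum threshold.
    if not satisfies_clustered(child, clustering_keys):
        return True
    return child.num_partitions < threshold


threshold = min_partitions(200, ratio=0.25)
few = needs_shuffle(Partitioning(("a",), 4), ("a", "b"), threshold)
many = needs_shuffle(Partitioning(("a",), 200), ("a", "b"), threshold)
```

With these assumed numbers, the child with 4 partitions gets a shuffle and the child with 200 partitions does not. The second case is exactly the one the threshold cannot catch: plenty of partitions on sub-group keys, possibly skewed.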
