HeartSaVioR edited a comment on pull request #35574:
URL: https://github.com/apache/spark/pull/35574#issuecomment-1047303366


   Thinking out loud, there could be more ideas to solve this. One rough idea:
   
   The loose requirement of ClusteredDistribution aims to avoid shuffles as much 
as possible, even when the number of partitions is quite small. If we could 
inject a shuffle at runtime based on stats (or, even more ambitiously, scale 
adaptively within a running stage), we could be very adaptive, but I wouldn't 
expect that to happen in the near future. Instead, having a threshold (minimum) 
on the number of partitions doesn't sound crazy to me. The threshold could be a 
heuristic or a config: a fixed number, a ratio relative to the default number 
of shuffle partitions, or the default number of shuffle partitions itself if we 
don't want to introduce another config (though that may be too high to use as a 
minimum).
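
   To make the three threshold options concrete, here is a minimal Python 
sketch. It is not Spark code: `resolve_min_partitions`, `fixed_min`, and 
`ratio` are hypothetical names used only for illustration, and the precedence 
between the options is an assumption.

```python
import math
from typing import Optional


def resolve_min_partitions(
    default_shuffle_partitions: int,
    fixed_min: Optional[int] = None,   # hypothetical config: absolute minimum
    ratio: Optional[float] = None,     # hypothetical config: fraction of the default
) -> int:
    """Pick the minimum number of partitions below which a shuffle is forced.

    Assumed precedence: explicit number, then ratio, then fall back to the
    default number of shuffle partitions (no extra config at all).
    """
    if default_shuffle_partitions < 1:
        raise ValueError("default_shuffle_partitions must be >= 1")
    if fixed_min is not None:
        if fixed_min < 1:
            raise ValueError("fixed_min must be >= 1")
        return fixed_min
    if ratio is not None:
        if not 0 < ratio <= 1:
            raise ValueError("ratio must be in (0, 1]")
        # Round up so that a tiny ratio never yields a threshold of zero.
        return max(1, math.ceil(default_shuffle_partitions * ratio))
    # No extra config: reuse the default, which may be too high as a minimum.
    return default_shuffle_partitions
```

   With a default of 200 shuffle partitions, a ratio of 0.25 gives a minimum 
of 50, while the no-config fallback requires the full 200.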
   
   Rationale: ClusteredDistribution can carry a requirement for an exact number 
of partitions, but if I checked correctly, nothing uses it except AQE. (And it 
is only used for the physical shuffle node.) We simply treat the current 
partitioning as ideal whenever it satisfies the distribution requirement. 
Adjusting the default number of shuffle partitions has no effect because there 
is no shuffle, and AQE doesn't help either. A threshold (minimum) on the number 
of partitions would introduce a shuffle in many of the cases where the number 
of partitions is insufficient. It still doesn't solve the case where the child 
is partitioned on a subset of the grouping keys and has plenty of partitions, 
but they are skewed. That is a really unusual case that we have no good way to 
pinpoint, though, and it is probably unavoidable to have end users handle it 
manually.
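
   The decision described above can be sketched as a toy model in Python. 
This is not how Spark's planner is implemented: `ChildPartitioning` and 
`needs_shuffle` are hypothetical names, and "satisfies the distribution" is 
simplified here to "the child's partitioning keys are a non-empty subset of 
the clustering keys".

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ChildPartitioning:
    """Hypothetical stand-in for the child's hash partitioning."""
    keys: FrozenSet[str]
    num_partitions: int


def needs_shuffle(
    child: ChildPartitioning,
    clustering_keys: Iterable[str],
    min_partitions: int,
) -> bool:
    """Return True if a shuffle should be planned on top of the child."""
    required = frozenset(clustering_keys)
    # Loose requirement: partitioning on a subset of the clustering keys
    # already co-locates all rows that share the full key.
    satisfies = len(child.keys) > 0 and child.keys <= required
    if not satisfies:
        return True
    # Proposed addition: too few partitions also forces a shuffle.
    return child.num_partitions < min_partitions
```

   In this model, a child partitioned on `a` with 2 partitions gets a shuffle 
when the minimum is 50, but a child partitioned on `a` with 200 skewed 
partitions does not, since the check only sees the partition count. That is the 
remaining case that end users would have to handle manually.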


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]