HeartSaVioR edited a comment on pull request #35574:
URL: https://github.com/apache/spark/pull/35574#issuecomment-1047303366


   Thinking out loud, there could be more ideas to solve this. One rough idea:
   
   The loose requirement of ClusteredDistribution aims to avoid shuffles as much as possible, even when the number of partitions is quite small. If we could inject the shuffle at runtime based on stats we would be very adaptive, but I wouldn't expect that. Instead, having a threshold (a minimum) on the number of partitions (either heuristically, via a config, or the default number of shuffle partitions if we don't want to introduce another config) doesn't sound crazy to me; a rough sketch follows below.
   
   Rationalization: ClusteredDistribution has a requirement for the exact number of partitions, but if I checked correctly, nothing uses it except AQE (and it is only used for the physical shuffle node). We simply consider the current partitioning ideal whenever it satisfies the distribution requirement, so adjusting the default number of shuffle partitions won't take effect since there is no shuffle. Having a threshold (minimum) on the number of partitions would inject a shuffle in many cases where there is an insufficient number of partitions. It still doesn't cover the case where the child is partitioned by a subset of the grouping keys and happens to have plenty of partitions but is skewed. That is really an unusual case, though, and it is probably OK to ask end users to handle it manually.
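   
   Sketching the planner side of the idea (again only a rough model, not the real `EnsureRequirements` / `ClusteredDistribution` / `HashPartitioning` code): today the satisfaction check only looks at the clustering keys; the proposal adds a partition-count floor on top of it.
   
   ```scala
   // Rough model of the proposal; the real classes live in
   // org.apache.spark.sql.catalyst.plans.physical and are more involved.
   case class ClusteredDistributionSketch(clustering: Seq[String])
   
   case class HashPartitioningSketch(expressions: Seq[String], numPartitions: Int) {
     // Today's check: the child partitioning is accepted as soon as it clusters
     // by (a subset of) the required keys, no matter how few partitions it has.
     def satisfiesKeys(required: ClusteredDistributionSketch): Boolean =
       expressions.nonEmpty && expressions.forall(required.clustering.contains)
   }
   
   object ShuffleDecisionSketch {
     /** Proposal: also require a minimum number of partitions before skipping the shuffle. */
     def needsShuffle(
         child: HashPartitioningSketch,
         required: ClusteredDistributionSketch,
         minPartitions: Int): Boolean = {
       !child.satisfiesKeys(required) || child.numPartitions < minPartitions
     }
   }
   ```
   
   With such a floor in place, a child hash-partitioned on the grouping keys into, say, 2 partitions would now get shuffled, while the skewed-but-many-partitions case mentioned above would still pass through untouched.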

