HeartSaVioR edited a comment on pull request #35574:
URL: https://github.com/apache/spark/pull/35574#issuecomment-1047303366


   Thinking out loud, there could be more ideas to solve this. One rough idea:
   
   The loose requirement of ClusteredDistribution aims to avoid shuffles as much as possible, even when the number of partitions is quite small. If we could inject the shuffle at runtime based on stats we could be very adaptive, but I wouldn't expect that. Instead, having a threshold (minimum) on the number of partitions (either heuristically, via a config, or the default number of shuffle partitions if we don't want to introduce another config) doesn't sound crazy to me.
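   
   To make the threshold idea a bit more concrete, here is a minimal sketch of the decision (names such as `needsExtraShuffle` and `minPartitionsThreshold` are made up for illustration; this is not an existing Spark config or API):
   
   ```scala
   object MinPartitionsSketch {
     // Hypothetical decision: force a shuffle when the child partitioning would
     // otherwise be accepted but carries too few partitions. The threshold could
     // be a heuristic, a new config, or the default number of shuffle partitions.
     def needsExtraShuffle(
         childNumPartitions: Int,
         satisfiesClustering: Boolean,
         minPartitionsThreshold: Int): Boolean =
       !satisfiesClustering || childNumPartitions < minPartitionsThreshold
   
     def main(args: Array[String]): Unit = {
       // e.g. a 2-partition child that already satisfies the clustering keys:
       println(needsExtraShuffle(2, satisfiesClustering = true, minPartitionsThreshold = 200)) // true
     }
   }
   ```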
   
   Rationalization: ClusteredDistribution has a requirement for an exact number of partitions, but if I checked right, nothing uses it except AQE. (And it is only used for the physical shuffle node.) We simply consider the current partitioning as ideal whenever it satisfies the distribution requirement. Adjusting the default number of shuffle partitions won't take effect since there is no shuffle, and AQE doesn't help either. Having a threshold (minimum) on the number of partitions would introduce a shuffle in many cases where there is an insufficient number of partitions. It still doesn't solve the case where the child is partitioned on a subset of the grouping keys and happens to have plenty of partitions but is skewed; that is a really unusual case, though, and it is probably OK to require end users to handle it manually.
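   
   To make the rationalization concrete, here is a simplified stand-in (these are not the real Catalyst classes, just toy models of their behavior as I understand it) showing that the satisfaction check only looks at the clustering expressions, so a child hash-partitioned on a subset of the keys with very few partitions is accepted as-is:
   
   ```scala
   // Toy stand-ins for illustration only; not the actual Spark classes.
   case class ClusteredDistributionLike(
       clustering: Seq[String],
       requiredNumPartitions: Option[Int] = None)
   
   case class HashPartitioningLike(expressions: Seq[String], numPartitions: Int) {
     // Mirrors the spirit of the current check: the partition count is ignored
     // unless the distribution pins an exact number of partitions (which, per the
     // rationalization above, effectively only the shuffle node / AQE relies on).
     def satisfies(d: ClusteredDistributionLike): Boolean =
       d.requiredNumPartitions.forall(_ == numPartitions) &&
         expressions.nonEmpty && expressions.forall(d.clustering.contains)
   }
   
   object SatisfiesIgnoresPartitionCount {
     def main(args: Array[String]): Unit = {
       val required = ClusteredDistributionLike(Seq("key1", "key2"))
       // Child partitioned on a subset of the keys with only 2 partitions.
       val child = HashPartitioningLike(Seq("key1"), numPartitions = 2)
   
       // Accepted as-is today, so no shuffle is planned regardless of the count.
       println(child.satisfies(required)) // true
   
       // With a minimum threshold (see the sketch above), this child would
       // instead be flagged for a shuffle: needsExtraShuffle(2, true, 200) == true.
     }
   }
   ```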

