cloud-fan commented on PR #38799:
URL: https://github.com/apache/spark/pull/38799#issuecomment-1329040563

   This PR basically adds a per-window-group limit before and after the shuffle 
to reduce the input data of window processing.
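   
   For context, the kind of query this targets is a top-k-per-group query with a 
rank filter. A made-up, runnable example (table/column names are illustrative only, 
assuming spark-sql is on the classpath):
   ```scala
   import org.apache.spark.sql.SparkSession
   
   object WindowGroupLimitExample {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("window-group-limit-example")
         .master("local[*]")
         .getOrCreate()
       import spark.implicits._
   
       // Made-up sample data: (store, product, sales).
       Seq(("s1", "a", 10), ("s1", "b", 20), ("s1", "c", 30), ("s2", "a", 5))
         .toDF("store", "product", "sales")
         .createOrReplaceTempView("t")
   
       // The rank filter (rn <= 10) bounds each window group, so a per-group
       // limit can drop the extra rows before/after the shuffle.
       spark.sql(
         """SELECT * FROM (
           |  SELECT *, row_number() OVER (PARTITION BY store ORDER BY sales DESC) AS rn
           |  FROM t
           |) WHERE rn <= 10""".stripMargin).show()
   
       spark.stop()
     }
   }
   ```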
   
   More specifically, the before-shuffle per-window-group limit:
   1. adds an extra local sort to determine window group boundaries
   2. applies a per-group limit to reduce the data size for the shuffle and all 
the downstream operators.
   
   This is beneficial, assuming the per-group data size is large. Otherwise, 
the extra local sort is pure overhead.
   
   The after-shuffle per-window-group limit just applies a per-group limit to 
reduce the data size of window processing. This is very cheap, as it only needs 
to iterate the sorted data once (the window operator needs to sort its input 
anyway) and do some row comparisons to determine group boundaries and rank 
values. It would be more efficient to merge it into the window operator, but 
that's probably not worth it as the overhead is small.
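   
   To make that concrete, here is a minimal sketch (plain Scala, not the actual 
operator) of a per-group limit over already-sorted rows, using a row_number-style 
rank:
   ```scala
   object PerGroupLimitSketch {
     // Assumes `sortedRows` is already sorted by the partition key (and by the
     // order key within each group), and keeps at most `limit` rows per group.
     def perGroupLimit[K, R](
         sortedRows: Iterator[R],
         partitionKey: R => K,
         limit: Int): Iterator[R] = {
       var currentKey: Option[K] = None
       var rank = 0
       sortedRows.filter { row =>
         val key = partitionKey(row)
         if (!currentKey.contains(key)) {
           // A row comparison found a new group boundary: reset the rank.
           currentKey = Some(key)
           rank = 0
         }
         rank += 1
         rank <= limit // drop rows whose rank exceeds the per-group limit
       }
     }
   
     def main(args: Array[String]): Unit = {
       val rows = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1))
       // Keeps at most 2 rows per group: List((a,1), (a,2), (b,1))
       println(perGroupLimit(rows.iterator, (r: (String, Int)) => r._1, 2).toList)
     }
   }
   ```
   It's a single pass with only key comparisons, which is why the cost is small.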
   
   I think the key here is to make sure the before-shuffle per-group limit is 
very unlikely to introduce regressions. It's very hard to know the per-group 
data size ahead of time (can CBO help?), and we need some heuristics to trigger 
it. Some thoughts:
   1. shall we add a config as the threshold for the limit? e.g. if we set the 
config to 10, then `where rn = 11` won't trigger it (see the sketch after this 
list).
   2. shall we have a special sort that stops early when it detects the input is 
mostly distinct? That would mean the per-group data size is small and we 
shouldn't apply this optimization.
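   
   For point 1, a rough sketch of that threshold heuristic (the config value and 
the filter representation are simplified/hypothetical here, not the actual rule):
   ```scala
   object GroupLimitHeuristicSketch {
     // Simplified stand-in for the rank filter extracted from the plan.
     sealed trait RankCondition
     case class EqualTo(k: Int) extends RankCondition         // rn = k
     case class LessThan(k: Int) extends RankCondition        // rn < k
     case class LessThanOrEqual(k: Int) extends RankCondition // rn <= k
   
     // Per-group limit implied by the filter.
     def impliedLimit(cond: RankCondition): Int = cond match {
       case EqualTo(k)         => k
       case LessThan(k)        => k - 1
       case LessThanOrEqual(k) => k
     }
   
     // Only add the before-shuffle limit when the implied limit is at most the
     // (hypothetical) configured threshold.
     def shouldAddBeforeShuffleLimit(cond: RankCondition, threshold: Int): Boolean =
       impliedLimit(cond) <= threshold
   
     def main(args: Array[String]): Unit = {
       println(shouldAddBeforeShuffleLimit(LessThanOrEqual(5), threshold = 10)) // true
       println(shouldAddBeforeShuffleLimit(EqualTo(11), threshold = 10))        // false
     }
   }
   ```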

