zhengchenyu commented on PR #44512:
URL: https://github.com/apache/spark/pull/44512#issuecomment-1871886874

   @Ngone51 
   Thanks for your review! Combining is the key problem. 
   I still combine, but combining inside ExternalSorter never triggers an extra 
spill. Combining happens in insertAllAndUpdateMetrics and during merge:
   (1) Combine on insert: look up the existing key and fold the new value into 
its combiner, entirely in memory. 
   (2) Combine on merge: iterate the sorted iterator and keep combining until 
the next key occurs, again entirely in memory.
   Neither step is an additional operation; both were already performed inside 
ExternalSorter. The sketch below illustrates the two combine points.
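   
   For concreteness, here is a minimal Scala sketch of the two combine points. This is illustrative only, not the actual Spark code; the object and method names (`CombineSketch`, `insertWithCombine`, `mergeWithCombine`) are mine, while `createCombiner`/`mergeValue`/`mergeCombiners` play the roles of the Aggregator functions:
   
   ```scala
   import scala.collection.mutable
   
   object CombineSketch {
     // (1) Combine on insert: fold the incoming value into the combiner
     // already stored for its key; no extra pass, no extra spill.
     def insertWithCombine[K, V, C](
         buffer: mutable.Map[K, C],
         key: K,
         value: V,
         createCombiner: V => C,
         mergeValue: (C, V) => C): Unit = {
       buffer(key) = buffer.get(key) match {
         case Some(c) => mergeValue(c, value)  // key already seen: combine in memory
         case None    => createCombiner(value) // first occurrence of the key
       }
     }
   
     // (2) Combine on merge: given an iterator already sorted by key, keep
     // merging combiners until the next key occurs, then emit the pair.
     def mergeWithCombine[K, C](
         sorted: Iterator[(K, C)],
         mergeCombiners: (C, C) => C): Iterator[(K, C)] =
       new Iterator[(K, C)] {
         private val buffered = sorted.buffered
         def hasNext: Boolean = buffered.hasNext
         def next(): (K, C) = {
           var (k, c) = buffered.next()
           // Absorb every record with the same key before emitting it.
           while (buffered.hasNext && buffered.head._1 == k) {
             c = mergeCombiners(c, buffered.next()._2)
           }
           (k, c)
         }
       }
   }
   ```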
   
   Before the change, we used ExternalAppendOnlyMap to combine; if memory 
exceeded the threshold, it spilled to disk. Then we used ExternalSorter to 
sort; if memory exceeded the threshold, it spilled to disk again. **This means 
we may spill twice (ExternalAppendOnlyMap and ExternalSorter) if the shuffle 
data is huge.**
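   
   To make the "before" state concrete, here is a simplified sketch of that two-stage read path, modeled on the shuffle reader. `ExternalSorter` and `Aggregator` are Spark-internal classes and the helper name `combineThenSort` is hypothetical, so treat this as an illustration rather than the exact code:
   
   ```scala
   import org.apache.spark.{Aggregator, TaskContext}
   import org.apache.spark.util.collection.ExternalSorter
   
   // Stage 1 combines, stage 2 sorts; each structure can spill on its own.
   def combineThenSort[K, V, C](
       records: Iterator[Product2[K, V]],
       aggregator: Aggregator[K, V, C],
       keyOrd: Ordering[K],
       context: TaskContext): Iterator[Product2[K, C]] = {
     // ExternalAppendOnlyMap under the hood: possible spill #1.
     val aggregatedIter = aggregator.combineValuesByKey(records, context)
     // ExternalSorter for the key ordering: possible spill #2.
     val sorter = new ExternalSorter[K, C, C](context, ordering = Some(keyOrd))
     sorter.insertAll(aggregatedIter)
     sorter.iterator
   }
   ```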
   
   We know that we can combine during the sorting process. 
   After the change, we use only ExternalSorter; combining occurs during its 
insert and merge phases, so there is no need to call ExternalAppendOnlyMap to 
combine at all. **This means we spill at most once (ExternalSorter).**
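   
   And the "after" state, again as a hedged sketch under the same assumptions (the helper name `sortAndCombine` is hypothetical): one ExternalSorter carries both the aggregator and the ordering, so only a single structure can ever spill:
   
   ```scala
   import org.apache.spark.{Aggregator, TaskContext}
   import org.apache.spark.util.collection.ExternalSorter
   
   // Single pass: combine-on-insert while inserting, combine-on-merge while
   // reading back; at most one structure (ExternalSorter) ever spills.
   def sortAndCombine[K, V, C](
       records: Iterator[Product2[K, V]],
       aggregator: Aggregator[K, V, C],
       keyOrd: Ordering[K],
       context: TaskContext): Iterator[Product2[K, C]] = {
     val sorter = new ExternalSorter[K, V, C](
       context, aggregator = Some(aggregator), ordering = Some(keyOrd))
     sorter.insertAll(records) // the only place a spill can happen
     sorter.iterator
   }
   ```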
   
   Spilling is time-consuming. For a shuffle that both sorts and combines, 
this PR saves time when the shuffle data is huge.
   
   

