Re: [PR] [SPARK-46512][CORE] Optimize shuffle reading when both sort and combine are used. [spark]

via GitHub Sun, 25 Feb 2024 17:17:56 -0800


zhengchenyu commented on PR #44512:
URL: https://github.com/apache/spark/pull/44512#issuecomment-1963155578


   @waitinfuture 
   (1) Performance difference
   The experimental data are recorded in 
https://issues.apache.org/jira/browse/SPARK-46512. The size of the improvement 
depends on the experimental environment. In my experiments, the relevant time 
was reduced from 75 seconds to 29 seconds.
   
   (2) Difference in the total number of spills
   The reduction in the number of spills is clear. 
   Before this PR, we combine the unsorted records and then sort them. To combine 
the unsorted records we use ExternalAppendOnlyMap, which may spill for large 
data. Then, when we sort, we may spill again for large data.
   After this PR, sorting places records with the same key next to each other, 
so we no longer need ExternalAppendOnlyMap to combine them.
   As a result, this PR avoids the spills that the combine step previously 
incurred.
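
   To illustrate point (2), here is a minimal Python sketch of combining records 
that are already sorted by key. It is not the code from this PR: the name 
`combine_sorted` and its parameters are hypothetical, and the in-memory 
`sorted` call only stands in for Spark's external sorter. Because equal keys 
are adjacent, a single accumulator is enough and no hash map is needed.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")
C = TypeVar("C")


def combine_sorted(
    records: Iterable[Tuple[K, V]],
    create_combiner: Callable[[V], C],
    merge_value: Callable[[C, V], C],
) -> Iterator[Tuple[K, C]]:
    """Combine values of key-sorted records in one streaming pass.

    Hypothetical helper: assumes `records` is already sorted by key, so all
    records with the same key are adjacent.
    """
    it = iter(records)
    try:
        current_key, first_value = next(it)
    except StopIteration:
        return  # empty input, nothing to emit
    acc = create_combiner(first_value)
    for key, value in it:
        if key == current_key:
            # Same key as the previous record: fold into the accumulator.
            acc = merge_value(acc, value)
        else:
            # Key changed: the previous group is complete, emit it.
            yield current_key, acc
            current_key, acc = key, create_combiner(value)
    yield current_key, acc  # emit the last group


# Usage: sort first (stand-in for the external sort), then combine by summing.
unsorted = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
combined = list(combine_sorted(sorted(unsorted), lambda v: v, lambda c, v: c + v))
print(combined)
```

   Only one accumulator is held at a time, so the combine step itself has 
nothing to spill; any spilling is left to the sort.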

