Re: [PR] [SPARK-46512][CORE] Optimize shuffle reading when both sort and combine are used. [spark]

via GitHub Sun, 25 Feb 2024 17:40:07 -0800


waitinfuture commented on PR #44512:
URL: https://github.com/apache/spark/pull/44512#issuecomment-1963172789


   > @waitinfuture (1) Performance differs: experimental data are recorded in 
https://issues.apache.org/jira/browse/SPARK-46512. How much performance 
improves depends on the experimental environment. In my experiments, the 
correlation time was reduced from 75 seconds to 29 seconds.
   > 
   > (2) Total number of spills differs: the reduction in the number of spills 
is obvious. Before this PR, we combine the unsorted records, then sort. To 
combine the unsorted records, we use ExternalAppendOnlyMap, which may spill for 
large data. Then the sort may spill again for large data. After this PR, the 
sort already places records with the same key next to each other, so we no 
longer have to use ExternalAppendOnlyMap to combine. This saves the spilling 
that the combine step previously required.
   
   Got it, thanks!
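
   For readers following the thread, the reordering described in (2) can be sketched in a few lines of Python. This is only an illustration, not Spark's code: the function names, the in-memory `dict` standing in for ExternalAppendOnlyMap, and the built-in `sorted` standing in for the external sorter are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def combine_then_sort(records, merge):
    """Old order (hypothetical sketch): hash-combine unsorted records, then sort."""
    combined = {}  # stands in for ExternalAppendOnlyMap, which may spill
    for key, value in records:
        combined[key] = merge(combined[key], value) if key in combined else value
    # Second pass: sort the combined output by key (may spill again in Spark).
    return sorted(combined.items(), key=itemgetter(0))


def sort_then_combine(records, merge):
    """New order (hypothetical sketch): sort first, then merge adjacent equal keys."""
    ordered = sorted(records, key=itemgetter(0))  # stands in for the external sorter
    result = []
    # After sorting, equal keys are adjacent, so one streaming pass is enough
    # and no hash map is needed for the combine step.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = (v for _, v in group)
        acc = next(values)
        for v in values:
            acc = merge(acc, v)
        result.append((key, acc))
    return result


records = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
add = lambda x, y: x + y
assert combine_then_sort(records, add) == sort_then_combine(records, add)
```

   Both functions return the same sorted, combined pairs; the second one simply avoids the separate map-based combine phase, which is where the extra spilling came from according to the explanation above.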


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

