[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123390#comment-14123390 ]

Andrew Ash commented on SPARK-3280:
-----------------------------------

[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 
3200 partitions in your chart?  My reading is that both shuffle implementations 
behave similarly up to ~1600 partitions, after which the hash-based one starts 
falling behind, and then there's another step change at 3200 where it hits a 
severe dropoff.  I'm mostly interested in the right third of the chart.

A couple of theories:
- More partitions = more stuff in memory concurrently = GC pressure.  
Sort-based can stream records through a merge sort, but hash-based needs to 
build the whole hash table at once before spilling it.
- More partitions = more concurrent spills = disk thrashing from writing to 
lots of files at once, exacerbated if the test was on spinning disks instead of 
SSDs.  Maybe sort-based merges its spills while writing to disk and so ends up 
writing fewer spill files concurrently.
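To make the second theory concrete, here is a toy model (plain Python, not Spark code; the function names and file counts are illustrative assumptions, not Spark internals) of why the number of concurrent buffers/files scales with the partition count for a hash-based shuffle but stays constant for a sort-based one:

```python
# Toy model: hash-based shuffle keeps one buffer (and eventually one
# output file) per reduce partition, so its concurrent-resource count
# grows with num_partitions.  Sort-based tags each record with its
# partition id, sorts once, and writes a single partition-ordered file.

def hash_shuffle(records, num_partitions):
    # One in-memory buffer per reduce partition, all live at once.
    buffers = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buffers[hash(key) % num_partitions].append((key, value))
    # Number of concurrent "open files" tracks num_partitions.
    return buffers, num_partitions

def sort_shuffle(records, num_partitions):
    # Tag, sort by partition id, emit one file (plus a small index).
    tagged = sorted(
        ((hash(k) % num_partitions, k, v) for k, v in records),
        key=lambda t: t[0],
    )
    return tagged, 1

records = [("k%d" % i, i) for i in range(1000)]
_, hash_files = hash_shuffle(records, 3200)
_, sort_files = sort_shuffle(records, 3200)
print(hash_files, sort_files)  # prints: 3200 1
```

At 3200 partitions the hash-based model is juggling 3200 buffers/files at once, which is where GC pressure and disk thrashing would both bite; the sort-based model's file count doesn't move.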

Also, the chart is a little unclear: is the y-axis time in seconds?

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
>                 Key: SPARK-3280
>                 URL: https://issues.apache.org/jira/browse/SPARK-3280
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>         Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
