[
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123390#comment-14123390
]
Andrew Ash commented on SPARK-3280:
-----------------------------------
[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and
3200 partitions in your chart? My reading is that both shuffle implementations
behave similarly in this scenario up to ~1600 partitions, after which the
hash-based one starts falling behind, and then there's another step change at
3200 where it hits a severe dropoff. I'm most interested in the right third of
the chart.
A couple of theories:
- more partitions = more data in memory concurrently = GC pressure. The
sort-based implementation can stream records and merge-sort its spills, but the
hash-based one has to build its hash table all at once and then spill it.
- more partitions = more concurrent spills = disk thrashing from writing to
lots of files at once, exacerbated if the test was on spinning disks rather
than SSDs. Maybe the sort-based implementation merges spills as it writes, so
it ends up with fewer spill files open concurrently. A back-of-the-envelope
sketch of the file counts follows this list.
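To make the second theory concrete, here's a rough sketch; the node sizing
below is my own assumption, not taken from your benchmark setup. Without
consolidation, the hash-based writer keeps one open file and write buffer per
reduce partition per running map task, while the sort-based writer produces one
sorted data file plus an index file per map task:
{code:scala}
// Rough per-node file and buffer-memory counts; all numbers are assumed.
val concurrentMapTasks = 16      // map tasks running at once on one node
val reducePartitions   = 3200    // right edge of the chart
val writeBufferKB      = 32      // spark.shuffle.file.buffer.kb default

// Hash-based: one open file + write buffer per reduce partition per map task.
val hashOpenFiles = concurrentMapTasks * reducePartitions        // 51,200
val hashBufferMB  = hashOpenFiles * writeBufferKB / 1024         // ~1,600 MB

// Sort-based: one sorted data file + one index file per map task.
val sortOpenFiles = concurrentMapTasks * 2                       // 32

println(s"hash: $hashOpenFiles files, ~$hashBufferMB MB of write buffers")
println(s"sort: $sortOpenFiles files")
{code}
At 3200 partitions that's tens of thousands of concurrently open files and on
the order of a GB of write buffers per node on the hash-based path, which
would be consistent with both the GC pressure and the disk thrashing theories.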
Also, the chart is a little unclear: is the y-axis time in seconds?
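In case it's useful, here's a minimal sketch of how I'd reproduce the
comparison. The dataset shape, sizes, and partition counts here are made up;
spark.shuffle.manager is the setting that selects the implementation in 1.1+:
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical micro-benchmark: times the same shuffle job under each
// shuffle manager and partition count.
object ShuffleComparison {
  def timeJob(manager: String, numPartitions: Int): Double = {
    val conf = new SparkConf()
      .setAppName(s"shuffle-$manager-$numPartitions")
      .set("spark.shuffle.manager", manager) // "hash" or "sort"
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(0L until 100000000L, numPartitions)
        .map(i => (i % 1000000L, 1L))
      val start = System.nanoTime()
      pairs.reduceByKey(_ + _, numPartitions).count() // forces the shuffle
      (System.nanoTime() - start) / 1e9
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    for (p <- Seq(400, 800, 1600, 2800, 3200); m <- Seq("hash", "sort"))
      println(f"$m%-4s @ $p%4d partitions: ${timeJob(m, p)}%6.1f s")
  }
}
{code}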
> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Attachments: hash-sort-comp.png
>
>
> Sort-based shuffle has lower memory usage and seems to outperform hash-based
> shuffle in almost all of our testing.