[
https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130974#comment-14130974
]
Nicholas Chammas commented on SPARK-2045:
-----------------------------------------
The [1.1.0 release
notes|http://spark.eu.apache.org/releases/spark-release-1-1-0.html] call out
this change:
{quote}
"This “sort-based shuffle” will become the default in the next release, and
is now available to users. For jobs with large numbers of reducers, we
recommend turning this on."
{quote}
Is turning this on just a matter of setting {{spark.shuffle.manager}} to
{{SORT}}?
> Sort-based shuffle implementation
> ---------------------------------
>
> Key: SPARK-2045
> URL: https://issues.apache.org/jira/browse/SPARK-2045
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle, Spark Core
> Reporter: Matei Zaharia
> Assignee: Matei Zaharia
> Fix For: 1.1.0
>
> Attachments: Sort-basedshuffledesign.pdf
>
>
> Building on the pluggability in SPARK-2044, a sort-based shuffle
> implementation that takes advantage of an Ordering for keys (or just sorts by
> hashcode for keys that don't have it) would likely improve performance and
> memory usage in very large shuffles. Our current hash-based shuffle needs an
> open file for each reduce task, which can fill up a lot of memory for
> compression buffers and cause inefficient IO. This would avoid both of those
> issues.
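
The file-count argument in the description above can be illustrated with a toy model. This is only a sketch, not Spark's implementation: the function names, the integer-key partitioner, and the in-memory lists standing in for files are all hypothetical. It also assumes, beyond what the description states, that a sort-based map task writes a single output ordered by reduce partition together with an offset index.

```python
from collections import defaultdict


def partition_of(key, num_reducers):
    # Toy partitioner (assumption): integer keys, modulo the reducer count.
    return key % num_reducers


def hash_shuffle_write(records, num_reducers):
    """Hash-style write: one output "file" per reduce partition,
    all of them open for the whole duration of the map task."""
    files = defaultdict(list)
    for key, value in records:
        files[partition_of(key, num_reducers)].append((key, value))
    # Every reducer gets a file, even if it ends up empty.
    return [files[r] for r in range(num_reducers)]


def sort_shuffle_write(records, num_reducers):
    """Sort-style write: one output "file" ordered by partition id (then key),
    plus an index where index[r]:index[r + 1] is the slice for reducer r."""
    ordered = sorted(
        records,
        key=lambda kv: (partition_of(kv[0], num_reducers), kv[0]),
    )
    index = [0] * (num_reducers + 1)
    for key, _ in ordered:
        index[partition_of(key, num_reducers) + 1] += 1
    for r in range(num_reducers):
        index[r + 1] += index[r]  # turn per-partition counts into offsets
    return ordered, index


def read_partition(data, index, r):
    # A reducer reads only its own slice of the single map output.
    return data[index[r]:index[r + 1]]


if __name__ == "__main__":
    records = [(5, "a"), (1, "b"), (6, "c"), (4, "d")]
    hash_files = hash_shuffle_write(records, 3)
    data, index = sort_shuffle_write(records, 3)
    print("hash-style outputs per map task:", len(hash_files))
    print("sort-style outputs per map task:", 1, "with index", index)
```

In this model the hash-style writer holds one buffer per reducer, so its open-file count grows with the number of reduce tasks, while the sort-style writer keeps a single output regardless of that number. That is the property the description expects to help with compression-buffer memory and IO in very large shuffles.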