[ 
https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019427#comment-14019427
 ] 

Mridul Muralidharan commented on SPARK-2045:
--------------------------------------------

The plan Tom and I had was to see whether we could modify and adopt Hadoop's 
shuffle to provide this functionality. Secondary sort was an interesting side 
effect we might get, but it was not the primary goal.

We were also planning to investigate whether we could use MapReduce's approach 
to storing shuffle output.
This became lower priority as the 2G fix proceeded (since that indirectly 
alleviated the impact of the current design), but it would still be useful to 
investigate.

> Sort-based shuffle implementation
> ---------------------------------
>
>                 Key: SPARK-2045
>                 URL: https://issues.apache.org/jira/browse/SPARK-2045
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Matei Zaharia
>
> Building on the pluggability in SPARK-2044, a sort-based shuffle 
> implementation that takes advantage of an Ordering for keys (or just sorts by 
> hashcode for keys that don't have it) would likely improve performance and 
> memory usage in very large shuffles. Our current hash-based shuffle needs an 
> open file for each reduce task, which can fill up a lot of memory for 
> compression buffers and cause inefficient IO. This would avoid both of those 
> issues.
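
To make the contrast in the description above concrete, here is a minimal
in-memory sketch, not Spark's or Hadoop's actual implementation. The function
names, the list-based "files", and the toy partition_of partitioner are all
assumptions made for illustration. The hash-based writer keeps one output
bucket per reduce task, while the sort-based writer produces a single output
ordered by (partition, key) plus an offset index. Keys here are assumed to
have a natural ordering; the hashcode fallback mentioned in the description
is not shown.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]


def partition_of(key: str, num_reducers: int) -> int:
    # Hypothetical deterministic partitioner (byte sum modulo reducer count).
    return sum(key.encode("utf-8")) % num_reducers


def hash_shuffle_write(records: List[Record], num_reducers: int) -> Dict[int, List[Record]]:
    # Hash-based: one bucket per reduce task. Each bucket stands in for an
    # open file with its own compression buffer.
    buckets: Dict[int, List[Record]] = {r: [] for r in range(num_reducers)}
    for key, value in records:
        buckets[partition_of(key, num_reducers)].append((key, value))
    return buckets


def sort_shuffle_write(records: List[Record], num_reducers: int) -> Tuple[List[Record], List[int]]:
    # Sort-based: a single output sorted by (partition, key), plus an index
    # where offsets[r]:offsets[r + 1] is the slice belonging to reducer r.
    ordered = sorted(records, key=lambda kv: (partition_of(kv[0], num_reducers), kv[0]))
    offsets = [0] * (num_reducers + 1)
    for key, _ in ordered:
        offsets[partition_of(key, num_reducers) + 1] += 1
    for r in range(num_reducers):
        offsets[r + 1] += offsets[r]
    return ordered, offsets


def read_partition(ordered: List[Record], offsets: List[int], reducer: int) -> List[Record]:
    # A reducer fetches only its contiguous slice of the single sorted output.
    return ordered[offsets[reducer]:offsets[reducer + 1]]
```

In this sketch the number of simultaneously open outputs per map task is one
rather than one per reduce task, and each reducer's slice arrives already
sorted by key, which is what makes secondary sort a possible side effect.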



--
This message was sent by Atlassian JIRA
(v6.2#6252)