[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063009#comment-14063009 ]

Matei Zaharia commented on SPARK-2045:
--------------------------------------

Right now I'm thinking it would happen in an ExternalAppendOnlyMap before
we feed values into the SortedFileWriter. That's not ideal, though; it would
be better to do it in the SortedFileWriter as we merge files, which would
require it to know how to compare keys (or to fall back to hash codes when
no ordering exists, as the ExternalAppendOnlyMap does). I'll probably start
with the first approach and then see how complex it is to do it the other
way. I think this would still improve performance over the current setup.

I'll try to post some code soon so that some of these improvements can be done 
in parallel.
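To make the comment above concrete, here is a minimal, hypothetical sketch (not Spark's actual API; the names SortedMergeSketch, keyComparator, and mergeSorted are illustrative) of the two pieces the merge step needs: a key comparator that uses a real ordering when one exists and falls back to hash codes otherwise, and a k-way merge of pre-sorted runs, analogous to merging sorted spill files into one output stream.

```java
import java.util.*;

public class SortedMergeSketch {
    // Use the supplied ordering when the key type has one; otherwise
    // fall back to comparing hash codes (as ExternalAppendOnlyMap does).
    // Hash order groups equal keys together even though it is not a
    // semantic sort.
    static <K> Comparator<K> keyComparator(Comparator<K> ordering) {
        return ordering != null
            ? ordering
            : Comparator.comparingInt(Object::hashCode);
    }

    // K-way merge of already-sorted runs via a min-heap of cursors,
    // the way merging sorted spill files would work. Each heap entry
    // is {runIndex, positionInRun}, ordered by the cursor's current key.
    static <K> List<K> mergeSorted(List<List<K>> runs, Comparator<K> cmp) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> cmp.compare(runs.get(a[0]).get(a[1]),
                                  runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<K> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<K> run = runs.get(top[0]);
            out.add(run.get(top[1]));
            // Re-insert the cursor if its run has more elements.
            if (top[1] + 1 < run.size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }
}
```

Passing null as the ordering exercises the hash-code fallback path; passing a natural-order comparator gives a true sorted merge.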

> Sort-based shuffle implementation
> ---------------------------------
>
>                 Key: SPARK-2045
>                 URL: https://issues.apache.org/jira/browse/SPARK-2045
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Matei Zaharia
>         Attachments: Sort-basedshuffledesign.pdf
>
>
> Building on the pluggability in SPARK-2044, a sort-based shuffle 
> implementation that takes advantage of an Ordering for keys (or just sorts by 
> hashcode for keys that don't have it) would likely improve performance and 
> memory usage in very large shuffles. Our current hash-based shuffle needs an 
> open file for each reduce task, which can fill up a lot of memory for 
> compression buffers and cause inefficient IO. This would avoid both of those 
> issues.
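The core idea in the description above can be sketched as follows. This is an illustration, not Spark's implementation (the names SortShuffleIdea, partitionOf, and sortForShuffle are hypothetical): instead of keeping one open file per reduce task, tag each record with its target partition, sort records by (partition, key hash code), and write a single file in which each partition's records are contiguous.

```java
import java.util.*;

public class SortShuffleIdea {
    // Target reduce partition for a key: non-negative modulo of its
    // hash code, as a hash partitioner would compute it.
    static int partitionOf(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Order records so that each partition's data is contiguous;
    // within a partition, order by key hash code (the fallback used
    // when the key type has no Ordering). The result can be streamed
    // to one output file instead of one file per reduce task.
    static <K, V> List<Map.Entry<K, V>> sortForShuffle(
            List<Map.Entry<K, V>> records, int numPartitions) {
        List<Map.Entry<K, V>> out = new ArrayList<>(records);
        out.sort(Comparator
            .<Map.Entry<K, V>>comparingInt(
                e -> partitionOf(e.getKey(), numPartitions))
            .thenComparingInt(e -> e.getKey().hashCode()));
        return out;
    }
}
```

With the records grouped this way, the writer only ever needs one open file and one compression buffer, which is exactly the memory and IO saving the description points at.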



--
This message was sent by Atlassian JIRA
(v6.2#6252)