[ 
https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063037#comment-14063037
 ] 

Saisai Shao commented on SPARK-2045:
------------------------------------

IMHO, in this situation we would have to manage two buffers: one for 
ExternalAppendOnlyMap (EAOM) and another for SortedFileWriter. 

If we use EAOM, we can set a custom comparator that first compares 
partitions and then, if necessary, compares keys (or the hashcodes of the 
keys). The EAOM iterator is then sorted by partition and, if necessary, by 
key, so its output can be written to a single file without SortedFileWriter.
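
To illustrate the comparator idea, here is a rough sketch in Python (not 
Spark's Scala, and not the actual EAOM API). The names 
partition_key_comparator and sorted_records are hypothetical, and a plain 
dict keyed by (partition_id, key) stands in for the in-memory map:

```python
from functools import cmp_to_key


def partition_key_comparator(key_ordering=None):
    """Build a comparator over (partition_id, key) pairs.

    Partition ids are compared first. Within a partition, keys are compared
    with key_ordering when one is supplied, otherwise by hash code.
    """
    def compare(a, b):
        (pa, ka), (pb, kb) = a, b
        if pa != pb:
            return -1 if pa < pb else 1
        if key_ordering is not None:
            return key_ordering(ka, kb)
        ha, hb = hash(ka), hash(kb)
        return (ha > hb) - (ha < hb)
    return compare


def sorted_records(buffer, key_ordering=None):
    """Return the buffer's entries sorted by partition, then by key.

    buffer is a dict mapping (partition_id, key) -> combined value; it is
    only a stand-in for the aggregation map discussed above.
    """
    cmp = partition_key_comparator(key_ordering)
    return sorted(buffer.items(),
                  key=cmp_to_key(lambda x, y: cmp(x[0], y[0])))
```

Because all records of one partition end up contiguous in the sorted 
output, they could be written sequentially to one file, with only the 
per-partition boundaries left to record.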

Alternatively, if we use SortedFileWriter, it would be better to aggregate 
during the merge step without EAOM, though that is a little more complex.
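
A rough sketch of aggregating during the merge, again in Python and not 
tied to any Spark class. The names merge_and_aggregate and merge_combiners 
are hypothetical, and the sketch assumes every sorted run is already ordered 
by (partition_id, key) with keys that have a total ordering. With 
hashcode-only ordering, distinct keys that share a hashcode would need 
extra handling that is not shown here:

```python
import heapq
from itertools import groupby


def merge_and_aggregate(sorted_runs, merge_combiners):
    """Merge sorted runs and combine values of equal (partition_id, key).

    sorted_runs: iterables of ((partition_id, key), value) records, each
    already sorted by (partition_id, key).
    merge_combiners: function combining two values for the same key.
    """
    # k-way merge keeps the global (partition_id, key) order
    merged = heapq.merge(*sorted_runs, key=lambda rec: rec[0])
    # equal keys are adjacent after the merge, so groupby can combine them
    for pk, group in groupby(merged, key=lambda rec: rec[0]):
        values = (value for _, value in group)
        combined = next(values)
        for value in values:
            combined = merge_combiners(combined, value)
        yield pk, combined
```

Only one record per run needs to be held in memory at a time, which is the 
appeal of this approach; the cost is that the combine logic has to live in 
the merge path.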

Sorry for any misunderstanding.

> Sort-based shuffle implementation
> ---------------------------------
>
>                 Key: SPARK-2045
>                 URL: https://issues.apache.org/jira/browse/SPARK-2045
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Matei Zaharia
>         Attachments: Sort-basedshuffledesign.pdf
>
>
> Building on the pluggability in SPARK-2044, a sort-based shuffle 
> implementation that takes advantage of an Ordering for keys (or just sorts by 
> hashcode for keys that don't have it) would likely improve performance and 
> memory usage in very large shuffles. Our current hash-based shuffle needs an 
> open file for each reduce task, which can fill up a lot of memory for 
> compression buffers and cause inefficient IO. This would avoid both of those 
> issues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
