[ 
https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2045:
---------------------------------

    Attachment: Sort-basedshuffledesign.pdf

I've posted a design doc for a simple version of this. This is based on input 
from Sandy Ryza as well, who had been looking at a prototype by Saisai Shao, so 
I know several people had been thinking about this. I'd like to get this simple 
version into 1.1, and then there's room for more optimizations on top (pointed 
out in the doc).

Note that our shuffle is a bit different from MapReduce's for a few reasons, 
e.g. we don't require all objects to be Writable, so we can't make as many 
assumptions about the serializer, and many of our operators don't need sorted 
data. This design aims to deal with that while solving the main problem with 
hash-based shuffle (lots of open files and buffers). In the future we can add 
options for sorting the data as it goes along.

> Sort-based shuffle implementation
> ---------------------------------
>
>                 Key: SPARK-2045
>                 URL: https://issues.apache.org/jira/browse/SPARK-2045
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Matei Zaharia
>
> Building on the pluggability in SPARK-2044, a sort-based shuffle 
> implementation that takes advantage of an Ordering for keys (or just sorts by 
> hashcode for keys that don't have it) would likely improve performance and 
> memory usage in very large shuffles. Our current hash-based shuffle needs an 
> open file for each reduce task, which can fill up a lot of memory for 
> compression buffers and cause inefficient IO. This would avoid both of those 
> issues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to