[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099136#comment-14099136
 ] 

Patrick Wendell commented on SPARK-2044:
----------------------------------------

A lot of this has been fixed in 1.1 so I moved target version to 1.2. [~matei] 
we can also close this with fixVersion=1.1.0 if you consider the initial issue 
fixed.

> Pluggable interface for shuffles
> --------------------------------
>
>                 Key: SPARK-2044
>                 URL: https://issues.apache.org/jira/browse/SPARK-2044
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>         Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to