Matei Zaharia created SPARK-2044:
------------------------------------
Summary: Pluggable interface for shuffles
Key: SPARK-2044
URL: https://issues.apache.org/jira/browse/SPARK-2044
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Attachments: Pluggableshuffleproposal.pdf

Given that a lot of the current activity in Spark Core is in shuffles, I wanted
to propose factoring out shuffle implementations in a way that will make
experimentation easier. Ideally we will converge on one implementation, but for
a while, this could also be used to have several implementations coexist. I'm
suggesting this because I'm aware of at least three efforts to look at shuffle
(from Yahoo!, Intel and Databricks). Some of the things people are
investigating are:
* Push-based shuffle where data moves directly from mappers to reducers
* Sorting-based instead of hash-based shuffle, to create fewer files (this helps a
lot with file handles and memory usage on large shuffles; see the rough numbers
sketched after this list)
* External spilling within a key
* Changing the level of parallelism or even the algorithm for downstream stages at
runtime based on statistics of the map output (something we prototyped in the
Shark research project but never merged into core)
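
To make the file-handle point for the sort-based idea concrete, here is some
back-of-the-envelope arithmetic (the task counts are made up, and this ignores
shuffle file consolidation): a hash-based shuffle writes one file per map task
per reduce partition, while a sort-based shuffle writes a single sorted file per
map task.

{code:scala}
// Illustrative only: rough file counts for the two approaches.
object ShuffleFileCounts {
  def main(args: Array[String]): Unit = {
    val numMapTasks    = 1000
    val numReduceTasks = 1000

    // Hash-based shuffle: each map task opens one output file per reduce partition.
    val hashShuffleFiles = numMapTasks.toLong * numReduceTasks   // 1,000,000 files

    // Sort-based shuffle: each map task writes one sorted data file
    // (plus a small index file), independent of the number of reducers.
    val sortShuffleFiles = numMapTasks.toLong                    // 1,000 files

    println(s"hash-based: $hashShuffleFiles files")
    println(s"sort-based: $sortShuffleFiles files")
  }
}
{code}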
I've attached a design doc with a proposed interface. It's not too crazy
because the interface between shuffles and the rest of the code is already
pretty narrow (just some iterators for reading data and a writer interface for
writing it). Bigger changes will be needed in the interaction with DAGScheduler
and BlockManager for some of the ideas above, but we can handle those
separately, and this interface will allow us to experiment with some short-term
stuff sooner.
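
To give a rough idea of what that narrow interface could look like without
opening the PDF, here is a sketch in Scala. All trait names and signatures
below are placeholders for illustration; the actual proposed API is in the
attached doc.

{code:scala}
// Sketch only -- names and signatures are illustrative, not the proposed API.

// Opaque token handed out at registration time and passed back to obtain
// readers and writers for a particular shuffle.
trait ShuffleHandle {
  def shuffleId: Int
}

// Used by a map task to write its output for one shuffle.
trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit
  def stop(success: Boolean): Unit
}

// Used by a reduce task to read (and possibly aggregate) map output
// for its range of partitions.
trait ShuffleReader[K, C] {
  def read(): Iterator[(K, C)]
}

// The pluggable entry point: a deployment would pick one implementation
// (hash-based, sort-based, push-based, ...) via configuration.
trait ShuffleManager {
  def registerShuffle(shuffleId: Int, numMaps: Int, numReduces: Int): ShuffleHandle
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, C](handle: ShuffleHandle, startPartition: Int,
                      endPartition: Int): ShuffleReader[K, C]
  def unregisterShuffle(shuffleId: Int): Unit
}
{code}

Under this kind of shape, a sort-based or push-based shuffle would just be
another implementation selected by a config key, and the bigger DAGScheduler
and BlockManager changes could land separately as described above.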
If things go well, I'd also like to submit a sort-based shuffle implementation for
1.1, but we'll see how the timing on that works out.