[
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099136#comment-14099136
]
Patrick Wendell commented on SPARK-2044:
----------------------------------------
A lot of this has been fixed in 1.1 so I moved target version to 1.2. [~matei]
we can also close this with fixVersion=1.1.0 if you consider the initial issue
fixed.
> Pluggable interface for shuffles
> --------------------------------
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Matei Zaharia
> Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I
> wanted to propose factoring out shuffle implementations in a way that will
> make experimentation easier. Ideally we will converge on one implementation,
> but for a while, this could also be used to have several implementations
> coexist. I'm suggesting this because I aware of at least three efforts to
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages
> at runtime based on statistics of the map output (this is a thing we had
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy
> because the interface between shuffles and the rest of the code is already
> pretty narrow (just some iterators for reading data and a writer interface
> for writing it). Bigger changes will be needed in the interaction with
> DAGScheduler and BlockManager for some of the ideas above, but we can handle
> those separately, and this interface will allow us to experiment with some
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation
> for 1.1, but we'll see how the timing on that works out.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]