[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099136#comment-14099136 ]
Patrick Wendell commented on SPARK-2044: ---------------------------------------- A lot of this has been fixed in 1.1 so I moved target version to 1.2. [~matei] we can also close this with fixVersion=1.1.0 if you consider the initial issue fixed. > Pluggable interface for shuffles > -------------------------------- > > Key: SPARK-2044 > URL: https://issues.apache.org/jira/browse/SPARK-2044 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core > Reporter: Matei Zaharia > Assignee: Matei Zaharia > Attachments: Pluggableshuffleproposal.pdf > > > Given that a lot of the current activity in Spark Core is in shuffles, I > wanted to propose factoring out shuffle implementations in a way that will > make experimentation easier. Ideally we will converge on one implementation, > but for a while, this could also be used to have several implementations > coexist. I'm suggesting this because I aware of at least three efforts to > look at shuffle (from Yahoo!, Intel and Databricks). Some of the things > people are investigating are: > * Push-based shuffle where data moves directly from mappers to reducers > * Sorting-based instead of hash-based shuffle, to create fewer files (helps a > lot with file handles and memory usage on large shuffles) > * External spilling within a key > * Changing the level of parallelism or even algorithm for downstream stages > at runtime based on statistics of the map output (this is a thing we had > prototyped in the Shark research project but never merged in core) > I've attached a design doc with a proposed interface. It's not too crazy > because the interface between shuffles and the rest of the code is already > pretty narrow (just some iterators for reading data and a writer interface > for writing it). Bigger changes will be needed in the interaction with > DAGScheduler and BlockManager for some of the ideas above, but we can handle > those separately, and this interface will allow us to experiment with some > short-term stuff sooner. > If things go well I'd also like to send a sort-based shuffle implementation > for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org