GitHub user mateiz opened a pull request:
https://github.com/apache/spark/pull/1499
(WIP) SPARK-2045 Sort-based shuffle
This adds a new ShuffleManager based on sorting, as described in
https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an
ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts
key-value pairs by partition ID and can be used to create a single sorted file
with a map task's output. (Longer-term I think this can take on the remaining
functionality in ExternalAppendOnlyMap and replace it so we don't have code
duplication.)
The main TODOs still left are enabling ExternalSorter to merge *across*
spilled files, though this may be of limited utility (if the data is highly
combinable this will happen early on), and adding more tests (e.g. a version of
our shuffle suite that runs on this).
Despite this though, this seems to work pretty well (running successfully
in cases where the hash shuffle would OOM, such as 1000 reduce tasks on
executors with only 1G memory), and it seems to be comparable in speed or
faster than hash-based shuffle (it will create much fewer files for the OS to
keep track of). So I'm posting it to get some early feedback.
After these TODOs are done, I'd also like to enable ExternalSorter to sort
data within each partition by a key as well, which will allow us to use it to
implement external spilling in reduce tasks in `sortByKey`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mateiz/spark sort-based-shuffle
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1499.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1499
----
commit cf014cb3ef394f8b34c717a2ab8c92f95edf6269
Author: Matei Zaharia <[email protected]>
Date: 2014-07-17T02:13:54Z
Scaffolding for sort-based shuffle
commit c1279c4b43121ad47d4e0c0cd46ba978fc29c7de
Author: Matei Zaharia <[email protected]>
Date: 2014-07-18T02:07:41Z
Some more partial work towards sort-based shuffle
commit 784e85a23c8bdea821dbbd2aa43f7e202ac39fe3
Author: Matei Zaharia <[email protected]>
Date: 2014-07-18T07:26:49Z
More partial work towards sort-based shuffle
commit 727685bd23fa83df18f471ece5ded657f8656f81
Author: Matei Zaharia <[email protected]>
Date: 2014-07-19T04:18:17Z
More work
commit abf052dbaafaea604f9b807e8ec5c38bd8e354fa
Author: Matei Zaharia <[email protected]>
Date: 2014-07-19T08:35:54Z
Add more error handling and tests for error cases
commit 066553bc99a389556eb072161fe5acc43dd73262
Author: Matei Zaharia <[email protected]>
Date: 2014-07-20T06:40:40Z
Add spill metrics to map tasks
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---