GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/1499

    (WIP) SPARK-2045 Sort-based shuffle

    This adds a new ShuffleManager based on sorting, as described in 
https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an 
ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts 
key-value pairs by partition ID and can be used to create a single sorted file 
with a map task's output. (Longer-term I think this can take on the remaining 
functionality in ExternalAppendOnlyMap and replace it so we don't have code 
duplication.)
    
    The main TODOs still left are enabling ExternalSorter to merge *across* 
spilled files, though this may be of limited utility (if the data is highly 
combinable this will happen early on), and adding more tests (e.g. a version of 
our shuffle suite that runs on this).
    
    Despite this though, this seems to work pretty well (running successfully 
in cases where the hash shuffle would OOM, such as 1000 reduce tasks on 
executors with only 1G memory), and it seems to be comparable in speed or 
faster than hash-based shuffle (it will create much fewer files for the OS to 
keep track of). So I'm posting it to get some early feedback.
    
    After these TODOs are done, I'd also like to enable ExternalSorter to sort 
data within each partition by a key as well, which will allow us to use it to 
implement external spilling in reduce tasks in `sortByKey`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark sort-based-shuffle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1499.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1499
    
----
commit cf014cb3ef394f8b34c717a2ab8c92f95edf6269
Author: Matei Zaharia <[email protected]>
Date:   2014-07-17T02:13:54Z

    Scaffolding for sort-based shuffle

commit c1279c4b43121ad47d4e0c0cd46ba978fc29c7de
Author: Matei Zaharia <[email protected]>
Date:   2014-07-18T02:07:41Z

    Some more partial work towards sort-based shuffle

commit 784e85a23c8bdea821dbbd2aa43f7e202ac39fe3
Author: Matei Zaharia <[email protected]>
Date:   2014-07-18T07:26:49Z

    More partial work towards sort-based shuffle

commit 727685bd23fa83df18f471ece5ded657f8656f81
Author: Matei Zaharia <[email protected]>
Date:   2014-07-19T04:18:17Z

    More work

commit abf052dbaafaea604f9b807e8ec5c38bd8e354fa
Author: Matei Zaharia <[email protected]>
Date:   2014-07-19T08:35:54Z

    Add more error handling and tests for error cases

commit 066553bc99a389556eb072161fe5acc43dd73262
Author: Matei Zaharia <[email protected]>
Date:   2014-07-20T06:40:40Z

    Add spill metrics to map tasks

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to