GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/6397

    [SPARK-7855] [WIP] Move bypassMergeSort-handling from ExternalSorter to own 
component

    Spark's `ExternalSorter` writes shuffle output files during sort-based 
shuffle. Sort-based shuffle has a configuration option, 
`spark.shuffle.sort.bypassMergeThreshold`, which causes `ExternalSorter` to skip 
sorting and merging and simply write a separate file per partition; these files 
are then concatenated to form the final map output file.
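
To make the bypass path concrete, here is a minimal, self-contained sketch of the "one file per partition, then concatenate" idea. It is not Spark's code: the class `BypassMergeSketch`, the methods `partitionOf` and `writeBypass`, the newline-delimited record format, and the hash-modulo partitioner are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative only: hypothetical names, not part of Spark.
final class BypassMergeSketch {

    // Assumed partitioner: non-negative hash of the key modulo the partition count.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Writes each record straight to its partition's temporary file (no sorting,
    // no in-memory buffering of records), then concatenates the temporary files
    // in partition order into `output`. Returns the byte length of each partition.
    static long[] writeBypass(List<String> keys, int numPartitions, Path output)
            throws IOException {
        Path[] partFiles = new Path[numPartitions];
        OutputStream[] writers = new OutputStream[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            partFiles[p] = Files.createTempFile("bypass-part-" + p + "-", ".tmp");
            writers[p] = Files.newOutputStream(partFiles[p]);
        }
        try {
            for (String key : keys) {
                byte[] record = (key + "\n").getBytes(StandardCharsets.UTF_8);
                writers[partitionOf(key, numPartitions)].write(record);
            }
        } finally {
            for (OutputStream w : writers) {
                w.close();
            }
        }

        // Concatenation step: append every per-partition file to the single output.
        long[] lengths = new long[numPartitions];
        try (OutputStream out = Files.newOutputStream(output)) {
            for (int p = 0; p < numPartitions; p++) {
                lengths[p] = Files.copy(partFiles[p], out);
                Files.delete(partFiles[p]);
            }
        }
        return lengths;
    }
}
```

In this sketch, calling `writeBypass` with the keys `a`, `b`, `c`, `d` and two partitions produces an output file in which partition 0's records (`b`, `d`) precede partition 1's (`a`, `c`), with records inside each partition left in arrival order. The returned lengths are the kind of information a reader would need to locate each partition inside the concatenated file.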
    
    The code paths used during this bypass are almost completely separate from 
ExternalSorter's other code paths, so refactoring them into a separate file can 
significantly simplify the code.
    
    In addition to rearranging code, this patch deletes hundreds of lines of 
dead code. The main entry point into `ExternalSorter` is `insertAll()`; in 
SPARK-4479 / #3422, this method was modified to completely bypass in-memory 
buffering of records when `bypassMergeSort` takes effect. As a result, the 
spilling / merging code paths are never called when `bypassMergeSort` 
is used, so we should be able to safely remove that code.
    
    There's an open JIRA 
([SPARK-6026](https://issues.apache.org/jira/browse/SPARK-6026)) for removing 
the `bypassMergeThreshold` parameter and code paths; I have not done that here, 
but the changes in this patch will make removing that parameter significantly 
easier if we ever decide to do that.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark external-sorter-bypass-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6397.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6397
    
----
commit 18959bb385d499271fc0495816578f5c767fa07c
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T05:18:11Z

    Move comparator methods closer together.

commit 19bccd6a172bec8da747f2de5f78a4af8be488d1
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T05:29:44Z

    Remove duplicated buffer creation code.

commit 8d0678c2c42feb94419e02042d8272df58513b20
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T05:32:00Z

    Move diskBytesSpilled getter next to variable

commit 6185ee2db1d0d4a2103da739746003186b876721
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T08:02:50Z

    WIP towards moving bypass code into own file.

commit b6cc1ebe63ada7a557fd1b5129481f30b6d3afc8
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T08:31:07Z

    Realize that bypass never buffers; proceed to delete tons of code

commit bb9667876b0b5aa9f43ef871e0a3cb2edb8e3f8e
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T08:34:24Z

    Add missing interface file

commit d4cb536ce8e2cc269413c10442f337eb21e6807b
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T08:37:08Z

    Delete more unused code

commit 02355efd009ac2d667d49ab2f2e758447b463196
Author: Josh Rosen <[email protected]>
Date:   2015-05-25T08:49:11Z

    More simplification

----


