GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/6397
[SPARK-7855] [WIP] Move bypassMergeSort-handling from ExternalSorter to own
component
Spark's `ExternalSorter` writes shuffle output files during sort-based
shuffle. Sort-based shuffle has a configuration option,
`spark.shuffle.sort.bypassMergeThreshold`, which causes `ExternalSorter` to
skip sorting and merging and instead write a separate file per partition; these
files are then concatenated to form the final map output file.
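To make the bypass path concrete, here is a minimal sketch of the "one file per partition, then concatenate" idea. It is not the code in this patch: `BypassMergeSketch`, `writePartitioned`, the `key=value` text encoding, and the hash-based partitioning are all assumptions made for illustration only.

```scala
import java.io.{File, FileOutputStream}
import java.nio.file.Files

// Hypothetical illustration only; not Spark's ExternalSorter or its bypass code.
object BypassMergeSketch {

  /** Writes each record straight to a per-partition temp file (no sorting, no
    * in-memory buffering of records), then concatenates the files in partition
    * order. Returns the byte length of each partition's segment in `output`. */
  def writePartitioned(
      records: Iterator[(Int, String)],
      numPartitions: Int,
      output: File): Array[Long] = {
    val tmpDir = Files.createTempDirectory("bypass-sketch").toFile
    val partFiles = Array.tabulate(numPartitions)(i => new File(tmpDir, s"part-$i"))
    val writers = partFiles.map(f => new FileOutputStream(f))
    try {
      records.foreach { case (key, value) =>
        // Assumed partitioner: non-negative hash of the key modulo numPartitions.
        val p = ((key.hashCode % numPartitions) + numPartitions) % numPartitions
        writers(p).write(s"$key=$value\n".getBytes("UTF-8"))
      }
    } finally {
      writers.foreach(_.close())
    }

    // Concatenate the per-partition files into the single map output file.
    val lengths = new Array[Long](numPartitions)
    val out = new FileOutputStream(output)
    try {
      for (i <- 0 until numPartitions) {
        lengths(i) = Files.copy(partFiles(i).toPath, out) // returns bytes copied
        partFiles(i).delete()
      }
    } finally {
      out.close()
    }
    tmpDir.delete()
    lengths
  }
}
```

The returned segment lengths play the role that per-partition lengths would play in locating each reducer's data inside the concatenated file; how this patch actually tracks them is not shown here.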
The code paths used during this bypass are almost completely separate from
ExternalSorter's other code paths, so refactoring them into a separate file can
significantly simplify the code.
In addition to re-arranging code, this patch deletes hundreds of lines of
dead code. The main entry point into ExternalSorter is `insertAll()` and in
SPARK-4479 / #3422 this method was modified to completely bypass in-memory
buffering of records when `bypassMergeSort` takes effect. As a result, the
spilling / merging code paths will no longer be called when `bypassMergeSort`
is used, so we should be able to safely remove that code.
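The reasoning above can be sketched as follows. This is a simplified stand-in, not `ExternalSorter` itself: `SketchSorter`, its in-memory "writers", and the spill trigger of two buffered records are hypothetical and chosen only to show that the bypass branch of `insertAll()` never reaches the buffering and spilling logic.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a sorter with a bypass flag; not Spark's ExternalSorter.
class SketchSorter(numPartitions: Int, bypassMergeSort: Boolean) {
  // Stand-ins for per-partition disk writers (kept in memory for the sketch).
  val partitionWriters: Array[ArrayBuffer[(Int, String)]] =
    Array.fill(numPartitions)(ArrayBuffer.empty[(Int, String)])
  // In-memory buffer used only by the non-bypass path.
  val buffer: ArrayBuffer[(Int, (Int, String))] = ArrayBuffer.empty
  var spillCount = 0

  private def partitionOf(key: Int): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions

  def insertAll(records: Iterator[(Int, String)]): Unit = {
    if (bypassMergeSort) {
      // Bypass: each record goes directly to its partition's writer.
      records.foreach { case (k, v) => partitionWriters(partitionOf(k)) += ((k, v)) }
    } else {
      // Normal path: buffer records and spill when the buffer grows too large.
      records.foreach { case (k, v) =>
        buffer += ((partitionOf(k), (k, v)))
        maybeSpill()
      }
    }
  }

  // Assumed spill policy for the sketch: spill once two records are buffered.
  private def maybeSpill(): Unit =
    if (buffer.size >= 2) { spillCount += 1; buffer.clear() }
}
```

With the flag set, `buffer` stays empty and `spillCount` stays at zero no matter how many records are inserted, which is the property that makes the spill and merge handling for the bypass case unreachable.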
There's an open JIRA
([SPARK-6026](https://issues.apache.org/jira/browse/SPARK-6026)) for removing
the `bypassMergeThreshold` parameter and code paths; I have not done that here,
but the changes in this patch will make removing that parameter significantly
easier if we ever decide to do that.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark external-sorter-bypass-cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6397.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6397
----
commit 18959bb385d499271fc0495816578f5c767fa07c
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T05:18:11Z
Move comparator methods closer together.
commit 19bccd6a172bec8da747f2de5f78a4af8be488d1
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T05:29:44Z
Remove duplicated buffer creation code.
commit 8d0678c2c42feb94419e02042d8272df58513b20
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T05:32:00Z
Move diskBytesSpilled getter next to variable
commit 6185ee2db1d0d4a2103da739746003186b876721
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T08:02:50Z
WIP towards moving bypass code into own file.
commit b6cc1ebe63ada7a557fd1b5129481f30b6d3afc8
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T08:31:07Z
Realize that bypass never buffers; proceed to delete tons of code
commit bb9667876b0b5aa9f43ef871e0a3cb2edb8e3f8e
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T08:34:24Z
Add missing interface file
commit d4cb536ce8e2cc269413c10442f337eb21e6807b
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T08:37:08Z
Delete more unused code
commit 02355efd009ac2d667d49ab2f2e758447b463196
Author: Josh Rosen <[email protected]>
Date: 2015-05-25T08:49:11Z
More simplification
----