GitHub user ericvandenbergfb opened a pull request:
https://github.com/apache/spark/pull/19955
[SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter
## What changes were proposed in this pull request?
Add a multi-shuffle sorter that supports asynchronous spilling during a
shuffle external sort. The benefit is that inserting and sorting/spilling can
proceed in parallel, reducing overall latency for shuffle-heavy jobs (as many
ads jobs are). The multi-shuffle sorter sits between the UnsafeShuffleWriter
and the ShuffleExternalSorter, so few changes are needed outside of this
component, and there is correspondingly little room for regressing other code
paths.
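The overlap of insertion with sort/spill described above can be illustrated with a much-simplified, hypothetical sketch: while one in-memory sorter fills up, a full one is sorted and "spilled" on a background thread. `MiniSorter`, `MultiSorterSketch`, and `insertAll` are illustrative names, not Spark's actual classes or the PR's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultiSorterSketch {
    // Tiny stand-in for an in-memory shuffle sorter with a fixed capacity.
    static class MiniSorter {
        final List<Integer> records = new ArrayList<>();
        final int capacity;
        MiniSorter(int capacity) { this.capacity = capacity; }
        boolean isFull() { return records.size() >= capacity; }
        void insert(int r) { records.add(r); }
        List<Integer> sortAndSpill() {   // stand-in for sort + spill to disk
            Collections.sort(records);
            return records;
        }
    }

    static List<Integer> insertAll(int[] input, int capacity) throws Exception {
        ExecutorService spillPool = Executors.newSingleThreadExecutor();
        List<Future<List<Integer>>> spills = new ArrayList<>();
        MiniSorter active = new MiniSorter(capacity);
        for (int r : input) {
            if (active.isFull()) {
                // Hand the full sorter to the spill thread and keep inserting
                // into a fresh one instead of blocking on the spill.
                MiniSorter full = active;
                spills.add(spillPool.submit(full::sortAndSpill));
                active = new MiniSorter(capacity);
            }
            active.insert(r);
        }
        spills.add(spillPool.submit(active::sortAndSpill));
        // Collect the sorted runs in order (a real writer would merge them).
        List<Integer> runs = new ArrayList<>();
        for (Future<List<Integer>> f : spills) runs.addAll(f.get());
        spillPool.shutdown();
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // prints [1, 3, 5, 9, 2, 6, 7, 8]: two sorted runs of four records
        System.out.println(insertAll(new int[]{5, 3, 9, 1, 8, 2, 7, 6}, 4));
    }
}
```

The point of the design is visible in the loop: the insert path never waits for a spill to finish, it only hands off the full sorter and continues.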
The multi-shuffle sorter is enabled via a configuration flag,
spark.shuffle.async.num.sorter (default 1). With the default value of 1 the
multi-shuffle sorter is not used; it must be configured with multiple sorters
(>= 2) to take effect.
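For reference, a flag like this would be set through the usual SparkConf mechanism; the snippet below is a minimal sketch (the app name is illustrative, and the flag itself is the one proposed by this PR, not yet part of released Spark). It is a config fragment that needs spark-core on the classpath.

```java
import org.apache.spark.SparkConf;

public class AsyncSorterConf {
    public static void main(String[] args) {
        // Per the PR: default 1 disables the multi-shuffle sorter;
        // a value >= 2 enables it with that many sorters.
        SparkConf conf = new SparkConf()
            .setAppName("shuffle-heavy-job")   // illustrative app name
            .set("spark.shuffle.async.num.sorter", "2");
    }
}
```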
A design spec is attached to the JIRA.
## How was this patch tested?
Added unit tests specifically for the MultiShuffleSorter to exercise it under
various spill and insert conditions.
Extended the UnsafeShuffleWriterSuite to run against either a single
ShuffleExternalSorter or multiple sorters via the MultiShuffleSorter, to ensure
no regressions.
Ran against production workloads, observed gains, and validated them via
logging.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ericvandenbergfb/spark
async.multi.shuffle.sorter.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19955.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19955
----
commit 7f751ec23ba6d8c53c009edb7d62a460e9166d7f
Author: Eric Vandenberg <[email protected]>
Date: 2017-12-12T02:23:26Z
[SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter
Add a multi-shuffle sorter that supports asynchronous spilling during a
shuffle external sort.
The benefit is that inserting and sorting/spilling can proceed in parallel,
reducing overall latency for shuffle-heavy jobs (as many ads jobs are).
The multi-shuffle sorter sits between the UnsafeShuffleWriter and the
ShuffleExternalSorter, so few changes are needed outside of this component,
and there is correspondingly little room for regressing other code paths.
The multi-shuffle sorter is enabled via a configuration flag,
spark.shuffle.async.num.sorter (default 1). With the default value of 1 the
multi-shuffle sorter is not used; it must be configured with multiple
sorters (>= 2) to take effect.
A design spec is attached to the JIRA.
----