GitHub user xuanyuanking opened a pull request:
https://github.com/apache/spark/pull/19745
[SPARK-2926][Core][Follow Up] Sort shuffle reader for Spark 2.x
## What changes were proposed in this pull request?
As commented in
[SPARK-2926](https://issues.apache.org/jira/browse/SPARK-2926), this is the
follow-up work that ports the old patch to Spark 2.x. This is a preview PR;
I will add more UTs if the community thinks it is still worth following up.
A detailed benchmark is attached in the JIRA. This patch mainly does the work below:
1. To support Spark Streaming, class `ShuffleBlockFetcherIterator` added
some wrapping work for `ManagedBuffer`, so here I change
`ShuffleBlockFetcherIterator` to return the `ManagedBuffer` and do the wrapping
work outside of `ShuffleBlockFetcherIterator`.
2. Class `ShuffleMemoryManager` has been replaced by `TaskMemoryManager`,
so I wrote a new class named `ExternalMerger` that inherits from
`Spillable[ArrayBuffer[MemoryShuffleBlock]]`. This class manages all files and
in-memory blocks during `SortShuffleReader.read()`.
3. Add a flag named `canUseSortShuffleWriter` in `SortShuffleManager`, to fix
the Spark UT errors in the scenario where `UnsafeShuffleWriter` is used in the
shuffle write stage but `SortShuffleReader` is used in the shuffle read stage.
4. Add a shuffle metric, `peakMemoryUsedBytes`.
5. A bug fix for data inconsistency in the old patch. [Code
Link](https://github.com/xuanyuanking/spark/blob/f07c939a25839a5b0f69c504afb9aa008b1b3c5d/core/src/main/scala/org/apache/spark/util/collection/ExternalMerger.scala#L97)
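For readers new to the idea behind a sort-based shuffle reader, the core step is merging several already-sorted inputs (in-memory blocks or spilled files) into one sorted stream. The following is only a minimal sketch of that k-way merge, not the code in this PR: `SortedMergeSketch` and `mergeSorted` are hypothetical names, and plain Scala iterators stand in for the shuffle blocks that `ExternalMerger` manages.

```scala
import scala.collection.mutable

// Hypothetical illustration only; not part of Spark or of this patch.
object SortedMergeSketch {

  /** Merge inputs that are each already sorted by key into one sorted iterator. */
  def mergeSorted[K, V](inputs: Seq[Iterator[(K, V)]])(
      implicit ord: Ordering[K]): Iterator[(K, V)] = {
    // mutable.PriorityQueue is a max-heap, so the key ordering is reversed
    // to make the iterator with the smallest head key come out first.
    val byHeadKey: Ordering[BufferedIterator[(K, V)]] =
      Ordering.by[BufferedIterator[(K, V)], K](_.head._1)(ord.reverse)
    val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](byHeadKey)

    // Only non-empty inputs take part in the merge.
    inputs.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

    new Iterator[(K, V)] {
      override def hasNext: Boolean = heap.nonEmpty

      override def next(): (K, V) = {
        if (!hasNext) throw new NoSuchElementException("all inputs are exhausted")
        val it = heap.dequeue()          // input with the smallest current key
        val record = it.next()
        if (it.hasNext) heap.enqueue(it) // re-insert under its new head key
        record
      }
    }
  }
}
```

A possible usage, assuming two small sorted inputs:

```scala
val a = Iterator(1 -> "a", 4 -> "d")
val b = Iterator(2 -> "b", 3 -> "c")
val merged = SortedMergeSketch.mergeSorted(Seq(a, b)).toList
```

Each input is dequeued before it is advanced and re-enqueued afterwards, so the heap never holds an iterator whose head changed while it was inside the queue.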
## How was this patch tested?
As the doc describes, I ran a benchmark test against current Spark master
and found no diff in the data output. I will add more UTs and complete this PR
following the community's advice.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuanyuanking/spark sort-shuffle-read-master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19745.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19745
----
commit b0f1f247cfee8fc7419c6fd3a831f54d1c9d4d63
Author: Yuanjian Li <[email protected]>
Date: 2017-10-19T06:39:53Z
Reimplementation for SPARK-2045 over branch 2.1
commit 1c07650d82f5c85189d4a5758722c3178caa0a3c
Author: Yuanjian Li <[email protected]>
Date: 2017-10-26T05:12:42Z
Code clean, include BlockManager and ExternalSorter reuse
commit dac1bf9662f1945df0efb5740df84980baa03d8e
Author: Yuanjian Li <[email protected]>
Date: 2017-10-26T05:35:22Z
Move ExternalMerger outside
commit 33418cae4c4eb80d12ff3bb7b0b4ee3f0a85575e
Author: Yuanjian Li <[email protected]>
Date: 2017-10-26T07:57:01Z
fix code style
commit f07c939a25839a5b0f69c504afb9aa008b1b3c5d
Author: Yuanjian Li <[email protected]>
Date: 2017-11-10T12:55:27Z
Bug fix for data inconsistency and more comments
commit ca43f1b44a41b68c2a9a83ced269c3ed644fef69
Author: Yuanjian Li <[email protected]>
Date: 2017-11-14T14:11:35Z
Fix unreasoning var name
----