wsry opened a new pull request #13595:
URL: https://github.com/apache/flink/pull/13595
## What is the purpose of the change
Hash-based blocking shuffle and sort-merge based blocking shuffle are two
main blocking shuffle implementations wildly adopted by existing distributed
data processing frameworks. Hash-based implementation writes data sent to
different reducer tasks into separate files concurrently while sort-merge based
approach writes those data together into a single file and merges those small
files into bigger ones. Compared to sort-merge based approach, hash-based
approach has several weak points when it comes to running large scale batch
jobs. By introducing the sort-merge based approach to Flink, we can improve
Flinkās capability of running large scale batch jobs.
## Brief change log
- Introduce SortBuffer and its implementation PartitionSortedBuffer for
sort-merge based blocking shuffle
- Introduce PartitionedFile and the corresponding writer/reader for
sort-merge based blocking shuffle
- Introduce sort-merge based result partition SortMergeResultPartition and
the corresponding subpartition reader
- Introduce new config options to enable sort-merge based blocking shuffle
- Introduce shuffle data compression to sort-merge based blocking shuffle
## Verifying this change
Several new tests are added to verify this change, including
PartitionSortedBufferTest, PartitionedFileWriteReadTest,
SortMergeResultPartitionTest and BlockingShuffleITCase.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / **no**)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (yes / **no**)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (yes / **no**
/ don't know)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** /
don't know)
- The S3 file system connector: (yes / **no** / don't know)
## Documentation
- Does this pull request introduce a new feature? (**yes** / no)
- If yes, how is the feature documented? (not applicable / **docs** /
**JavaDocs** / not documented)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]