wsry opened a new pull request #13595:
URL: https://github.com/apache/flink/pull/13595


   ## What is the purpose of the change
   
   Hash-based and sort-merge based blocking shuffle are the two main 
blocking shuffle implementations widely adopted by existing distributed 
data processing frameworks. The hash-based implementation writes the data 
sent to different reducer tasks into separate files concurrently, while the 
sort-merge based approach writes that data together into a single file and 
merges small files into bigger ones. Compared to the sort-merge based 
approach, the hash-based approach has several weak points when running 
large-scale batch jobs. By introducing the sort-merge based approach to 
Flink, we can improve Flink’s capability of running large-scale batch jobs.
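
   To make the difference between the two layouts concrete, here is a minimal, 
self-contained sketch in plain Java. It is not the code in this pull request: 
the names `SortMergeShuffleSketch`, `Record`, `PartitionedOutput`, 
`writeHashBased`, `writeSortMerge` and `readPartition` are hypothetical, 
in-memory byte arrays stand in for files, and spilling, merging of multiple 
files and compression are left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical names, in-memory "files", no spilling or merging.
final class SortMergeShuffleSketch {

    /** A serialized record tagged with the reducer (subpartition) it is sent to. */
    static final class Record {
        final int partition;
        final byte[] payload;

        Record(int partition, String value) {
            this.partition = partition;
            this.payload = value.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Hash-based layout: one separate output per reducer, all kept open at once. */
    static List<ByteArrayOutputStream> writeHashBased(List<Record> records, int numPartitions) {
        List<ByteArrayOutputStream> files = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            files.add(new ByteArrayOutputStream());
        }
        for (Record r : records) {
            files.get(r.partition).write(r.payload, 0, r.payload.length);
        }
        return files;
    }

    /** Sort-merge layout: a single data "file" plus per-partition start offsets. */
    static final class PartitionedOutput {
        final byte[] data;
        final long[] offsets; // partition p occupies [offsets[p], offsets[p + 1])

        PartitionedOutput(byte[] data, long[] offsets) {
            this.data = data;
            this.offsets = offsets;
        }
    }

    static PartitionedOutput writeSortMerge(List<Record> records, int numPartitions) {
        // Group records by target partition; List.sort is stable, so the
        // arrival order inside each partition is preserved.
        List<Record> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt((Record r) -> r.partition));

        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long[] offsets = new long[numPartitions + 1];
        int next = 0; // next partition whose start offset is still unset
        for (Record r : sorted) {
            while (next <= r.partition) {
                offsets[next++] = file.size();
            }
            file.write(r.payload, 0, r.payload.length);
        }
        while (next <= numPartitions) {
            offsets[next++] = file.size(); // trailing empty partitions and end marker
        }
        return new PartitionedOutput(file.toByteArray(), offsets);
    }

    /** A reducer reads only its own contiguous region of the single file. */
    static String readPartition(PartitionedOutput out, int partition) {
        int start = (int) out.offsets[partition];
        int end = (int) out.offsets[partition + 1];
        return new String(out.data, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        records.add(new Record(2, "c1"));
        records.add(new Record(0, "a1"));
        records.add(new Record(2, "c2"));
        PartitionedOutput out = writeSortMerge(records, 3);
        for (int p = 0; p < 3; p++) {
            System.out.println("partition " + p + ": " + readPartition(out, p));
        }
    }
}
```

   In this sketch the hash-based writer needs one open output per reducer, 
whereas the sort-merge writer produces one sequentially written output whose 
index lets each reducer read a contiguous region.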
   
   
   ## Brief change log
   
     - Introduce SortBuffer and its implementation PartitionSortedBuffer for 
sort-merge based blocking shuffle
     - Introduce PartitionedFile and the corresponding writer/reader for 
sort-merge based blocking shuffle
     - Introduce sort-merge based result partition SortMergeResultPartition and 
the corresponding subpartition reader
     - Introduce new config options to enable sort-merge based blocking shuffle
     - Introduce shuffle data compression to sort-merge based blocking shuffle
   
   
   ## Verifying this change
   
   Several new tests are added to verify this change, including 
PartitionSortedBufferTest, PartitionedFileWriteReadTest, 
SortMergeResultPartitionTest and BlockingShuffleITCase.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** / 
don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (**yes** / no)
     - If yes, how is the feature documented? (not applicable / **docs** / 
**JavaDocs** / not documented)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


