[GitHub] [flink-web] infoverload commented on a change in pull request #476: Add blog post "Sort-Based Blocking Shuffle Implementation in Flink"

GitBox Mon, 25 Oct 2021 07:54:50 -0700


infoverload commented on a change in pull request #476:
URL: https://github.com/apache/flink-web/pull/476#discussion_r735684779




##########
File path: _posts/2021-10-15-sort-shuffle-part1.md
##########
@@ -0,0 +1,224 @@
+---
+layout: post
+title: "Sort-Based Blocking Shuffle Implementation in Flink - Part One"
+date: 2021-10-15 00:00:00
+authors:
+- Yingjie Cao:
+  name: "Yingjie Cao (Kevin)"
+- Daisy Tsang:
+  name: "Daisy Tsang"
+excerpt: Flink has implemented the sort-based blocking shuffle (FLIP-148) for 
batch data processing. In this blog post, we will take a close look at the 
design & implementation details and see what we can gain from it.
+---
+
+The part one of this blog post will explain the 
[motivation](#why-did-we-introduce-the-sort-based-implementation) behind 
sort-based blocking shuffle, and present the [benchmark 
results](#benchmark-results) to show how you can benefit from the new blocking 
shuffle implementation. The [operations & performance 
tuning](#operations--tuning) section provides guidelines on how to use this new 
feature.
+
+{% toc %}
+
+# Introduction
+
+Data shuffling is an important stage in batch processing applications and 
describes how data is sent from one operator to the next. In this phase, output 
data of the upstream operator will spill over to persistent storages like disk, 
then the downstream operator will read the corresponding data and process it. 
Blocking shuffle means that intermediate results from operator A are not sent 
immediately to operator B until operator A has completely finished.
+
+The hash-based and sort-based blocking shuffle are two main blocking shuffle 
implementations widely adopted by existing distributed data processing 
frameworks:
+
+1. **Hash-Based Approach:** The core idea behind the hash-based approach is to 
write data consumed by different consumer tasks to different files and each 
file can then serve as a natural boundary for the partitioned data.
+2. **Sort-Based Approach:** The core idea behind the sort-based approach is to 
write all the produced data together first and then leverage sorting to cluster 
data belonging to different data partitions or even keys.
+
+The sort-based blocking shuffle was introduced in Flink 1.12 and further 
optimized and made production-ready in 1.13 for both stability and performance. 
We hope you enjoy the improvements and any feedback is highly appreciated.
+
+# Why did we introduce the sort-based implementation?

Review comment:
       ```suggestion
   # Motivation behind the sort-based implementation
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink-web] infoverload commented on a change in pull request #476: Add blog post "Sort-Based Blocking Shuffle Implementation in Flink"

Reply via email to