infoverload commented on a change in pull request #476:
URL: https://github.com/apache/flink-web/pull/476#discussion_r735684779
##########
File path: _posts/2021-10-15-sort-shuffle-part1.md
##########
@@ -0,0 +1,224 @@
+---
+layout: post
+title: "Sort-Based Blocking Shuffle Implementation in Flink - Part One"
+date: 2021-10-15 00:00:00
+authors:
+- Yingjie Cao:
+ name: "Yingjie Cao (Kevin)"
+- Daisy Tsang:
+ name: "Daisy Tsang"
+excerpt: Flink has implemented the sort-based blocking shuffle (FLIP-148) for
batch data processing. In this blog post, we will take a close look at the
design & implementation details and see what we can gain from it.
+---
+
+Part one of this blog post explains the
[motivation](#why-did-we-introduce-the-sort-based-implementation) behind
sort-based blocking shuffle and presents [benchmark
results](#benchmark-results) showing how you can benefit from the new blocking
shuffle implementation. The [operations & performance
tuning](#operations--tuning) section provides guidelines on how to use this new
feature.
+
+{% toc %}
+
+# Introduction
+
+Data shuffling is an important stage in batch processing applications and
describes how data is sent from one operator to the next. In this phase, the
output data of the upstream operator is spilled to persistent storage such as
disk, and the downstream operator then reads the corresponding data and
processes it. Blocking shuffle means that intermediate results from operator A
are not sent to operator B until operator A has completely finished.
+
+Hash-based and sort-based blocking shuffle are the two main blocking shuffle
implementations widely adopted by existing distributed data processing
frameworks:
+
+1. **Hash-Based Approach:** The core idea behind the hash-based approach is to
write data consumed by different consumer tasks to different files, so that
each file serves as a natural boundary for the partitioned data.
+2. **Sort-Based Approach:** The core idea behind the sort-based approach is to
write all the produced data together first and then leverage sorting to cluster
data belonging to different data partitions or even keys.
+
+The sort-based blocking shuffle was introduced in Flink 1.12 and further
optimized and made production-ready in 1.13 for both stability and performance.
We hope you enjoy the improvements and any feedback is highly appreciated.
+
+# Why did we introduce the sort-based implementation?
Review comment:
```suggestion
# Motivation behind the sort-based implementation
```