This is an automated email from the ASF dual-hosted git repository.

pnowojski pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/flink-web.git

commit 207954c45c2e85b873921c063fa282e8880849eb
Author: Piotr Nowojski <[email protected]>
AuthorDate: Fri Jul 2 13:09:02 2021 +0200

    Backpressure monitoring and analysis blog post
---
 _posts/2021-07-07-backpressure.md                  | 189 +++++++++++++
 content/blog/feed.xml                              | 291 ++++++++++++++-------
 content/blog/index.html                            |  38 +--
 content/blog/page10/index.html                     |  38 ++-
 content/blog/page11/index.html                     |  37 ++-
 content/blog/page12/index.html                     |  39 +--
 content/blog/page13/index.html                     |  38 ++-
 content/blog/page14/index.html                     |  37 ++-
 content/blog/page15/index.html                     |  39 +--
 content/blog/page16/index.html                     |  25 ++
 content/blog/page2/index.html                      |  38 ++-
 content/blog/page3/index.html                      |  36 ++-
 content/blog/page4/index.html                      |  36 ++-
 content/blog/page5/index.html                      |  36 ++-
 content/blog/page6/index.html                      |  38 +--
 content/blog/page7/index.html                      |  38 ++-
 content/blog/page8/index.html                      |  38 +--
 content/blog/page9/index.html                      |  40 +--
 content/index.html                                 |   8 +-
 content/zh/index.html                              |   8 +-
 img/blog/2021-07-07-backpressure/animated.png      | Bin 0 -> 847082 bytes
 .../2021-07-07-backpressure/bottleneck-zoom.png    | Bin 0 -> 185048 bytes
 .../2021-07-07-backpressure/simple-example.png     | Bin 0 -> 102308 bytes
 .../2021-07-07-backpressure/sliding-window.png     | Bin 0 -> 22391 bytes
 .../2021-07-07-backpressure/source-task-busy.png   | Bin 0 -> 25852 bytes
 img/blog/2021-07-07-backpressure/subtasks.png      | Bin 0 -> 137756 bytes
 26 files changed, 778 insertions(+), 309 deletions(-)

diff --git a/_posts/2021-07-07-backpressure.md 
b/_posts/2021-07-07-backpressure.md
new file mode 100644
index 0000000..b144d1b
--- /dev/null
+++ b/_posts/2021-07-07-backpressure.md
@@ -0,0 +1,189 @@
+---
+layout: post
+title:  "How to identify the source of backpressure?"
+date: 2021-07-07T00:00:00.000Z
+authors:
+- pnowojski:
+  name: "Piotr Nowojski"
+  twitter: "PiotrNowojski"
+excerpt: Apache Flink 1.13 introduced a couple of important changes in the 
area of backpressure monitoring and performance analysis of Flink Jobs. This 
blog post aims to introduce those changes and explain how to use them.
+---
+
+{% toc %}
+
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/2021-07-07-backpressure/animated.png" 
alt="Backpressure monitoring in the web UI"/>
+       <p class="align-center">Backpressure monitoring in the web UI</p>
+</div>
+
+The backpressure topic was tackled from different angles over the last couple 
of years. However, when it comes 
+to identifying and analyzing sources of backpressure, things have changed 
quite a bit in the recent Flink releases 
+(especially with new additions to metrics and the web UI in Flink 1.13). This 
post will try to clarify some of 
+these changes and go into more detail about how to track down the source of 
backpressure, but first...
+
+## What is backpressure?
+
+This has been explained very well in an old, but still accurate, [post by Ufuk 
Celebi](https://www.ververica.com/blog/how-flink-handles-backpressure).
+I highly recommend reading it if you are not familiar with this concept. For a 
much deeper and low-level understanding of
+the topic and how Flink’s network stack works, there is a more [advanced 
explanation available 
here](https://alibabacloud.com/blog/analysis-of-network-flow-control-and-back-pressure-flink-advanced-tutorials_596632).
+
+At a high level, backpressure happens if some operator(s) in the Job Graph 
cannot process records at the
+same rate as they are received. This fills up the input buffers of the subtask 
that is running this slow operator.
+Once the input buffers are full, backpressure propagates to the output buffers 
of the upstream subtasks.
+Once those are filled up, the upstream subtasks are also forced to slow down their record processing
+rate to match the processing rate of the operator causing this bottleneck downstream. Backpressure
+propagates further upstream until it reaches the source operators.
+
+As long as the load and available resources are static and none of the 
operators produce short bursts of
+data (like windowing operators), those input/output buffers should only be in 
one of two states: almost empty
+or almost full. If the downstream operator or subtask is able to keep up with 
the influx of data, the
+buffers will be empty. If not, then the buffers will be full [<sup>1</sup>]. In fact, checking the buffers’ usage metrics
+was the basis of the previously recommended way to detect and analyze backpressure, described [a couple
+of years back by Nico Kruber](https://flink.apache.org/2019/07/23/flink-network-stack-2.html#backpressure).
+As I mentioned in the beginning, Flink now offers much better tools to do the 
same job, but before we get to that,
+there are two questions worth asking.  
+
+### Why should I care about backpressure?
+
+Backpressure is an indicator that your machines or operators are overloaded. 
The buildup of backpressure
+directly affects the end-to-end latency of the system, as records are waiting 
longer in the queues before
+being processed. Secondly, aligned checkpointing takes longer with backpressure, while unaligned checkpoints
+will be larger (you can read more about aligned and unaligned checkpoints [in the documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/concepts/stateful-stream-processing/#checkpointing)).
+If you are struggling with checkpoint barrier propagation times, taking care of backpressure would most
+likely help to solve the problem. Lastly, you might just want to optimize your job in order to reduce
+the costs of running the job.
+
+In order to address the problem for all cases, one needs to be aware of it, 
then locate and analyze it. 
+
+### Why shouldn’t I care about backpressure?
+
+Frankly, you do not always have to care about the presence of backpressure. Almost by definition, lack
+of backpressure means that your cluster is at least ever so slightly underutilized and over-provisioned.
+If you want to minimize idling resources, you probably cannot avoid incurring some backpressure. This
+is especially true for batch processing.
+
+## How to detect and track down the source of backpressure?
+
+One way to detect backpressure is to use [metrics](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#system-metrics);
+however, in Flink 1.13 it’s no longer necessary to dig so deep. In most cases, it should be enough to just
+look at the job graph in the Web UI.
+
+<div class="row front-graphic">
+  <img src="{{ site.baseurl 
}}/img/blog/2021-07-07-backpressure/simple-example.png"/>
+</div>
+
+The first thing to note in the example above is that different tasks have 
different colors. Those colors
+represent a combination of two factors: how much backpressure this task is under and how busy it is. Idling
+tasks will be blue, fully busy tasks will be red hot, and fully backpressured 
tasks will be black. Anything
+in between will be a combination/shade of those three colors. With this 
knowledge, one can easily spot the
+backpressured tasks (black). The busiest (red) task downstream of the 
backpressured tasks will most likely
+be the source of the backpressure (the bottleneck).
+
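To make the coloring logic concrete, here is a small Python sketch. This is hypothetical code, not Flink's actual web UI implementation: the `classify` and `find_bottleneck` helpers and their 50%/10% thresholds are made up for illustration of the idea that the busiest non-backpressured task is the likely bottleneck.

```python
# Hypothetical sketch (not Flink code): classify tasks by the same
# busy/backpressured signals the web UI colors them with, then pick the
# likely bottleneck: the busiest task that is not itself backpressured.

def classify(busy_ratio: float, backpressured_ratio: float) -> str:
    """Roughly mirror the web UI coloring: black for backpressured,
    red for busy, blue for idling. Thresholds are illustrative."""
    if backpressured_ratio > 0.5:
        return "backpressured (black)"
    if busy_ratio > 0.5:
        return "busy (red)"
    return "idling (blue)"

def find_bottleneck(tasks):
    """tasks: list of (name, busy_ratio, backpressured_ratio) ordered
    upstream -> downstream. The bottleneck candidate is the busiest task
    that is not significantly backpressured itself."""
    candidates = [t for t in tasks if t[2] <= 0.1]
    return max(candidates, key=lambda t: t[1])[0] if candidates else None

pipeline = [
    ("Source", 0.05, 0.95),   # fully backpressured
    ("Map",    0.20, 0.80),   # mostly backpressured
    ("Window", 0.98, 0.02),   # red hot -> likely bottleneck
    ("Sink",   0.40, 0.00),
]
for name, busy, bp in pipeline:
    print(name, "->", classify(busy, bp))
print("likely bottleneck:", find_bottleneck(pipeline))
```

In a real job these ratios come from the subtask metrics discussed below; the point of the sketch is only the "busiest task downstream of the black ones" heuristic.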
+If you click on one particular task and go into the “BackPressure” tab, you will be able to further dissect
+the problem and check the busy/backpressured/idle status of every subtask in that task. For example,
+this is especially handy if there is data skew and not all subtasks are equally utilized.
+
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/2021-07-07-backpressure/subtasks.png" 
alt="Backpressure among subtasks"/>
+       <p class="align-center">Backpressure among subtasks</p>
+</div>
+
+In the above example, we can clearly see which subtasks are idling, which are 
backpressured, and that
+none of them are busy. And frankly, in a nutshell, that should be enough to quickly understand what is
+happening with your Job :) However, there are a couple more details worth explaining.
+
+### What are those numbers?
+
+If you are curious how it works underneath, we can go a little deeper. At the base of this new mechanism
+we have three [new metrics](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#io)
+that are exposed and calculated by each subtask:
+
+- `idleTimeMsPerSecond`
+- `busyTimeMsPerSecond`
+- `backPressuredTimeMsPerSecond`
+
+Each of them measures the average time in milliseconds per second that the subtask spent being idle,
+busy, or backpressured, respectively. Apart from some rounding errors, they should complement each other and
+add up to 1000ms/s. In essence, they are quite similar to, for example, CPU usage metrics.
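
As a toy illustration of that invariant, the three values can be thought of as a breakdown of where one second of the subtask thread's wall time went. This is not how Flink actually computes the metrics; the `time_metrics` helper and the per-millisecond sampling are assumptions made purely for the sketch.

```python
# Illustrative sketch only (not Flink's implementation): derive the three
# per-second metrics from how a subtask thread spent one second of wall time.
from collections import Counter

def time_metrics(samples):
    """samples: one entry per millisecond, each "idle", "busy" or
    "backpressured". Returns ms-per-second values that, like Flink's
    metrics, add up to (roughly) 1000."""
    counts = Counter(samples)
    total = len(samples)
    return {
        "idleTimeMsPerSecond": 1000 * counts["idle"] // total,
        "busyTimeMsPerSecond": 1000 * counts["busy"] // total,
        "backPressuredTimeMsPerSecond": 1000 * counts["backpressured"] // total,
    }

# 300 ms idle, 500 ms busy, 200 ms backpressured:
samples = ["idle"] * 300 + ["busy"] * 500 + ["backpressured"] * 200
metrics = time_metrics(samples)
print(metrics)  # the three values sum to 1000 ms/s
```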
+
+Another important detail is that they are being averaged over a short period 
of time (a couple of seconds)
+and they take into account everything that is happening inside the subtask’s 
thread: operators, functions,
+timers, checkpointing, records serialization/deserialization, network stack, 
and other Flink internal
+overheads. A `WindowOperator` that is busy firing timers and producing results 
will be reported as busy or backpressured.
+A function doing some expensive computation in a `CheckpointedFunction#snapshotState` call, for instance flushing
+internal buffers, will also be reported as busy.
+
+One limitation, however, is that the `busyTimeMsPerSecond` and `idleTimeMsPerSecond` metrics are oblivious
+to anything that is happening in separate threads, outside of the main subtask’s execution loop.
+Fortunately, this is only relevant for two cases:
+
+- Custom threads that you manually spawn in your operators (a discouraged practice).
+- Old-style sources that implement the deprecated `SourceFunction` interface. Such sources will report `NaN`/`N/A`
+as the value for `busyTimeMsPerSecond`. For more information on the topic of Data Sources, please
+[take a look here](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/sources/).
+
+<div class="row front-graphic">
+  <img src="{{ site.baseurl 
}}/img/blog/2021-07-07-backpressure/source-task-busy.png" alt="Old-style 
sources do not report busy time"/>
+       <p class="align-center">Old-style sources do not report busy time</p>
+</div>
+
+In order to present those raw numbers in the web UI, the metrics need to be aggregated from all subtasks
+(on the job graph we are showing only tasks). This is why the web UI presents the maximal value from all
+subtasks of a given task, and why the aggregated maximal values of busy and backpressured may not add up to 100%.
+One subtask can be backpressured at 60%, while another can be busy at 60%. This can result in a task that
+is reported as both backpressured and busy at 60%.
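
The per-task aggregation described above can be sketched in a few lines. The `task_level_view` helper is a hypothetical illustration, not Flink code; it only shows why two independent maxima need not add up to 100%.

```python
# Sketch of the web UI's task-level aggregation: per task, show the maximum
# of each metric over all subtasks. "busy" and "backpressured" can then each
# be 60%, even though no single subtask exceeds 100% in total.

def task_level_view(subtasks):
    """subtasks: list of dicts with busy/backpressured percentages."""
    return {
        "busyMax": max(s["busy"] for s in subtasks),
        "backPressuredMax": max(s["backpressured"] for s in subtasks),
    }

subtasks = [
    {"busy": 60, "backpressured": 30},  # subtask 0: mostly busy
    {"busy": 30, "backpressured": 60},  # subtask 1: mostly backpressured
]
print(task_level_view(subtasks))
```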
+
+### Varying load
+
+There is one more thing. Do you remember that those metrics are measured and 
averaged over a couple of seconds?
+Keep this in mind when analyzing jobs or tasks with varying load, such as 
(sub)tasks containing a `WindowOperator`
+that is firing periodically. Both a subtask with a constant load of 50% and a subtask that alternates every
+second between being fully busy and fully idle will report the same `busyTimeMsPerSecond`
+value of 500ms/s.
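
A quick numerical sketch of this averaging pitfall (illustrative only; `averaged_busy_time` is a made-up helper, not a Flink metric):

```python
# Over a long enough averaging window, a constant 50% load and a load that
# alternates between 0% and 100% every second report the same
# busyTimeMsPerSecond of 500 ms/s.

def averaged_busy_time(per_second_busy_ms, window_seconds):
    """Average the per-second busyTimeMsPerSecond readings over a window."""
    window = per_second_busy_ms[:window_seconds]
    return sum(window) / len(window)

constant = [500] * 10        # 50% busy every second
alternating = [1000, 0] * 5  # fully busy, then fully idle

print(averaged_busy_time(constant, 10))     # 500.0
print(averaged_busy_time(alternating, 10))  # 500.0
```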
+
+Furthermore, varying load and especially firing windows can move the 
bottleneck to a different place in
+the job graph:
+
+<div class="row front-graphic">
+  <img src="{{ site.baseurl 
}}/img/blog/2021-07-07-backpressure/bottleneck-zoom.png" alt="Bottleneck 
alternating between two tasks"/>
+       <p class="align-center">Bottleneck alternating between two tasks</p>
+</div>
+
+<div class="row front-graphic">
+  <img src="{{ site.baseurl 
}}/img/blog/2021-07-07-backpressure/sliding-window.png" 
alt="SlidingWindowOperator"/>
+       <p class="align-center">SlidingWindowOperator</p>
+</div>
+
+In this particular example, `SlidingWindowOperator` was the bottleneck as long as it was accumulating records.
+However, as soon as it started to fire its windows (once every 10 seconds), the downstream task
+`SlidingWindowCheckMapper -> Sink: SlidingWindowCheckPrintSink` became the bottleneck and `SlidingWindowOperator`
+got backpressured. As those busy/backpressured/idle metrics average time over a couple of seconds,
+this subtlety is not immediately visible and has to be read between the lines. On top of that, the web UI
+updates its state only once every 10 seconds, which makes spotting more frequent changes a bit more difficult.
+
+## What can I do with backpressure?
+
+In general, this is a complex topic that is worthy of a dedicated blog post. It was, to a certain extent,
+addressed in [previous blog posts](https://flink.apache.org/2019/07/23/flink-network-stack-2.html#:~:text=this%20is%20unnecessary.-,What%20to%20do%20with%20Backpressure%3F,-Assuming%20that%20you).
+In short, there are two high-level ways of dealing with backpressure. Either 
add more resources (more machines,
+faster CPU, more RAM, better network, using SSDs…) or optimize usage of the 
resources you already have
+(optimize the code, tune the configuration, avoid data skew). In either case, you first need to analyze
+what is causing backpressure by:
+
+1. Identifying the presence of backpressure.
+2. Locating which subtask(s) or machines are causing it.
+3. Digging deeper into what part of the code is causing it and which resource is scarce.
+
+Backpressure monitoring improvements and metrics can help you with the first 
two points. To tackle the
+last one, profiling the code can be the way to go. To help with profiling, 
also starting from Flink 1.13,
+[Flame Graphs](http://www.brendangregg.com/flamegraphs.html) are [integrated 
into Flink's web 
UI](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/debugging/flame_graphs/).
+Flame Graphs are a well-known profiling and visualization technique, and I encourage you to give them a try.
+
+But keep in mind that after locating where the bottleneck is, you can analyze 
it the same way you would
+any other non-distributed application (by checking resource utilization, 
attaching a profiler, etc).
+Usually there is no silver bullet for problems like this. You can try to scale up, but sometimes that might
+not be easy or practical to do.
+
+Anyway... The aforementioned improvements to backpressure monitoring allow us 
to easily detect the source of backpressure,
+and Flame Graphs can help us to analyze why a particular subtask is causing 
problems. Together those two
+features should make the previously quite tedious process of debugging and 
performance analysis of Flink
+jobs that much easier! Please upgrade to Flink 1.13.x and try them out!
+
+[<sup>1</sup>] There is a third possibility. In a rare case when network 
exchange is actually the bottleneck in your job,
+the downstream task will have empty input buffers, while upstream output 
buffers will be full. <a class="anchor" id="1"></a>
\ No newline at end of file
diff --git a/content/blog/feed.xml b/content/blog/feed.xml
index 8e79808..dcf63e1 100644
--- a/content/blog/feed.xml
+++ b/content/blog/feed.xml
@@ -7,6 +7,207 @@
 <atom:link href="https://flink.apache.org/blog/feed.xml"; rel="self" 
type="application/rss+xml" />
 
 <item>
+<title>How to identify the source of backpressure?</title>
+<description>&lt;div class=&quot;page-toc&quot;&gt;
+&lt;ul id=&quot;markdown-toc&quot;&gt;
+  &lt;li&gt;&lt;a href=&quot;#what-is-backpressure&quot; 
id=&quot;markdown-toc-what-is-backpressure&quot;&gt;What is 
backpressure?&lt;/a&gt;    &lt;ul&gt;
+      &lt;li&gt;&lt;a href=&quot;#why-should-i-care-about-backpressure&quot; 
id=&quot;markdown-toc-why-should-i-care-about-backpressure&quot;&gt;Why should 
I care about backpressure?&lt;/a&gt;&lt;/li&gt;
+      &lt;li&gt;&lt;a href=&quot;#why-shouldnt-i-care-about-backpressure&quot; 
id=&quot;markdown-toc-why-shouldnt-i-care-about-backpressure&quot;&gt;Why 
shouldn’t I care about backpressure?&lt;/a&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;#how-to-detect-and-track-down-the-source-of-backpressure&quot; 
id=&quot;markdown-toc-how-to-detect-and-track-down-the-source-of-backpressure&quot;&gt;How
 to detect and track down the source of backpressure?&lt;/a&gt;    &lt;ul&gt;
+      &lt;li&gt;&lt;a href=&quot;#what-are-those-numbers&quot; 
id=&quot;markdown-toc-what-are-those-numbers&quot;&gt;What are those 
numbers?&lt;/a&gt;&lt;/li&gt;
+      &lt;li&gt;&lt;a href=&quot;#varying-load&quot; 
id=&quot;markdown-toc-varying-load&quot;&gt;Varying load&lt;/a&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;#what-can-i-do-with-backpressure&quot; 
id=&quot;markdown-toc-what-can-i-do-with-backpressure&quot;&gt;What can I do 
with backpressure?&lt;/a&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;/div&gt;
+
+&lt;div class=&quot;row front-graphic&quot;&gt;
+  &lt;img src=&quot;/img/blog/2021-07-07-backpressure/animated.png&quot; 
alt=&quot;Backpressure monitoring in the web UI&quot; /&gt;
+       &lt;p class=&quot;align-center&quot;&gt;Backpressure monitoring in the 
web UI&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;p&gt;The backpressure topic was tackled from different angles over the 
last couple of years. However, when it comes 
+to identifying and analyzing sources of backpressure, things have changed 
quite a bit in the recent Flink releases 
+(especially with new additions to metrics and the web UI in Flink 1.13). This 
post will try to clarify some of 
+these changes and go into more detail about how to track down the source of 
backpressure, but first…&lt;/p&gt;
+
+&lt;h2 id=&quot;what-is-backpressure&quot;&gt;What is backpressure?&lt;/h2&gt;
+
+&lt;p&gt;This has been explained very well in an old, but still accurate, 
&lt;a 
href=&quot;https://www.ververica.com/blog/how-flink-handles-backpressure&quot;&gt;post
 by Ufuk Celebi&lt;/a&gt;.
+I highly recommend reading it if you are not familiar with this concept. For a 
much deeper and low-level understanding of
+the topic and how Flink’s network stack works, there is a more &lt;a 
href=&quot;https://alibabacloud.com/blog/analysis-of-network-flow-control-and-back-pressure-flink-advanced-tutorials_596632&quot;&gt;advanced
 explanation available here&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;At a high level, backpressure happens if some operator(s) in the Job 
Graph cannot process records at the
+same rate as they are received. This fills up the input buffers of the subtask 
that is running this slow operator.
+Once the input buffers are full, backpressure propagates to the output buffers 
of the upstream subtasks.
+Once those are filled up, the upstream subtasks are also forced to slow down their record processing
+rate to match the processing rate of the operator causing this bottleneck downstream. Backpressure
+propagates further upstream until it reaches the source operators.&lt;/p&gt;
+
+&lt;p&gt;As long as the load and available resources are static and none of 
the operators produce short bursts of
+data (like windowing operators), those input/output buffers should only be in 
one of two states: almost empty
+or almost full. If the downstream operator or subtask is able to keep up with 
the influx of data, the
+buffers will be empty. If not, then the buffers will be full 
[&lt;sup&gt;1&lt;/sup&gt;]. In fact, checking the buffers’ usage metrics
+was the basis of the previously recommended way to detect and analyze backpressure, described &lt;a href=&quot;https://flink.apache.org/2019/07/23/flink-network-stack-2.html#backpressure&quot;&gt;a couple
+of years back by Nico Kruber&lt;/a&gt;.
+As I mentioned in the beginning, Flink now offers much better tools to do the 
same job, but before we get to that,
+there are two questions worth asking.&lt;/p&gt;
+
+&lt;h3 id=&quot;why-should-i-care-about-backpressure&quot;&gt;Why should I 
care about backpressure?&lt;/h3&gt;
+
+&lt;p&gt;Backpressure is an indicator that your machines or operators are 
overloaded. The buildup of backpressure
+directly affects the end-to-end latency of the system, as records are waiting 
longer in the queues before
+being processed. Secondly, aligned checkpointing takes longer with backpressure, while unaligned checkpoints
+will be larger (you can read more about aligned and unaligned checkpoints &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/concepts/stateful-stream-processing/#checkpointing&quot;&gt;in the documentation&lt;/a&gt;).
+If you are struggling with checkpoint barrier propagation times, taking care of backpressure would most
+likely help to solve the problem. Lastly, you might just want to optimize your job in order to reduce
+the costs of running the job.&lt;/p&gt;
+
+&lt;p&gt;In order to address the problem for all cases, one needs to be aware 
of it, then locate and analyze it.&lt;/p&gt;
+
+&lt;h3 id=&quot;why-shouldnt-i-care-about-backpressure&quot;&gt;Why shouldn’t 
I care about backpressure?&lt;/h3&gt;
+
+&lt;p&gt;Frankly, you do not always have to care about the presence of backpressure. Almost by definition, lack
+of backpressure means that your cluster is at least ever so slightly underutilized and over-provisioned.
+If you want to minimize idling resources, you probably cannot avoid incurring some backpressure. This
+is especially true for batch processing.&lt;/p&gt;
+
+&lt;h2 
id=&quot;how-to-detect-and-track-down-the-source-of-backpressure&quot;&gt;How 
to detect and track down the source of backpressure?&lt;/h2&gt;
+
+&lt;p&gt;One way to detect backpressure is to use &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#system-metrics&quot;&gt;metrics&lt;/a&gt;;
+however, in Flink 1.13 it’s no longer necessary to dig so deep. In most cases, it should be enough to just
+look at the job graph in the Web UI.&lt;/p&gt;
+
+&lt;div class=&quot;row front-graphic&quot;&gt;
+  &lt;img src=&quot;/img/blog/2021-07-07-backpressure/simple-example.png&quot; 
/&gt;
+&lt;/div&gt;
+
+&lt;p&gt;The first thing to note in the example above is that different tasks 
have different colors. Those colors
+represent a combination of two factors: how much backpressure this task is under and how busy it is. Idling
+tasks will be blue, fully busy tasks will be red hot, and fully backpressured 
tasks will be black. Anything
+in between will be a combination/shade of those three colors. With this 
knowledge, one can easily spot the
+backpressured tasks (black). The busiest (red) task downstream of the 
backpressured tasks will most likely
+be the source of the backpressure (the bottleneck).&lt;/p&gt;
+
+&lt;p&gt;If you click on one particular task and go into the “BackPressure” tab, you will be able to further dissect
+the problem and check the busy/backpressured/idle status of every subtask in that task. For example,
+this is especially handy if there is data skew and not all subtasks are equally utilized.&lt;/p&gt;
+
+&lt;div class=&quot;row front-graphic&quot;&gt;
+  &lt;img src=&quot;/img/blog/2021-07-07-backpressure/subtasks.png&quot; 
alt=&quot;Backpressure among subtasks&quot; /&gt;
+       &lt;p class=&quot;align-center&quot;&gt;Backpressure among 
subtasks&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;p&gt;In the above example, we can clearly see which subtasks are idling, 
which are backpressured, and that
+none of them are busy. And frankly, in a nutshell, that should be enough to quickly understand what is
+happening with your Job :) However, there are a couple more details worth explaining.&lt;/p&gt;
+
+&lt;h3 id=&quot;what-are-those-numbers&quot;&gt;What are those 
numbers?&lt;/h3&gt;
+
+&lt;p&gt;If you are curious how it works underneath, we can go a little 
deeper. At the base of this new mechanism
+we have three &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#io&quot;&gt;new
 metrics&lt;/a&gt;
+that are exposed and calculated by each subtask:&lt;/p&gt;
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;idleTimeMsPerSecond&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;busyTimeMsPerSecond&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;backPressuredTimeMsPerSecond&lt;/code&gt;&lt;/li&gt;
+&lt;/ul&gt;
+&lt;p&gt;Each of them measures the average time in milliseconds per second that the subtask spent being idle,
+busy, or backpressured, respectively. Apart from some rounding errors, they should complement each other and
+add up to 1000ms/s. In essence, they are quite similar to, for example, CPU usage metrics.&lt;/p&gt;
+
+&lt;p&gt;Another important detail is that they are being averaged over a short 
period of time (a couple of seconds)
+and they take into account everything that is happening inside the subtask’s 
thread: operators, functions,
+timers, checkpointing, records serialization/deserialization, network stack, 
and other Flink internal
+overheads. A &lt;code&gt;WindowOperator&lt;/code&gt; that is busy firing 
timers and producing results will be reported as busy or backpressured.
+A function doing some expensive computation in a &lt;code&gt;CheckpointedFunction#snapshotState&lt;/code&gt; call, for instance flushing
+internal buffers, will also be reported as busy.&lt;/p&gt;
+
+&lt;p&gt;One limitation, however, is that 
&lt;code&gt;busyTimeMsPerSecond&lt;/code&gt; and 
&lt;code&gt;idleTimeMsPerSecond&lt;/code&gt; metrics are oblivious
+to anything that is happening in separate threads, outside of the main 
subtask’s execution loop.
+Fortunately, this is only relevant for two cases:&lt;/p&gt;
+&lt;ul&gt;
+  &lt;li&gt;Custom threads that you manually spawn in your operators (a discouraged practice).&lt;/li&gt;
+  &lt;li&gt;Old-style sources that implement the deprecated &lt;code&gt;SourceFunction&lt;/code&gt; interface. Such sources will report &lt;code&gt;NaN&lt;/code&gt;/&lt;code&gt;N/A&lt;/code&gt;
+as the value for &lt;code&gt;busyTimeMsPerSecond&lt;/code&gt;. For more information on the topic of Data Sources, please
+&lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/sources/&quot;&gt;take a look here&lt;/a&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;row front-graphic&quot;&gt;
+  &lt;img 
src=&quot;/img/blog/2021-07-07-backpressure/source-task-busy.png&quot; 
alt=&quot;Old-style sources do not report busy time&quot; /&gt;
+       &lt;p class=&quot;align-center&quot;&gt;Old-style sources do not report 
busy time&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;p&gt;In order to present those raw numbers in the web UI, the metrics need to be aggregated from all subtasks
+(on the job graph we are showing only tasks). This is why the web UI presents the maximal value from all
+subtasks of a given task, and why the aggregated maximal values of busy and backpressured may not add up to 100%.
+One subtask can be backpressured at 60%, while another can be busy at 60%. This can result in a task that
+is reported as both backpressured and busy at 60%.&lt;/p&gt;
+
+&lt;h3 id=&quot;varying-load&quot;&gt;Varying load&lt;/h3&gt;
+
+&lt;p&gt;There is one more thing. Do you remember that those metrics are 
measured and averaged over a couple of seconds?
+Keep this in mind when analyzing jobs or tasks with varying load, such as 
(sub)tasks containing a &lt;code&gt;WindowOperator&lt;/code&gt;
+that is firing periodically. Both a subtask with a constant load of 50% and a subtask that alternates every
+second between being fully busy and fully idle will report the same &lt;code&gt;busyTimeMsPerSecond&lt;/code&gt;
+value of 500ms/s.&lt;/p&gt;
+
+&lt;p&gt;Furthermore, varying load and especially firing windows can move the 
bottleneck to a different place in
+the job graph:&lt;/p&gt;
+
+&lt;div class=&quot;row front-graphic&quot;&gt;
+  &lt;img 
src=&quot;/img/blog/2021-07-07-backpressure/bottleneck-zoom.png&quot; 
alt=&quot;Bottleneck alternating between two tasks&quot; /&gt;
+       &lt;p class=&quot;align-center&quot;&gt;Bottleneck alternating between 
two tasks&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;row front-graphic&quot;&gt;
+  &lt;img src=&quot;/img/blog/2021-07-07-backpressure/sliding-window.png&quot; 
alt=&quot;SlidingWindowOperator&quot; /&gt;
+       &lt;p class=&quot;align-center&quot;&gt;SlidingWindowOperator&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;p&gt;In this particular example, &lt;code&gt;SlidingWindowOperator&lt;/code&gt; was the bottleneck as long as it was accumulating records.
+However, as soon as it started to fire its windows (once every 10 seconds), the downstream task
+&lt;code&gt;SlidingWindowCheckMapper -&amp;gt; Sink: SlidingWindowCheckPrintSink&lt;/code&gt; became the bottleneck and &lt;code&gt;SlidingWindowOperator&lt;/code&gt;
+got backpressured. As those busy/backpressured/idle metrics average time over a couple of seconds,
+this subtlety is not immediately visible and has to be read between the lines. On top of that, the web UI
+updates its state only once every 10 seconds, which makes spotting more frequent changes a bit more difficult.&lt;/p&gt;
+
+&lt;h2 id=&quot;what-can-i-do-with-backpressure&quot;&gt;What can I do with 
backpressure?&lt;/h2&gt;
+
+&lt;p&gt;In general, this is a complex topic that is worthy of a dedicated blog post. It was, to a certain extent,
+addressed in &lt;a 
href=&quot;https://flink.apache.org/2019/07/23/flink-network-stack-2.html#:~:text=this%20is%20unnecessary.-,What%20to%20do%20with%20Backpressure%3F,-Assuming%20that%20you&quot;&gt;previous
 blog posts&lt;/a&gt;.
+In short, there are two high-level ways of dealing with backpressure. Either 
add more resources (more machines,
+faster CPU, more RAM, better network, using SSDs…) or optimize usage of the 
resources you already have
+(optimize the code, tune the configuration, avoid data skew). In either case, you first need to analyze
+what is causing backpressure by:&lt;/p&gt;
+&lt;ol&gt;
+  &lt;li&gt;Identifying the presence of backpressure.&lt;/li&gt;
+  &lt;li&gt;Locating which subtask(s) or machines are causing it.&lt;/li&gt;
+  &lt;li&gt;Digging deeper into what part of the code is causing it and which resource is scarce.&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Backpressure monitoring improvements and metrics can help you with 
the first two points. To tackle the
+last one, profiling the code can be the way to go. To help with profiling, 
also starting from Flink 1.13,
+&lt;a href=&quot;http://www.brendangregg.com/flamegraphs.html&quot;&gt;Flame 
Graphs&lt;/a&gt; are &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/debugging/flame_graphs/&quot;&gt;integrated
 into Flink’s web UI&lt;/a&gt;.
+Flame Graphs are a well-known profiling and visualization technique, and I encourage you to give them a try.&lt;/p&gt;
+
+&lt;p&gt;But keep in mind that after locating where the bottleneck is, you can 
analyze it the same way you would
+any other non-distributed application (by checking resource utilization, 
attaching a profiler, etc).
+Usually there is no silver bullet for problems like this. You can try to scale up, but sometimes that might
+not be easy or practical to do.&lt;/p&gt;
+
+&lt;p&gt;Anyway… The aforementioned improvements to backpressure monitoring 
allow us to easily detect the source of backpressure,
+and Flame Graphs can help us to analyze why a particular subtask is causing 
problems. Together those two
+features should make the previously quite tedious process of debugging and 
performance analysis of Flink
+jobs that much easier! Please upgrade to Flink 1.13.x and try them 
out!&lt;/p&gt;
+
+&lt;p&gt;[&lt;sup&gt;1&lt;/sup&gt;] There is a third possibility. In the rare 
case when the network exchange is actually the bottleneck in your job,
+the downstream task will have empty input buffers, while the upstream output 
buffers will be full. &lt;a class=&quot;anchor&quot; 
id=&quot;1&quot;&gt;&lt;/a&gt;&lt;/p&gt;
+</description>
+<pubDate>Wed, 07 Jul 2021 02:00:00 +0200</pubDate>
+<link>https://flink.apache.org/2021/07/07/backpressure.html</link>
+<guid isPermaLink="true">/2021/07/07/backpressure.html</guid>
+</item>
+
+<item>
 <title>Apache Flink 1.13.1 Released</title>
 <description>&lt;p&gt;The Apache Flink community released the first bugfix 
version of the Apache Flink 1.13 series.&lt;/p&gt;
 
@@ -19040,95 +19241,5 @@ Feedback through the Flink &lt;a 
href=&quot;http://flink.apache.org/community.ht
 <guid isPermaLink="true">/news/2018/02/15/release-1.4.1.html</guid>
 </item>
 
-<item>
-<title>Managing Large State in Apache Flink: An Intro to Incremental 
Checkpointing</title>
-<description>&lt;p&gt;Apache Flink was purpose-built for 
&lt;em&gt;stateful&lt;/em&gt; stream processing. However, what is state in a 
stream processing application? I defined state and stateful stream processing 
in a &lt;a 
href=&quot;http://flink.apache.org/features/2017/07/04/flink-rescalable-state.html&quot;&gt;previous
 blog post&lt;/a&gt;, and in case you need a refresher, &lt;em&gt;state is 
defined as memory in an application’s operators that stores information about 
previously-seen  [...]
-
-&lt;p&gt;State is a fundamental, enabling concept in stream processing 
required for a majority of complex use cases. Some examples highlighted in the 
&lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/state.html&quot;&gt;Flink
 documentation&lt;/a&gt;:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;When an application searches for certain event patterns, the state 
stores the sequence of events encountered so far.&lt;/li&gt;
-  &lt;li&gt;When aggregating events per minute, the state holds the pending 
aggregates.&lt;/li&gt;
-  &lt;li&gt;When training a machine learning model over a stream of data 
points, the state holds the current version of the model parameters.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;However, stateful stream processing is only useful in production 
environments if the state is fault tolerant. “Fault tolerance” means that even 
if there’s a software or machine failure, the computed end-result is accurate, 
with no data loss or double-counting of events.&lt;/p&gt;
-
-&lt;p&gt;Flink’s fault tolerance has always been a powerful and popular 
feature, minimizing the impact of software or machine failure on your business 
and making it possible to guarantee exactly-once results from a Flink 
application.&lt;/p&gt;
-
-&lt;p&gt;Core to this is &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/checkpointing.html&quot;&gt;checkpointing&lt;/a&gt;,
 which is the mechanism Flink uses to make application state fault tolerant. A 
checkpoint in Flink is a global, asynchronous snapshot of application state 
that’s taken on a regular interval and sent to durable storage (usually, a 
distributed file system). In the event of a failure, Flink restarts an 
application using the most [...]
-
-&lt;p&gt;Before incremental checkpointing, every single Flink checkpoint 
consisted of the full state of an application. We created the incremental 
checkpointing feature after we noticed that writing the full state for every 
checkpoint was often unnecessary, as the state changes from one checkpoint to 
the next were rarely that large. Incremental checkpointing instead maintains 
the differences (or ‘delta’) between each checkpoint and stores only the 
differences between the last checkpoint  [...]
-
-&lt;p&gt;Incremental checkpoints can provide a significant performance 
improvement for jobs with a very large state. Early testing of the feature by a 
production user with terabytes of state shows a drop in checkpoint time from 
more than 3 minutes down to 30 seconds after implementing incremental 
checkpoints. This is because the checkpoint doesn’t need to transfer the full 
state to durable storage on each checkpoint.&lt;/p&gt;
-
-&lt;h3 id=&quot;how-to-start&quot;&gt;How to Start&lt;/h3&gt;
-
-&lt;p&gt;Currently, you can only use incremental checkpointing with a RocksDB 
state back-end, and Flink uses RocksDB’s internal backup mechanism to 
consolidate checkpoint data over time. As a result, the incremental checkpoint 
history in Flink does not grow indefinitely, and Flink eventually consumes and 
prunes old checkpoints automatically.&lt;/p&gt;
-
-&lt;p&gt;To enable incremental checkpointing in your application, I recommend 
you read the &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/large_state_tuning.html#tuning-rocksdb&quot;&gt;the
 Apache Flink documentation on checkpointing&lt;/a&gt; for full details, but in 
summary, you enable checkpointing as normal, but enable incremental 
checkpointing in the constructor by setting the second parameter to 
&lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
-
-&lt;h4 id=&quot;java-example&quot;&gt;Java Example&lt;/h4&gt;
-
-&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code 
class=&quot;language-java&quot;&gt;&lt;span 
class=&quot;n&quot;&gt;StreamExecutionEnvironment&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;na&quot;&gt;getExecutionEnvironment&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;();&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;na&quot;&gt;setStateBackend&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span 
class=&quot;nf&quot;&gt;RocksDBStateBackend&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;filebackend&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kc&quot [...]
-
-&lt;h4 id=&quot;scala-example&quot;&gt;Scala Example&lt;/h4&gt;
-
-&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code 
class=&quot;language-scala&quot;&gt;&lt;span 
class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;nc&quot;&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;getExecutionEnvironment&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;()&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;setStateBackend&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span 
class=&quot;nc&quot;&gt;RocksDBStateBackend&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;filebackend&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kc&quot; [...]
-
-&lt;p&gt;By default, Flink retains 1 completed checkpoint, so if you need a 
higher number, &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/checkpointing.html#related-config-options&quot;&gt;you
 can configure it with the following flag&lt;/a&gt;:&lt;/p&gt;
-
-&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code 
class=&quot;language-java&quot;&gt;&lt;span 
class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;na&quot;&gt;checkpoints&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;na&quot;&gt;num&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;retained&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
-
-&lt;h3 id=&quot;how-it-works&quot;&gt;How it Works&lt;/h3&gt;
-
-&lt;p&gt;Flink’s incremental checkpointing uses &lt;a 
href=&quot;https://github.com/facebook/rocksdb/wiki/Checkpoints&quot;&gt;RocksDB
 checkpoints&lt;/a&gt; as a foundation. RocksDB is a key-value store based on 
‘&lt;a 
href=&quot;https://en.wikipedia.org/wiki/Log-structured_merge-tree&quot;&gt;log-structured-merge&lt;/a&gt;’
 (LSM) trees that collects all changes in a mutable (changeable) in-memory 
buffer called a ‘memtable’. Any updates to the same key in the memtable replace 
previous va [...]
-
-&lt;p&gt;A ‘compaction’ background task merges sstables to consolidate 
potential duplicates for each key, and over time RocksDB deletes the original 
sstables, with the merged sstable containing all information from across all 
the other sstables.&lt;/p&gt;
-
-&lt;p&gt;On top of this, Flink tracks which sstable files RocksDB has created 
and deleted since the previous checkpoint, and as the sstables are immutable, 
Flink uses this to figure out the state changes. To do this, Flink triggers a 
flush in RocksDB, forcing all memtables into sstables on disk, and hard-linked 
in a local temporary directory. This process is synchronous to the processing 
pipeline, and Flink performs all further steps asynchronously and does not 
block processing.&lt;/p&gt;
-
-&lt;p&gt;Then Flink copies all new sstables to stable storage (e.g., HDFS, S3) 
to reference in the new checkpoint. Flink doesn’t copy all sstables that 
already existed in the previous checkpoint to stable storage but re-reference 
them. Any new checkpoints will no longer reference deleted files as deleted 
sstables in RocksDB are always the result of compaction, and it eventually 
replaces old tables with an sstable that is the result of a merge. This how in 
Flink’s incremental checkpoints  [...]
-
-&lt;p&gt;For tracking changes between checkpoints, the uploading of 
consolidated tables is redundant work. Flink performs the process 
incrementally, and typically adds only a small overhead, so we consider this 
worthwhile because it allows Flink to keep a shorter history of checkpoints to 
consider in a recovery.&lt;/p&gt;
-
-&lt;h4 id=&quot;an-example&quot;&gt;An Example&lt;/h4&gt;
-
-&lt;p&gt;&lt;img src=&quot;/img/blog/incremental_cp_impl_example.svg&quot; 
alt=&quot;Example setup&quot; /&gt;
-&lt;em&gt;Example setup&lt;/em&gt;&lt;/p&gt;
-
-&lt;p&gt;Take an example with a subtask of one operator that has a keyed 
state, and the number of retained checkpoints set at 
&lt;strong&gt;2&lt;/strong&gt;. The columns in the figure above show the state 
of the local RocksDB instance for each checkpoint, the files it references, and 
the counts in the shared state registry after the checkpoint 
completes.&lt;/p&gt;
-
-&lt;p&gt;For checkpoint ‘CP 1’, the local RocksDB directory contains two 
sstable files, it considers these new and uploads them to stable storage using 
directory names that match the checkpoint name. When the checkpoint completes, 
Flink creates the two entries in the shared state registry and sets their 
counts to ‘1’. The key in the shared state registry is a composite of an 
operator, subtask, and the original sstable file name. The registry also keeps 
a mapping from the key to the file  [...]
-
-&lt;p&gt;For checkpoint ‘CP 2’, RocksDB has created two new sstable files, and 
the two older ones still exist. For checkpoint ‘CP 2’, Flink adds the two new 
files to stable storage and can reference the previous two files. When the 
checkpoint completes, Flink increases the counts for all referenced files by 
1.&lt;/p&gt;
-
-&lt;p&gt;For checkpoint ‘CP 3’, RocksDB’s compaction has merged 
&lt;code&gt;sstable-(1)&lt;/code&gt;, &lt;code&gt;sstable-(2)&lt;/code&gt;, and 
&lt;code&gt;sstable-(3)&lt;/code&gt; into 
&lt;code&gt;sstable-(1,2,3)&lt;/code&gt; and deleted the original files. This 
merged file contains the same information as the source files, with all 
duplicate entries eliminated. In addition to this merged file, 
&lt;code&gt;sstable-(4)&lt;/code&gt; still exists and there is now a new 
&lt;code&gt;sstable- [...]
-
-&lt;p&gt;For checkpoint ‘CP-4’, RocksDB has merged 
&lt;code&gt;sstable-(4)&lt;/code&gt;, &lt;code&gt;sstable-(5)&lt;/code&gt;, and 
a new &lt;code&gt;sstable-(6)&lt;/code&gt; into 
&lt;code&gt;sstable-(4,5,6)&lt;/code&gt;. Flink adds this new table to stable 
storage and references it together with 
&lt;code&gt;sstable-(1,2,3)&lt;/code&gt;, it increases the counts for 
&lt;code&gt;sstable-(1,2,3)&lt;/code&gt; and 
&lt;code&gt;sstable-(4,5,6)&lt;/code&gt; by 1 and then deletes ‘CP-2’ as the 
num [...]
-
-&lt;h3 id=&quot;race-conditions-and-concurrent-checkpoints&quot;&gt;Race 
Conditions and Concurrent Checkpoints&lt;/h3&gt;
-
-&lt;p&gt;As Flink can execute multiple checkpoints in parallel, sometimes new 
checkpoints start before confirming previous checkpoints as completed. Because 
of this, you should consider which the previous checkpoint to use as a basis 
for a new incremental checkpoint. Flink only references state from a checkpoint 
confirmed by the checkpoint coordinator so that it doesn’t unintentionally 
reference a deleted shared file.&lt;/p&gt;
-
-&lt;h3 
id=&quot;restoring-checkpoints-and-performance-considerations&quot;&gt;Restoring
 Checkpoints and Performance Considerations&lt;/h3&gt;
-
-&lt;p&gt;If you enable incremental checkpointing, there are no further 
configuration steps needed to recover your state in case of failure. If a 
failure occurs, Flink’s &lt;code&gt;JobManager&lt;/code&gt; tells all tasks to 
restore from the last completed checkpoint, be it a full or incremental 
checkpoint. Each &lt;code&gt;TaskManager&lt;/code&gt; then downloads their 
share of the state from the checkpoint on the distributed file system.&lt;/p&gt;
-
-&lt;p&gt;Though the feature can lead to a substantial improvement in 
checkpoint time for users with a large state, there are trade-offs to consider 
with incremental checkpointing. Overall, the process reduces the checkpointing 
time during normal operations but can lead to a longer recovery time depending 
on the size of your state. If the cluster failure is particularly severe and 
the Flink &lt;code&gt;TaskManager&lt;/code&gt;s have to read from multiple 
checkpoints, recovery can be a slo [...]
-
-&lt;p&gt;There are some strategies for improving the convenience/performance 
trade-off, and I recommend you read &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html#basics-of-incremental-checkpoints&quot;&gt;the
 Flink documentation&lt;/a&gt; for more details.&lt;/p&gt;
-
-&lt;p&gt;&lt;em&gt;This post &lt;a 
href=&quot;https://data-artisans.com/blog/managing-large-state-apache-flink-incremental-checkpointing-overview&quot;
 target=&quot;_blank&quot;&gt; originally appeared on the data Artisans blog 
&lt;/a&gt;and was contributed to the Flink blog by Stefan Richter and Chris 
Ward.&lt;/em&gt;&lt;/p&gt;
-&lt;link rel=&quot;canonical&quot; 
href=&quot;https://data-artisans.com/blog/managing-large-state-apache-flink-incremental-checkpointing-overview&quot;
 /&gt;
-
-</description>
-<pubDate>Tue, 30 Jan 2018 13:00:00 +0100</pubDate>
-<link>https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html</link>
-<guid 
isPermaLink="true">/features/2018/01/30/incremental-checkpointing.html</guid>
-</item>
-
 </channel>
 </rss>
diff --git a/content/blog/index.html b/content/blog/index.html
index b944645..2531f3e 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/2021/07/07/backpressure.html">How to 
identify the source of backpressure?</a></h2>
+
+      <p>07 Jul 2021
+       Piotr Nowojski (<a 
href="https://twitter.com/PiotrNowojski";>@PiotrNowojski</a>)</p>
+
+      <p>Apache Flink 1.13 introduced a couple of important changes in the 
area of backpressure monitoring and performance analysis of Flink jobs. This 
blog post aims to introduce those changes and explain how to use them.</p>
+
+      <p><a href="/2021/07/07/backpressure.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></h2>
 
       <p>28 May 2021
@@ -329,21 +342,6 @@ to develop scalable, consistent, and elastic distributed 
applications.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2021/01/29/release-1.10.3.html">Apache Flink 1.10.3 
Released</a></h2>
-
-      <p>29 Jan 2021
-       Xintong Song </p>
-
-      <p><p>The Apache Flink community released the third bugfix version of 
the Apache Flink 1.10 series.</p>
-
-</p>
-
-      <p><a href="/news/2021/01/29/release-1.10.3.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -376,6 +374,16 @@ to develop scalable, consistent, and elastic distributed 
applications.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page10/index.html b/content/blog/page10/index.html
index f6f6c19..b3253dc 100644
--- a/content/blog/page10/index.html
+++ b/content/blog/page10/index.html
@@ -201,6 +201,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2018/09/20/release-1.5.4.html">Apache Flink 1.5.4 Released</a></h2>
+
+      <p>20 Sep 2018
+      </p>
+
+      <p><p>The Apache Flink community released the fourth bugfix version of 
the Apache Flink 1.5 series.</p>
+
+</p>
+
+      <p><a href="/news/2018/09/20/release-1.5.4.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2018/08/21/release-1.5.3.html">Apache Flink 1.5.3 Released</a></h2>
 
       <p>21 Aug 2018
@@ -333,19 +348,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/features/2018/01/30/incremental-checkpointing.html">Managing Large State 
in Apache Flink: An Intro to Incremental Checkpointing</a></h2>
-
-      <p>30 Jan 2018
-       Stefan Ricther (<a 
href="https://twitter.com/StefanRRicther";>@StefanRRicther</a>) &amp; Chris Ward 
(<a href="https://twitter.com/chrischinch";>@chrischinch</a>)</p>
-
-      <p>Flink 1.3.0 introduced incremental checkpointing, making it possible 
for applications with large state to generate checkpoints more efficiently.</p>
-
-      <p><a 
href="/features/2018/01/30/incremental-checkpointing.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -378,6 +380,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page11/index.html b/content/blog/page11/index.html
index d7914bb..8b86f5f 100644
--- a/content/blog/page11/index.html
+++ b/content/blog/page11/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/features/2018/01/30/incremental-checkpointing.html">Managing Large State 
in Apache Flink: An Intro to Incremental Checkpointing</a></h2>
+
+      <p>30 Jan 2018
+       Stefan Richter (<a 
href="https://twitter.com/StefanRRicther";>@StefanRRicther</a>) &amp; Chris Ward 
(<a href="https://twitter.com/chrischinch";>@chrischinch</a>)</p>
+
+      <p>Flink 1.3.0 introduced incremental checkpointing, making it possible 
for applications with large state to generate checkpoints more efficiently.</p>
+
+      <p><a 
href="/features/2018/01/30/incremental-checkpointing.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2017/12/21/2017-year-in-review.html">Apache Flink in 2017: Year in 
Review</a></h2>
 
       <p>21 Dec 2017
@@ -336,20 +349,6 @@ what’s coming in Flink 1.4.0 as well as a preview of what 
the Flink community
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2017/04/04/dynamic-tables.html">Continuous Queries on Dynamic 
Tables</a></h2>
-
-      <p>04 Apr 2017 by Fabian Hueske, Shaoxuan Wang, and Xiaowei Jiang
-      </p>
-
-      <p><p>Flink's relational APIs, the Table API and SQL, are unified APIs 
for stream and batch processing, meaning that a query produces the same result 
when being evaluated on streaming or static data.</p>
-<p>In this blog post we discuss the future of these APIs and introduce the 
concept of Dynamic Tables. Dynamic tables will significantly expand the scope 
of the Table API and SQL on streams and enable many more advanced use cases. We 
discuss how streams and dynamic tables relate to each other and explain the 
semantics of continuously evaluating queries on dynamic tables.</p></p>
-
-      <p><a href="/news/2017/04/04/dynamic-tables.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -382,6 +381,16 @@ what’s coming in Flink 1.4.0 as well as a preview of what 
the Flink community
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page12/index.html b/content/blog/page12/index.html
index b4e09e1..e70b8b2 100644
--- a/content/blog/page12/index.html
+++ b/content/blog/page12/index.html
@@ -201,6 +201,20 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2017/04/04/dynamic-tables.html">Continuous Queries on Dynamic 
Tables</a></h2>
+
+      <p>04 Apr 2017 by Fabian Hueske, Shaoxuan Wang, and Xiaowei Jiang
+      </p>
+
+      <p><p>Flink's relational APIs, the Table API and SQL, are unified APIs 
for stream and batch processing, meaning that a query produces the same result 
when being evaluated on streaming or static data.</p>
+<p>In this blog post we discuss the future of these APIs and introduce the 
concept of Dynamic Tables. Dynamic tables will significantly expand the scope 
of the Table API and SQL on streams and enable many more advanced use cases. We 
discuss how streams and dynamic tables relate to each other and explain the 
semantics of continuously evaluating queries on dynamic tables.</p></p>
+
+      <p><a href="/news/2017/04/04/dynamic-tables.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2017/03/29/table-sql-api-update.html">From Streams to Tables and 
Back Again: An Update on Flink's Table & SQL API</a></h2>
 
       <p>29 Mar 2017 by Timo Walther (<a 
href="https://twitter.com/";>@twalthr</a>)
@@ -329,21 +343,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2016/08/08/release-1.1.0.html">Announcing Apache Flink 
1.1.0</a></h2>
-
-      <p>08 Aug 2016
-      </p>
-
-      <p><div class="alert alert-success"><strong>Important</strong>: The 
Maven artifacts published with version 1.1.0 on Maven central have a Hadoop 
dependency issue. It is highly recommended to use <strong>1.1.1</strong> or 
<strong>1.1.1-hadoop1</strong> as the Flink version.</div>
-
-</p>
-
-      <p><a href="/news/2016/08/08/release-1.1.0.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -376,6 +375,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page13/index.html b/content/blog/page13/index.html
index ed7dadd..fae12ab 100644
--- a/content/blog/page13/index.html
+++ b/content/blog/page13/index.html
@@ -201,6 +201,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2016/08/08/release-1.1.0.html">Announcing Apache Flink 
1.1.0</a></h2>
+
+      <p>08 Aug 2016
+      </p>
+
+      <p><div class="alert alert-success"><strong>Important</strong>: The 
Maven artifacts published with version 1.1.0 on Maven central have a Hadoop 
dependency issue. It is highly recommended to use <strong>1.1.1</strong> or 
<strong>1.1.1-hadoop1</strong> as the Flink version.</div>
+
+</p>
+
+      <p><a href="/news/2016/08/08/release-1.1.0.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2016/05/24/stream-sql.html">Stream 
Processing for Everyone with SQL and Apache Flink</a></h2>
 
       <p>24 May 2016 by Fabian Hueske (<a 
href="https://twitter.com/";>@fhueske</a>)
@@ -330,19 +345,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2015/12/11/storm-compatibility.html">Storm Compatibility in Apache 
Flink: How to run existing Storm topologies on Flink</a></h2>
-
-      <p>11 Dec 2015 by Matthias J. Sax (<a 
href="https://twitter.com/";>@MatthiasJSax</a>)
-      </p>
-
-      <p>In this blog post, we describe Flink's compatibility package for <a 
href="https://storm.apache.org";>Apache Storm</a> that allows to embed Spouts 
(sources) and Bolts (operators) in a regular Flink streaming job. Furthermore, 
the compatibility package provides a Storm compatible API in order to execute 
whole Storm topologies with (almost) no code adaption.</p>
-
-      <p><a href="/news/2015/12/11/storm-compatibility.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -375,6 +377,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page14/index.html b/content/blog/page14/index.html
index 992d112..6d2893a 100644
--- a/content/blog/page14/index.html
+++ b/content/blog/page14/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2015/12/11/storm-compatibility.html">Storm Compatibility in Apache 
Flink: How to run existing Storm topologies on Flink</a></h2>
+
+      <p>11 Dec 2015 by Matthias J. Sax (<a 
href="https://twitter.com/";>@MatthiasJSax</a>)
+      </p>
+
+      <p>In this blog post, we describe Flink's compatibility package for <a 
href="https://storm.apache.org";>Apache Storm</a> that allows to embed Spouts 
(sources) and Bolts (operators) in a regular Flink streaming job. Furthermore, 
the compatibility package provides a Storm compatible API in order to execute 
whole Storm topologies with (almost) no code adaption.</p>
+
+      <p><a href="/news/2015/12/11/storm-compatibility.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2015/12/04/Introducing-windows.html">Introducing Stream Windows in 
Apache Flink</a></h2>
 
       <p>04 Dec 2015 by Fabian Hueske (<a 
href="https://twitter.com/";>@fhueske</a>)
@@ -338,20 +351,6 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2015/05/11/Juggling-with-Bits-and-Bytes.html">Juggling with Bits 
and Bytes</a></h2>
-
-      <p>11 May 2015 by Fabian Hüske (<a 
href="https://twitter.com/";>@fhueske</a>)
-      </p>
-
-      <p><p>Nowadays, a lot of open-source systems for analyzing large data 
sets are implemented in Java or other JVM-based programming languages. The most 
well-known example is Apache Hadoop, but also newer frameworks such as Apache 
Spark, Apache Drill, and also Apache Flink run on JVMs. A common challenge that 
JVM-based data analysis engines face is to store large amounts of data in 
memory - both for caching and for efficient processing such as sorting and 
joining of data. Managing the [...]
-<p>In this blog post we discuss how Apache Flink manages memory, talk about 
its custom data de/serialization stack, and show how it operates on binary 
data.</p></p>
-
-      <p><a href="/news/2015/05/11/Juggling-with-Bits-and-Bytes.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -384,6 +383,16 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page15/index.html b/content/blog/page15/index.html
index 79b69bf..1a821be 100644
--- a/content/blog/page15/index.html
+++ b/content/blog/page15/index.html
@@ -201,6 +201,20 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2015/05/11/Juggling-with-Bits-and-Bytes.html">Juggling with Bits 
and Bytes</a></h2>
+
+      <p>11 May 2015 by Fabian Hüske (<a 
href="https://twitter.com/";>@fhueske</a>)
+      </p>
+
+      <p><p>Nowadays, a lot of open-source systems for analyzing large data 
sets are implemented in Java or other JVM-based programming languages. The most 
well-known example is Apache Hadoop, but also newer frameworks such as Apache 
Spark, Apache Drill, and also Apache Flink run on JVMs. A common challenge that 
JVM-based data analysis engines face is to store large amounts of data in 
memory - both for caching and for efficient processing such as sorting and 
joining of data. Managing the [...]
+<p>In this blog post we discuss how Apache Flink manages memory, talk about 
its custom data de/serialization stack, and show how it operates on binary 
data.</p></p>
+
+      <p><a href="/news/2015/05/11/Juggling-with-Bits-and-Bytes.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2015/04/13/release-0.9.0-milestone1.html">Announcing Flink 
0.9.0-milestone1 preview release</a></h2>
 
       <p>13 Apr 2015
@@ -345,21 +359,6 @@ and offers a new API including definition of flexible 
windows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2014/11/04/release-0.7.0.html">Apache Flink 0.7.0 available</a></h2>
-
-      <p>04 Nov 2014
-      </p>
-
-      <p><p>We are pleased to announce the availability of Flink 0.7.0. This 
release includes new user-facing features as well as performance and bug fixes, 
brings the Scala and Java APIs in sync, and introduces Flink Streaming. A total 
of 34 people have contributed to this release, a big thanks to all of them!</p>
-
-</p>
-
-      <p><a href="/news/2014/11/04/release-0.7.0.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -392,6 +391,16 @@ and offers a new API including definition of flexible 
windows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page16/index.html b/content/blog/page16/index.html
index 5dce80c..03c1059 100644
--- a/content/blog/page16/index.html
+++ b/content/blog/page16/index.html
@@ -201,6 +201,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2014/11/04/release-0.7.0.html">Apache Flink 0.7.0 available</a></h2>
+
+      <p>04 Nov 2014
+      </p>
+
+      <p><p>We are pleased to announce the availability of Flink 0.7.0. This 
release includes new user-facing features as well as performance and bug fixes, 
brings the Scala and Java APIs in sync, and introduces Flink Streaming. A total 
of 34 people have contributed to this release, a big thanks to all of them!</p>
+
+</p>
+
+      <p><a href="/news/2014/11/04/release-0.7.0.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2014/10/03/upcoming_events.html">Upcoming Events</a></h2>
 
       <p>03 Oct 2014
@@ -280,6 +295,16 @@ academic and open source project that Flink originates 
from.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page2/index.html b/content/blog/page2/index.html
index 091a3d3..5a770ab 100644
--- a/content/blog/page2/index.html
+++ b/content/blog/page2/index.html
@@ -201,6 +201,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2021/01/29/release-1.10.3.html">Apache Flink 1.10.3 
Released</a></h2>
+
+      <p>29 Jan 2021
+       Xintong Song </p>
+
+      <p><p>The Apache Flink community released the third bugfix version of 
the Apache Flink 1.10 series.</p>
+
+</p>
+
+      <p><a href="/news/2021/01/29/release-1.10.3.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2021/01/19/release-1.12.1.html">Apache Flink 1.12.1 
Released</a></h2>
 
       <p>19 Jan 2021
@@ -325,19 +340,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html">From 
Aligned to Unaligned Checkpoints - Part 1: Checkpoints, Alignment, and 
Backpressure</a></h2>
-
-      <p>15 Oct 2020
-       Arvid Heise  &amp; Stephan Ewen </p>
-
-      <p>Apache Flink’s checkpoint-based fault tolerance mechanism is one of 
its defining features. Because of that design, Flink unifies batch and stream 
processing, can easily scale to both very small and extremely large scenarios 
and provides support for many operational features. In this post we recap the 
original checkpointing process in Flink, its core properties and issues under 
backpressure.</p>
-
-      <p><a 
href="/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -370,6 +372,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page3/index.html b/content/blog/page3/index.html
index 28673c4..1a2c858 100644
--- a/content/blog/page3/index.html
+++ b/content/blog/page3/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html">From 
Aligned to Unaligned Checkpoints - Part 1: Checkpoints, Alignment, and 
Backpressure</a></h2>
+
+      <p>15 Oct 2020
+       Arvid Heise  &amp; Stephan Ewen </p>
+
+      <p>Apache Flink’s checkpoint-based fault tolerance mechanism is one of 
its defining features. Because of that design, Flink unifies batch and stream 
processing, can easily scale to both very small and extremely large scenarios 
and provides support for many operational features. In this post we recap the 
original checkpointing process in Flink, its core properties and issues under 
backpressure.</p>
+
+      <p><a 
href="/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2020/10/13/stateful-serverless-internals.html">Stateful Functions 
Internals: Behind the scenes of Stateful Serverless</a></h2>
 
       <p>13 Oct 2020
@@ -329,19 +342,6 @@ as well as increased observability for operational 
purposes.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/2020/08/04/pyflink-pandas-udf-support-flink.html">PyFlink: The 
integration of Pandas into PyFlink</a></h2>
-
-      <p>04 Aug 2020
-       Jincheng Sun (<a 
href="https://twitter.com/sunjincheng121">@sunjincheng121</a>) &amp; Markos 
Sfikas (<a href="https://twitter.com/MarkSfik">@MarkSfik</a>)</p>
-
-      <p>The Apache Flink community put some great effort into integrating 
Pandas with PyFlink in the latest Flink version 1.11. Some of the added 
features include support for Pandas UDF and the conversion between Pandas 
DataFrame and Table. In this article, we will introduce how these 
functionalities work and how to use them with a step-by-step example.</p>
-
-      <p><a href="/2020/08/04/pyflink-pandas-udf-support-flink.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -374,6 +374,16 @@ as well as increased observability for operational 
purposes.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page4/index.html b/content/blog/page4/index.html
index fd0830c..636dbc3 100644
--- a/content/blog/page4/index.html
+++ b/content/blog/page4/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/2020/08/04/pyflink-pandas-udf-support-flink.html">PyFlink: The 
integration of Pandas into PyFlink</a></h2>
+
+      <p>04 Aug 2020
+       Jincheng Sun (<a 
href="https://twitter.com/sunjincheng121">@sunjincheng121</a>) &amp; Markos 
Sfikas (<a href="https://twitter.com/MarkSfik">@MarkSfik</a>)</p>
+
+      <p>The Apache Flink community put some great effort into integrating 
Pandas with PyFlink in the latest Flink version 1.11. Some of the added 
features include support for Pandas UDF and the conversion between Pandas 
DataFrame and Table. In this article, we will introduce how these 
functionalities work and how to use them with a step-by-step example.</p>
+
+      <p><a href="/2020/08/04/pyflink-pandas-udf-support-flink.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2020/07/30/demo-fraud-detection-3.html">Advanced Flink Application 
Patterns Vol.3: Custom Window Processing</a></h2>
 
       <p>30 Jul 2020
@@ -335,19 +348,6 @@ and provide a tutorial for running Streaming ETL with 
Flink on Zeppelin.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2020/06/11/community-update.html">Flink Community Update - 
June'20</a></h2>
-
-      <p>11 Jun 2020
-       Marta Paes (<a href="https://twitter.com/morsapaes">@morsapaes</a>)</p>
-
-      <p>And suddenly it’s June. The previous month has been calm on the 
surface, but quite hectic underneath — the final testing phase for Flink 1.11 
is moving at full speed, Stateful Functions 2.1 is out in the wild and Flink 
has made it into Google Season of Docs 2020.</p>
-
-      <p><a href="/news/2020/06/11/community-update.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -380,6 +380,16 @@ and provide a tutorial for running Streaming ETL with 
Flink on Zeppelin.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page5/index.html b/content/blog/page5/index.html
index fbede13..2246b0f 100644
--- a/content/blog/page5/index.html
+++ b/content/blog/page5/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2020/06/11/community-update.html">Flink Community Update - 
June'20</a></h2>
+
+      <p>11 Jun 2020
+       Marta Paes (<a href="https://twitter.com/morsapaes">@morsapaes</a>)</p>
+
+      <p>And suddenly it’s June. The previous month has been calm on the 
surface, but quite hectic underneath — the final testing phase for Flink 1.11 
is moving at full speed, Stateful Functions 2.1 is out in the wild and Flink 
has made it into Google Season of Docs 2020.</p>
+
+      <p><a href="/news/2020/06/11/community-update.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2020/06/09/release-statefun-2.1.0.html">Stateful Functions 2.1.0 
Release Announcement</a></h2>
 
       <p>09 Jun 2020
@@ -326,19 +339,6 @@ This release marks a big milestone: Stateful Functions 2.0 
is not only an API up
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2020/04/01/community-update.html">Flink Community Update - 
April'20</a></h2>
-
-      <p>01 Apr 2020
-       Marta Paes (<a href="https://twitter.com/morsapaes">@morsapaes</a>)</p>
-
-      <p>While things slow down around us, the Apache Flink community is 
privileged to remain as active as ever. This blogpost combs through the past 
few months to give you an update on the state of things in Flink — from core 
releases to Stateful Functions; from some good old community stats to a new 
development blog.</p>
-
-      <p><a href="/news/2020/04/01/community-update.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -371,6 +371,16 @@ This release marks a big milestone: Stateful Functions 2.0 
is not only an API up
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page6/index.html b/content/blog/page6/index.html
index bb47949..735cada 100644
--- a/content/blog/page6/index.html
+++ b/content/blog/page6/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2020/04/01/community-update.html">Flink Community Update - 
April'20</a></h2>
+
+      <p>01 Apr 2020
+       Marta Paes (<a href="https://twitter.com/morsapaes">@morsapaes</a>)</p>
+
+      <p>While things slow down around us, the Apache Flink community is 
privileged to remain as active as ever. This blogpost combs through the past 
few months to give you an update on the state of things in Flink — from core 
releases to Stateful Functions; from some good old community stats to a new 
development blog.</p>
+
+      <p><a href="/news/2020/04/01/community-update.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/features/2020/03/27/flink-for-data-warehouse.html">Flink as Unified 
Engine for Modern Data Warehousing: Production-Ready Hive Integration</a></h2>
 
       <p>27 Mar 2020
@@ -323,21 +336,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2019/12/11/release-1.8.3.html">Apache Flink 1.8.3 Released</a></h2>
-
-      <p>11 Dec 2019
-       Hequn Cheng </p>
-
-      <p><p>The Apache Flink community released the third bugfix version of 
the Apache Flink 1.8 series.</p>
-
-</p>
-
-      <p><a href="/news/2019/12/11/release-1.8.3.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -370,6 +368,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page7/index.html b/content/blog/page7/index.html
index 0ad6d47..698af1e 100644
--- a/content/blog/page7/index.html
+++ b/content/blog/page7/index.html
@@ -201,6 +201,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2019/12/11/release-1.8.3.html">Apache Flink 1.8.3 Released</a></h2>
+
+      <p>11 Dec 2019
+       Hequn Cheng </p>
+
+      <p><p>The Apache Flink community released the third bugfix version of 
the Apache Flink 1.8 series.</p>
+
+</p>
+
+      <p><a href="/news/2019/12/11/release-1.8.3.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2019/12/09/flink-kubernetes-kudo.html">Running Apache Flink on 
Kubernetes with KUDO</a></h2>
 
       <p>09 Dec 2019
@@ -326,19 +341,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/2019/06/26/broadcast-state.html">A 
Practical Guide to Broadcast State in Apache Flink</a></h2>
-
-      <p>26 Jun 2019
-       Fabian Hueske (<a href="https://twitter.com/fhueske">@fhueske</a>)</p>
-
-      <p>Apache Flink has multiple types of operator state, one of which is 
called Broadcast State. In this post, we explain what Broadcast State is, and 
show an example of how it can be applied to an application that evaluates 
dynamic patterns on an event stream.</p>
-
-      <p><a href="/2019/06/26/broadcast-state.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -371,6 +373,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page8/index.html b/content/blog/page8/index.html
index 70f9c98..1cd63ff 100644
--- a/content/blog/page8/index.html
+++ b/content/blog/page8/index.html
@@ -201,6 +201,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/2019/06/26/broadcast-state.html">A 
Practical Guide to Broadcast State in Apache Flink</a></h2>
+
+      <p>26 Jun 2019
+       Fabian Hueske (<a href="https://twitter.com/fhueske">@fhueske</a>)</p>
+
+      <p>Apache Flink has multiple types of operator state, one of which is 
called Broadcast State. In this post, we explain what Broadcast State is, and 
show an example of how it can be applied to an application that evaluates 
dynamic patterns on an event stream.</p>
+
+      <p><a href="/2019/06/26/broadcast-state.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/2019/06/05/flink-network-stack.html">A 
Deep-Dive into Flink's Network Stack</a></h2>
 
       <p>05 Jun 2019
@@ -325,21 +338,6 @@ for more details.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 Released</a></h2>
-
-      <p>25 Feb 2019
-      </p>
-
-      <p><p>The Apache Flink community released the fourth bugfix version of 
the Apache Flink 1.6 series.</p>
-
-</p>
-
-      <p><a href="/news/2019/02/25/release-1.6.4.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -372,6 +370,16 @@ for more details.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/blog/page9/index.html b/content/blog/page9/index.html
index 14d2492..7bed33b 100644
--- a/content/blog/page9/index.html
+++ b/content/blog/page9/index.html
@@ -201,6 +201,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2019/02/25/release-1.6.4.html">Apache Flink 1.6.4 Released</a></h2>
+
+      <p>25 Feb 2019
+      </p>
+
+      <p><p>The Apache Flink community released the fourth bugfix version of 
the Apache Flink 1.6 series.</p>
+
+</p>
+
+      <p><a href="/news/2019/02/25/release-1.6.4.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2019/02/15/release-1.7.2.html">Apache Flink 1.7.2 Released</a></h2>
 
       <p>15 Feb 2019
@@ -335,21 +350,6 @@ Please check the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2018/09/20/release-1.5.4.html">Apache Flink 1.5.4 Released</a></h2>
-
-      <p>20 Sep 2018
-      </p>
-
-      <p><p>The Apache Flink community released the fourth bugfix version of 
the Apache Flink 1.5 series.</p>
-
-</p>
-
-      <p><a href="/news/2018/09/20/release-1.5.4.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -382,6 +382,16 @@ Please check the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <ul id="markdown-toc">
       
+      <li><a href="/2021/07/07/backpressure.html">How to identify the source 
of backpressure?</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 1.13.1 
Released</a></li>
 
       
diff --git a/content/index.html b/content/index.html
index e356c66..fbc42c6 100644
--- a/content/index.html
+++ b/content/index.html
@@ -577,6 +577,9 @@
 
   <dl>
       
+        <dt> <a href="/2021/07/07/backpressure.html">How to identify the 
source of backpressure?</a></dt>
+        <dd>Apache Flink 1.13 introduced a couple of important changes in the 
area of backpressure monitoring and performance analysing of Flink Jobs. This 
blog post aims to introduce those changes and explain how to use them.</dd>
+      
         <dt> <a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 
1.13.1 Released</a></dt>
         <dd><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.13 series.</p>
 
@@ -592,11 +595,6 @@
       
         <dt> <a href="/news/2021/05/03/release-1.13.0.html">Apache Flink 
1.13.0 Release Announcement</a></dt>
         <dd>The Apache Flink community is excited to announce the release of 
Flink 1.13.0! Around 200 contributors worked on over 1,000 issues to bring 
significant improvements to usability and observability as well as new features 
that improve the elasticity of Flink's Application-style deployments.</dd>
-      
-        <dt> <a href="/news/2021/04/29/release-1.12.3.html">Apache Flink 
1.12.3 Released</a></dt>
-        <dd><p>The Apache Flink community released the next bugfix version of 
the Apache Flink 1.12 series.</p>
-
-</dd>
     
   </dl>
 
diff --git a/content/zh/index.html b/content/zh/index.html
index fece035..3296781 100644
--- a/content/zh/index.html
+++ b/content/zh/index.html
@@ -570,6 +570,9 @@
 
   <dl>
       
+        <dt> <a href="/2021/07/07/backpressure.html">How to identify the 
source of backpressure?</a></dt>
+        <dd>Apache Flink 1.13 introduced a couple of important changes in the 
area of backpressure monitoring and performance analysing of Flink Jobs. This 
blog post aims to introduce those changes and explain how to use them.</dd>
+      
         <dt> <a href="/news/2021/05/28/release-1.13.1.html">Apache Flink 
1.13.1 Released</a></dt>
         <dd><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.13 series.</p>
 
@@ -585,11 +588,6 @@
       
         <dt> <a href="/news/2021/05/03/release-1.13.0.html">Apache Flink 
1.13.0 Release Announcement</a></dt>
         <dd>The Apache Flink community is excited to announce the release of 
Flink 1.13.0! Around 200 contributors worked on over 1,000 issues to bring 
significant improvements to usability and observability as well as new features 
that improve the elasticity of Flink's Application-style deployments.</dd>
-      
-        <dt> <a href="/news/2021/04/29/release-1.12.3.html">Apache Flink 
1.12.3 Released</a></dt>
-        <dd><p>The Apache Flink community released the next bugfix version of 
the Apache Flink 1.12 series.</p>
-
-</dd>
     
   </dl>
 
diff --git a/img/blog/2021-07-07-backpressure/animated.png 
b/img/blog/2021-07-07-backpressure/animated.png
new file mode 100644
index 0000000..40ea809
Binary files /dev/null and b/img/blog/2021-07-07-backpressure/animated.png 
differ
diff --git a/img/blog/2021-07-07-backpressure/bottleneck-zoom.png 
b/img/blog/2021-07-07-backpressure/bottleneck-zoom.png
new file mode 100644
index 0000000..b4e5b80
Binary files /dev/null and 
b/img/blog/2021-07-07-backpressure/bottleneck-zoom.png differ
diff --git a/img/blog/2021-07-07-backpressure/simple-example.png 
b/img/blog/2021-07-07-backpressure/simple-example.png
new file mode 100644
index 0000000..fceccb3
Binary files /dev/null and 
b/img/blog/2021-07-07-backpressure/simple-example.png differ
diff --git a/img/blog/2021-07-07-backpressure/sliding-window.png 
b/img/blog/2021-07-07-backpressure/sliding-window.png
new file mode 100644
index 0000000..dcd3240
Binary files /dev/null and 
b/img/blog/2021-07-07-backpressure/sliding-window.png differ
diff --git a/img/blog/2021-07-07-backpressure/source-task-busy.png 
b/img/blog/2021-07-07-backpressure/source-task-busy.png
new file mode 100644
index 0000000..8f72b54
Binary files /dev/null and 
b/img/blog/2021-07-07-backpressure/source-task-busy.png differ
diff --git a/img/blog/2021-07-07-backpressure/subtasks.png 
b/img/blog/2021-07-07-backpressure/subtasks.png
new file mode 100644
index 0000000..bf6ede6
Binary files /dev/null and b/img/blog/2021-07-07-backpressure/subtasks.png 
differ