[flink-web] 05/05: Rebuild website

nkruber Tue, 23 Jul 2019 08:48:44 -0700

This is an automated email from the ASF dual-hosted git repository.

nkruber pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/flink-web.git


commit 606b5df34df5a3cc2ba216368de810a0c37b8247
Author: Nico Kruber <n...@ververica.com>
AuthorDate: Tue Jul 23 17:47:00 2019 +0200

    Rebuild website
---
 content/2019/07/23/flink-network-stack-2.html      | 579 +++++++++++++++++++++
 content/blog/feed.xml                              | 360 +++++++++++++
 content/blog/index.html                            |  36 +-
 content/blog/page2/index.html                      |  38 +-
 content/blog/page3/index.html                      |  38 +-
 content/blog/page4/index.html                      |  38 +-
 content/blog/page5/index.html                      |  40 +-
 content/blog/page6/index.html                      |  40 +-
 content/blog/page7/index.html                      |  40 +-
 content/blog/page8/index.html                      |  40 +-
 content/blog/page9/index.html                      |  25 +
 content/css/flink.css                              |   5 +
 .../back_pressure_sampling_high.png                | Bin 0 -> 77546 bytes
 content/index.html                                 |   6 +-
 content/roadmap.html                               |   4 +-
 content/zh/community.html                          |   6 +
 content/zh/index.html                              |   6 +-
 17 files changed, 1177 insertions(+), 124 deletions(-)

diff --git a/content/2019/07/23/flink-network-stack-2.html 
b/content/2019/07/23/flink-network-stack-2.html
new file mode 100644
index 0000000..3601198
--- /dev/null
+++ b/content/2019/07/23/flink-network-stack-2.html
@@ -0,0 +1,579 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <!-- The above 3 meta tags *must* come first in the head; any other head 
content must come *after* these tags -->
+    <title>Apache Flink: Flink Network Stack Vol. 2: Monitoring, Metrics, and 
that Backpressure Thing</title>
+    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
+    <link rel="icon" href="/favicon.ico" type="image/x-icon">
+
+    <!-- Bootstrap -->
+    <link rel="stylesheet" 
href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css">
+    <link rel="stylesheet" href="/css/flink.css">
+    <link rel="stylesheet" href="/css/syntax.css">
+
+    <!-- Blog RSS feed -->
+    <link href="/blog/feed.xml" rel="alternate" type="application/rss+xml" 
title="Apache Flink Blog: RSS feed" />
+
+    <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
+    <!-- We need to load Jquery in the header for custom google analytics 
event tracking-->
+    <script src="/js/jquery.min.js"></script>
+
+    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media 
queries -->
+    <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
+    <!--[if lt IE 9]>
+      <script 
src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
+      <script 
src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
+    <![endif]-->
+  </head>
+  <body>  
+    
+
+    <!-- Main content. -->
+    <div class="container">
+    <div class="row">
+
+      
+     <div id="sidebar" class="col-sm-3">
+        
+
+<!-- Top navbar. -->
+    <nav class="navbar navbar-default">
+        <!-- The logo. -->
+        <div class="navbar-header">
+          <button type="button" class="navbar-toggle collapsed" 
data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <div class="navbar-logo">
+            <a href="/">
+              <img alt="Apache Flink" src="/img/flink-header-logo.svg" 
width="147px" height="73px">
+            </a>
+          </div>
+        </div><!-- /.navbar-header -->
+
+        <!-- The navigation links. -->
+        <div class="collapse navbar-collapse" 
id="bs-example-navbar-collapse-1">
+          <ul class="nav navbar-nav navbar-main">
+
+            <!-- First menu section explains to visitors what Flink is -->
+
+            <!-- What is Stream Processing? -->
+            <!--
+            <li><a href="/streamprocessing1.html">What is Stream 
Processing?</a></li>
+            -->
+
+            <!-- What is Flink? -->
+            <li><a href="/flink-architecture.html">What is Apache 
Flink?</a></li>
+
+            
+            <ul class="nav navbar-nav navbar-subnav">
+              <li >
+                  <a href="/flink-architecture.html">Architecture</a>
+              </li>
+              <li >
+                  <a href="/flink-applications.html">Applications</a>
+              </li>
+              <li >
+                  <a href="/flink-operations.html">Operations</a>
+              </li>
+            </ul>
+            
+
+            <!-- Use cases -->
+            <li><a href="/usecases.html">Use Cases</a></li>
+
+            <!-- Powered by -->
+            <li><a href="/poweredby.html">Powered By</a></li>
+
+            <!-- FAQ -->
+            <li><a href="/faq.html">FAQ</a></li>
+
+            &nbsp;
+            <!-- Second menu section aims to support Flink users -->
+
+            <!-- Downloads -->
+            <li><a href="/downloads.html">Downloads</a></li>
+
+            <!-- Quickstart -->
+            <li>
+              <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/quickstart/setup_quickstart.html"
 target="_blank">Tutorials <small><span class="glyphicon 
glyphicon-new-window"></span></small></a>
+            </li>
+
+            <!-- Documentation -->
+            <li class="dropdown">
+              <a class="dropdown-toggle" data-toggle="dropdown" 
href="#">Documentation<span class="caret"></span></a>
+              <ul class="dropdown-menu">
+                <li><a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8" 
target="_blank">1.8 (Latest stable release) <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+                <li><a 
href="https://ci.apache.org/projects/flink/flink-docs-master" 
target="_blank">1.9 (Snapshot) <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+              </ul>
+            </li>
+
+            <!-- getting help -->
+            <li><a href="/gettinghelp.html">Getting Help</a></li>
+
+            <!-- Blog -->
+            <li><a href="/blog/"><b>Flink Blog</b></a></li>
+
+            &nbsp;
+
+            <!-- Third menu section aims to support community and contributors 
-->
+
+            <!-- Community -->
+            <li><a href="/community.html">Community &amp; Project Info</a></li>
+
+            <!-- Roadmap -->
+            <li><a href="/roadmap.html">Roadmap</a></li>
+
+            <!-- Contribute -->
+            <li><a href="/contributing/how-to-contribute.html">How to 
Contribute</a></li>
+            
+
+            <!-- GitHub -->
+            <li>
+              <a href="https://github.com/apache/flink" target="_blank">Flink 
on GitHub <small><span class="glyphicon 
glyphicon-new-window"></span></small></a>
+            </li>
+
+            &nbsp;
+
+            <!-- Language Switcher -->
+            <li>
+              
+                
+                  <a href="/zh/2019/07/23/flink-network-stack-2.html">中文版</a>
+                
+              
+            </li>
+
+          </ul>
+
+          <ul class="nav navbar-nav navbar-bottom">
+          <hr />
+
+            <!-- Twitter -->
+            <li><a href="https://twitter.com/apacheflink" 
target="_blank">@ApacheFlink <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+
+            <!-- Visualizer -->
+            <li class=" hidden-md hidden-sm"><a href="/visualizer/" 
target="_blank">Plan Visualizer <small><span class="glyphicon 
glyphicon-new-window"></span></small></a></li>
+
+          </ul>
+        </div><!-- /.navbar-collapse -->
+    </nav>
+
+      </div>
+      <div class="col-sm-9">
+      <div class="row-fluid">
+  <div class="col-sm-12">
+    <div class="row">
+      <h1>Flink Network Stack Vol. 2: Monitoring, Metrics, and that 
Backpressure Thing</h1>
+
+      <article>
+        <p>23 Jul 2019 Nico Kruber  &amp; Piotr Nowojski </p>
+
+<style type="text/css">
+.tg  {border-collapse:collapse;border-spacing:0;}
+.tg td{padding:10px 
10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
+.tg th{padding:10px 
10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;background-color:#eff0f1;}
+.tg .tg-wide{padding:10px 30px;}
+.tg .tg-top{vertical-align:top}
+.tg .tg-topcenter{text-align:center;vertical-align:top}
+.tg .tg-center{text-align:center;vertical-align:middle}
+</style>
+
+<p>In a <a href="/2019/06/05/flink-network-stack.html">previous blog post</a>, 
we presented how Flink’s network stack works from the high-level abstractions 
to the low-level details. This second blog post in the series of network stack 
posts builds on this knowledge and discusses monitoring network-related 
metrics to identify effects such as backpressure or bottlenecks in throughput 
and latency. Although this post briefly covers what to do with backpressure, 
the topic of tuning the netw [...]
+
+<div class="page-toc">
+<ul id="markdown-toc">
+  <li><a href="#monitoring" id="markdown-toc-monitoring">Monitoring</a>    <ul>
+      <li><a href="#backpressure-monitor" 
id="markdown-toc-backpressure-monitor">Backpressure Monitor</a></li>
+    </ul>
+  </li>
+  <li><a href="#network-metrics" id="markdown-toc-network-metrics">Network 
Metrics</a>    <ul>
+      <li><a href="#backpressure" 
id="markdown-toc-backpressure">Backpressure</a></li>
+      <li><a href="#resource-usage--throughput" 
id="markdown-toc-resource-usage--throughput">Resource Usage / 
Throughput</a></li>
+      <li><a href="#latency-tracking" 
id="markdown-toc-latency-tracking">Latency Tracking</a></li>
+    </ul>
+  </li>
+  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
+</ul>
+
+</div>
+
+<h2 id="monitoring">Monitoring</h2>
+
+<p>Probably the most important part of network monitoring is <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html">monitoring
 backpressure</a>, a situation where a system is receiving data at a higher 
rate than it can process¹. Such behaviour will result in the sender being 
backpressured and may be caused by two things:</p>
+
+<ul>
+  <li>
+    <p>The receiver is slow.<br />
+This can happen because the receiver is backpressured itself, is unable to 
keep processing at the same rate as the sender, or is temporarily blocked by 
garbage collection, lack of system resources, or I/O.</p>
+  </li>
+  <li>
+    <p>The network channel is slow.<br />
+  Even though in such a case the receiver is not (directly) involved, we call 
the sender backpressured due to a potential oversubscription on network 
bandwidth shared by all subtasks running on the same machine. Beware that, in 
addition to Flink’s network stack, there may be more network users, such as 
sources and sinks, distributed file systems (checkpointing, network-attached 
storage), logging, and metrics. A previous <a 
href="https://www.ververica.com/blog/how-to-size-your-apache-flink- [...]
+  </li>
+</ul>
+
+<p><sup>1</sup> In case you are unfamiliar with backpressure and how it 
interacts with Flink, we recommend reading through <a 
href="https://www.ververica.com/blog/how-flink-handles-backpressure">this blog 
post on backpressure</a> from 2015.</p>
+
+<p><br />
+If backpressure occurs, it will bubble upstream and eventually reach your 
sources and slow them down. This is not a bad thing per se and merely states 
that you lack resources for the current load. However, you may want to improve 
your job so that it can cope with higher loads without using more resources. In 
order to do so, you need to find (1) where (at which task/operator) the 
bottleneck is and (2) what is causing it. Flink offers two mechanisms for 
identifying where the bottleneck is:</p>
+
+<ul>
+  <li>directly via Flink’s web UI and its backpressure monitor, or</li>
+  <li>indirectly through some of the network metrics.</li>
+</ul>
+
+<p>Flink’s web UI is likely the first entry point for quick troubleshooting 
but has some disadvantages that we will explain below. On the other hand, 
Flink’s network metrics are better suited for continuous monitoring and 
reasoning about the exact nature of the bottleneck causing backpressure. We 
will cover both in the sections below. In both cases, you need to identify the 
origin of backpressure from the sources to the sinks. Your starting point for 
the current and future investigatio [...]
+
+<h3 id="backpressure-monitor">Backpressure Monitor</h3>
+
+<p>The <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html">backpressure
 monitor</a> is only exposed via Flink’s web UI². Since it’s an active 
component that is only triggered on request, it is currently not available via 
metrics. The backpressure monitor samples the running tasks’ threads on all 
TaskManagers via <code>Thread.getStackTrace()</code> and computes the number of 
samples where tasks were blocked on a buffer request. These tasks w [...]
+
+<ul>
+  <li><span style="color:green">OK</span> for <code>Ratio ≤ 0.10</code>,</li>
+  <li><span style="color:orange">LOW</span> for <code>0.10 &lt; Ratio ≤ 
0.5</code>, and</li>
+  <li><span style="color:red">HIGH</span> for <code>0.5 &lt; Ratio ≤ 
1</code>.</li>
+</ul>
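+<p>The thresholds above can be expressed as a small helper. This is only an 
illustrative sketch in Python, not Flink's implementation: the function names 
are hypothetical, and the ratio is assumed to be the share of stack-trace 
samples in which the task was blocked on a buffer request.</p>

```python
def backpressure_ratio(blocked_samples: int, total_samples: int) -> float:
    """Share of samples in which the task was blocked on a buffer request.

    Assumption: the ratio is blocked samples divided by all samples taken.
    """
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    if not 0 <= blocked_samples <= total_samples:
        raise ValueError("blocked_samples must be within [0, total_samples]")
    return blocked_samples / total_samples


def backpressure_status(ratio: float) -> str:
    """Map a ratio in [0, 1] to the status labels listed above."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.5:
        return "LOW"
    return "HIGH"
```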
+
+<p>Although you can tune things like the refresh-interval, the number of 
samples, or the delay between samples, normally, you would not need to touch 
these since the defaults already give good-enough results.</p>
+
+<center>
+<img 
src="/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png" 
width="600px" alt="Backpressure sampling: high" />
+</center>
+
+<p><sup>2</sup> You may also access the backpressure monitor via the REST API: 
<code>/jobs/:jobid/vertices/:vertexid/backpressure</code></p>
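+<p>A minimal sketch of calling this REST endpoint from Python's standard 
library follows. The base URL (for example <code>http://localhost:8081</code>), 
the job ID and the vertex ID are placeholders you need to supply, the helper 
names are hypothetical, and the structure of the returned JSON is not assumed 
here; inspect it for your Flink version.</p>

```python
import json
from urllib.request import urlopen


def backpressure_url(base_url: str, job_id: str, vertex_id: str) -> str:
    """Build the backpressure endpoint URL for one job vertex."""
    return f"{base_url.rstrip('/')}/jobs/{job_id}/vertices/{vertex_id}/backpressure"


def fetch_backpressure(base_url: str, job_id: str, vertex_id: str,
                       timeout: float = 10.0) -> dict:
    """Request a backpressure sample and return the parsed JSON response.

    Sampling is triggered on request, so the first call may not yet
    contain results; calling again after a short wait may be necessary.
    """
    with urlopen(backpressure_url(base_url, job_id, vertex_id),
                 timeout=timeout) as response:
        return json.load(response)
```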
+
+<p><br />
+The backpressure monitor can help you find where (at which task/operator) 
backpressure originates. However, it does not help you reason further about 
its causes. Additionally, for larger jobs or higher 
parallelism, the backpressure monitor becomes too crowded to use and may also 
take some time to gather all information from all TaskManagers. Please also 
note that sampling may affect your running job’s performance.</p>
+
+<h2 id="network-metrics">Network Metrics</h2>
+
+<p><a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#network">Network</a>
 and <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#io">task
 I/O</a> metrics are more lightweight than the backpressure monitor and are 
continuously published for each running job. We can leverage those and get even 
more insights, not only for backpressure monitoring. The most relevant metrics 
for users are:</p>
+
+<ul>
+  <li>
+    <p><strong><span style="color:orange">up to Flink 1.8:</span></strong> 
<code>outPoolUsage</code>, <code>inPoolUsage</code><br />
+An estimate of the ratio of buffers used vs. buffers available in the 
respective local buffer pools.
+While interpreting <code>inPoolUsage</code> in Flink 1.5 - 1.8 with 
credit-based flow control, please note that this only relates to floating 
buffers (exclusive buffers are not part of the pool).</p>
+  </li>
+  <li>
+    <p><strong><span style="color:green">Flink 1.9 and above:</span></strong> 
<code>outPoolUsage</code>, <code>inPoolUsage</code>, 
<code>floatingBuffersUsage</code>, <code>exclusiveBuffersUsage</code><br />
+An estimate of the ratio of buffers used vs. buffers available in the 
respective local buffer pools.
+Starting with Flink 1.9, <code>inPoolUsage</code> is the sum of 
<code>floatingBuffersUsage</code> and <code>exclusiveBuffersUsage</code>.</p>
+  </li>
+  <li>
+    <p><code>numRecordsOut</code>, <code>numRecordsIn</code><br />
+Each metric comes with two scopes: one scoped to the operator and one scoped 
to the subtask. For network monitoring, the subtask-scoped metric is relevant 
and shows the total number of records it has sent/received. You may need to 
further look into these figures to extract the number of records within a 
certain time span or use the equivalent <code>…PerSecond</code> metrics.</p>
+  </li>
+  <li>
+    <p><code>numBytesOut</code>, <code>numBytesInLocal</code>, 
<code>numBytesInRemote</code><br />
+The total number of bytes this subtask has emitted or read from a local/remote 
source. These are also available as meters via <code>…PerSecond</code> 
metrics.</p>
+  </li>
+  <li>
+    <p><code>numBuffersOut</code>, <code>numBuffersInLocal</code>, 
<code>numBuffersInRemote</code><br />
+Similar to <code>numBytes…</code> but counting the number of network 
buffers.</p>
+  </li>
+</ul>
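+<p>Since counters such as <code>numRecordsOut</code> are running totals, the 
number of records within a time span is the difference between two samples. A 
minimal Python sketch, assuming you collect the two counter values and their 
timestamps yourself (the function name is hypothetical):</p>

```python
def rate_per_second(count_then: int, count_now: int,
                    seconds_elapsed: float) -> float:
    """Average per-second rate between two samples of a running total."""
    if seconds_elapsed <= 0:
        raise ValueError("seconds_elapsed must be positive")
    if count_now < count_then:
        # A counter that went backwards usually means the task restarted.
        raise ValueError("counter decreased between samples")
    return (count_now - count_then) / seconds_elapsed
```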
+
+<div class="alert alert-warning">
+  <p><span class="label label-warning" style="display: inline-block"><span 
class="glyphicon glyphicon-warning-sign" aria-hidden="true"></span> 
Warning</span>
+For the sake of completeness and since they have been used in the past, we 
will briefly look at the <code>outputQueueLength</code> and 
<code>inputQueueLength</code> metrics. These are somewhat similar to the 
<code>[out,in]PoolUsage</code> metrics but show the number of buffers sitting 
in a sender subtask’s output queues and in a receiver subtask’s input queues, 
respectively. Reasoning about absolute numbers of buffers, however, is 
difficult and there is also a special subtlety with local [...]
+
+  <p>Overall, <strong>we discourage the use of</strong> 
<code>outputQueueLength</code> <strong>and</strong> 
<code>inputQueueLength</code> because their interpretation highly depends on 
the current parallelism of the operator and the configured numbers of exclusive 
and floating buffers. Instead, we recommend using the various 
<code>*PoolUsage</code> metrics which reveal even more detailed insight.</p>
+</div>
+
+<div class="alert alert-info">
+  <p><span class="label label-info" style="display: inline-block"><span 
class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
+ If you reason about buffer usage, please keep the following in mind:</p>
+
+  <ul>
+    <li>Any outgoing channel which has been used at least once will always 
occupy one buffer (since Flink 1.5).
+      <ul>
+        <li><strong><span style="color:orange">up to Flink 
1.8:</span></strong> This buffer (even if empty!) was always counted as a 
backlog of 1 and thus receivers tried to reserve a floating buffer for it.</li>
+        <li><strong><span style="color:green">Flink 1.9 and 
above:</span></strong> A buffer is only counted in the backlog if it is ready 
for consumption, i.e. it is full or was flushed (see FLINK-11082)</li>
+      </ul>
+    </li>
+    <li>The receiver will only release a received buffer after deserialising 
the last record in it.</li>
+  </ul>
+</div>
+
+<p>The following sections make use of and combine these metrics to reason 
about backpressure and resource usage / efficiency with respect to throughput. 
A separate section will detail latency-related metrics.</p>
+
+<h3 id="backpressure">Backpressure</h3>
+
+<p>Backpressure may be indicated by two different sets of metrics: (local) 
buffer pool usages as well as input/output queue lengths. They provide a 
different level of granularity but, unfortunately, none of these are exhaustive 
and there is room for interpretation. Because of the inherent problems with 
interpreting these queue lengths, we will focus on the usage of input and output 
pools below, which also provides more detail.</p>
+
+<ul>
+  <li>
+    <p><strong>If a subtask’s</strong> <code>outPoolUsage</code> <strong>is 
100%</strong>, it is backpressured. Whether the subtask is already blocking or 
still writing records into network buffers depends on how full the buffers are 
that the <code>RecordWriters</code> are currently writing into.<br />
+<span class="glyphicon glyphicon-warning-sign" aria-hidden="true" 
style="color:orange;"></span> This is different to what the backpressure 
monitor is showing!</p>
+  </li>
+  <li>
+    <p>An <code>inPoolUsage</code> of 100% means that all floating buffers are 
assigned to channels and eventually backpressure will be exercised upstream. 
These floating buffers are in either of the following conditions: they are 
reserved for future use on a channel due to an exclusive buffer being utilised 
(remote input channels always try to maintain <code>#exclusive buffers</code> 
credits), they are reserved for a sender’s backlog and wait for data, they may 
contain data and are enqu [...]
+  </li>
+  <li>
+    <p><strong><span style="color:orange">up to Flink 1.8:</span></strong> Due 
to <a href="https://issues.apache.org/jira/browse/FLINK-11082">FLINK-11082</a>, 
an <code>inPoolUsage</code> of 100% is quite common even in normal 
situations.</p>
+  </li>
+  <li>
+    <p><strong><span style="color:green">Flink 1.9 and above:</span></strong> 
If <code>inPoolUsage</code> is constantly around 100%, this is a strong 
indicator for exercising backpressure upstream.</p>
+  </li>
+</ul>
+
+<p>The following table summarises all combinations and their interpretation. 
Bear in mind, though, that backpressure may be minor or temporary (no need to 
look into it), on particular channels only, or caused by other JVM processes on 
a particular TaskManager, such as GC, synchronisation, I/O, resource shortage, 
instead of a specific subtask.</p>
+
+<center>
+<table class="tg">
+  <tr>
+    <th></th>
+    <th class="tg-center"><code>outPoolUsage</code> low</th>
+    <th class="tg-center"><code>outPoolUsage</code> high</th>
+  </tr>
+  <tr>
+    <th class="tg-top"><code>inPoolUsage</code> low</th>
+    <td class="tg-topcenter">
+      <span class="glyphicon glyphicon-ok-sign" aria-hidden="true" 
style="color:green;font-size:1.5em;"></span></td>
+    <td class="tg-topcenter">
+      <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" 
style="color:orange;font-size:1.5em;"></span><br />
+      (backpressured, temporary situation: upstream is not backpressured yet 
or not anymore)</td>
+  </tr>
+  <tr>
+    <th class="tg-top" rowspan="2">
+      <code>inPoolUsage</code> high<br />
+      (<strong><span style="color:green">Flink 1.9+</span></strong>)</th>
+    <td class="tg-topcenter">
+      if all upstream tasks’ <code>outPoolUsage</code> are low: <span 
class="glyphicon glyphicon-warning-sign" aria-hidden="true" 
style="color:orange;font-size:1.5em;"></span><br />
+      (may eventually cause backpressure)</td>
+    <td class="tg-topcenter" rowspan="2">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" 
style="color:red;font-size:1.5em;"></span><br />
+      (backpressured by downstream task(s) or network, probably forwarding 
backpressure upstream)</td>
+  </tr>
+  <tr>
+    <td class="tg-topcenter">if any upstream task’s <code>outPoolUsage</code> 
is high: <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" 
style="color:red;font-size:1.5em;"></span><br />
+      (may exercise backpressure upstream and may be the source of 
backpressure)</td>
+  </tr>
+</table>
+</center>
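+<p>The table above can be read as a simple decision function. The Python 
sketch below is illustrative only: the function name is hypothetical, usages 
are assumed to be fractions between 0 and 1, and the cut-off for "high" 
(<code>high_threshold</code>) is an assumption, since the post does not define 
one.</p>

```python
def interpret_pool_usage(in_pool_usage: float, out_pool_usage: float,
                         any_upstream_out_high: bool,
                         high_threshold: float = 0.9) -> str:
    """Interpret a subtask's pool usages following the table above."""
    in_high = in_pool_usage >= high_threshold    # assumed cut-off
    out_high = out_pool_usage >= high_threshold  # assumed cut-off

    if not in_high and not out_high:
        return "ok"
    if not in_high and out_high:
        return ("backpressured, temporary situation: upstream is not "
                "backpressured yet or not anymore")
    if in_high and out_high:
        return ("backpressured by downstream task(s) or network, probably "
                "forwarding backpressure upstream")
    # inPoolUsage high, outPoolUsage low: look at the upstream tasks.
    if any_upstream_out_high:
        return ("may exercise backpressure upstream and may be the source "
                "of backpressure")
    return "may eventually cause backpressure"
```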
+
+<p><br />
+We can reason further about the cause of backpressure by looking at the 
network metrics of the subtasks of two consecutive tasks:</p>
+
+<ul>
+  <li>If all subtasks of the receiver task have low <code>inPoolUsage</code> 
values and any upstream subtask’s <code>outPoolUsage</code> is high, then there 
may be a network bottleneck causing backpressure.
+Since network is a shared resource among all subtasks of a TaskManager, this 
may not directly originate from this subtask, but rather from various 
concurrent operations, e.g. checkpoints, other streams, external connections, 
or other TaskManagers/processes on the same machine.</li>
+</ul>
+
+<p>Backpressure can also be caused by all parallel instances of a task or by a 
single task instance. The former usually happens because the task is performing 
some time-consuming operation that applies to all input partitions. The latter 
is usually the result of some kind of skew, either data skew or resource 
availability/allocation skew. In either case, you can find some hints on how to 
handle such situations in the <a 
href="#span-classlabel-label-info-styledisplay-inline-blockspan-class [...]
+
+<div class="alert alert-info">
+  <h3 class="no_toc" 
id="span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-flink-19-and-above"><span
 class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Flink 1.9 and 
above</h3>
+
+  <ul>
+    <li>If <code>floatingBuffersUsage</code> is not 100%, it is unlikely that 
there is backpressure. If it is 100% and any upstream task is backpressured, it 
suggests that this input is exercising backpressure on either a single, some or 
all input channels. To differentiate between those three situations you can use 
<code>exclusiveBuffersUsage</code>:
+      <ul>
+        <li>Assuming that <code>floatingBuffersUsage</code> is around 100%, 
the higher the <code>exclusiveBuffersUsage</code> the more input channels are 
backpressured. In an extreme case of <code>exclusiveBuffersUsage</code> being 
close to 100%, it means that all channels are backpressured.</li>
+      </ul>
+    </li>
+  </ul>
+
+  <p><br />
+The relation between <code>exclusiveBuffersUsage</code>, 
<code>floatingBuffersUsage</code>, and the upstream tasks’ 
<code>outPoolUsage</code> is summarised in the following table and extends on 
the table above with <code>inPoolUsage = floatingBuffersUsage + 
exclusiveBuffersUsage</code>:</p>
+
+  <center>
+<table class="tg">
+  <tr>
+    <th></th>
+    <th><code>exclusiveBuffersUsage</code> low</th>
+    <th><code>exclusiveBuffersUsage</code> high</th>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> low +<br />
+      <em>all</em> upstream <code>outPoolUsage</code> low</th>
+    <td class="tg-center"><span class="glyphicon glyphicon-ok-sign" 
aria-hidden="true" style="color:green;font-size:1.5em;"></span></td>
+    <td class="tg-center">-<sup>3</sup></td>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> low +<br />
+      <em>any</em> upstream <code>outPoolUsage</code> high</th>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" 
style="color:red;font-size:1.5em;"></span><br />
+      (potential network bottleneck)</td>
+    <td class="tg-center">-<sup>3</sup></td>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> high +<br />
+      <em>all</em> upstream <code>outPoolUsage</code> low</th>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" 
style="color:orange;font-size:1.5em;"></span><br />
+      (backpressure eventually appears on only some of the input channels)</td>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" 
style="color:orange;font-size:1.5em;"></span><br />
+      (backpressure eventually appears on most or all of the input 
channels)</td>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> high +<br />
+      <em>any</em> upstream <code>outPoolUsage</code> high</th>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" 
style="color:red;font-size:1.5em;"></span><br />
+      (backpressure on only some of the input channels)</td>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" 
style="color:red;font-size:1.5em;"></span><br />
+      (backpressure on most or all of the input channels)</td>
+  </tr>
+</table>
+</center>
+
+  <p><sup>3</sup> This should not happen.</p>
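+  <p>The same table, written as a lookup in Python. This is a sketch under 
assumptions: the function name is hypothetical and the caller decides what 
counts as "high" for each metric.</p>

```python
def interpret_input_buffers(floating_high: bool, exclusive_high: bool,
                            any_upstream_out_high: bool) -> str:
    """Interpret floating/exclusive buffer usage following the table above."""
    if not floating_high:
        if exclusive_high:
            # Footnote 3 in the table: this combination should not happen.
            return "unexpected combination"
        if any_upstream_out_high:
            return "potential network bottleneck"
        return "ok"

    # Floating buffers are exhausted: exclusive buffer usage tells us
    # how many input channels are affected.
    scope = "most or all" if exclusive_high else "only some"
    if any_upstream_out_high:
        return f"backpressure on {scope} of the input channels"
    return f"backpressure eventually appears on {scope} of the input channels"
```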
+
+</div>
+
+<h3 id="resource-usage--throughput">Resource Usage / Throughput</h3>
+
+<p>Besides the obvious use of each individual metric mentioned above, there 
are also a few combinations providing useful insight into what is happening in 
the network stack:</p>
+
+<ul>
+  <li>
+    <p>Low throughput with frequent <code>outPoolUsage</code> values around 
100% but low <code>inPoolUsage</code> on all receivers is an indicator that the 
round-trip-time of our credit-notification (depends on your network’s latency) 
is too high for the default number of exclusive buffers to make use of your 
bandwidth. Consider increasing the <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-network-memory-buffers-per-channel">buffers-per-c
 [...]
+  </li>
+  <li>
+    <p>Combining <code>numRecordsOut</code> and <code>numBytesOut</code> helps 
you identify average serialised record sizes, which supports capacity 
planning for peak scenarios.</p>
+  </li>
+  <li>
+    <p>If you want to reason about buffer fill rates and the influence of the 
output flusher, you may combine <code>numBytesInRemote</code> with 
<code>numBuffersInRemote</code>. When tuning for throughput (and not latency!), 
low buffer fill rates may indicate reduced network efficiency. In such cases, 
consider increasing the buffer timeout.
+Please note that, as of Flink 1.8 and 1.9, <code>numBuffersOut</code> only 
increases for buffers getting full or for an event cutting off a buffer (e.g. a 
checkpoint barrier) and may lag behind. Please also note that reasoning about 
buffer fill rates on local channels is unnecessary since buffering is an 
optimisation technique for remote channels with limited effect on local 
channels.</p>
+  </li>
+  <li>
+    <p>You may also separate local from remote traffic using 
<code>numBytesInLocal</code> and <code>numBytesInRemote</code> but in most 
cases this is unnecessary.</p>
+  </li>
+</ul>
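+<p>Two of the combinations above reduce to simple divisions. The Python 
sketch below uses hypothetical function names and assumes a network buffer 
size of 32 KiB; substitute the segment size configured for your cluster.</p>

```python
def average_record_size(num_bytes_out: int, num_records_out: int) -> float:
    """Average serialised record size in bytes for one subtask."""
    if num_records_out <= 0:
        raise ValueError("num_records_out must be positive")
    return num_bytes_out / num_records_out


def buffer_fill_rate(num_bytes_in_remote: int, num_buffers_in_remote: int,
                     buffer_size_bytes: int = 32 * 1024) -> float:
    """Average fraction of each remote buffer that carried data.

    buffer_size_bytes is an assumption (32 KiB); use your configured value.
    """
    if num_buffers_in_remote <= 0 or buffer_size_bytes <= 0:
        raise ValueError("buffer count and buffer size must be positive")
    return num_bytes_in_remote / (num_buffers_in_remote * buffer_size_bytes)
```

+<p>For example, 10 remote buffers carrying 163,840 bytes in total are half 
full on average with 32 KiB buffers.</p>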
+
+<div class="alert alert-info">
+  <h3 class="no_toc" 
id="span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-what-to-do-with-backpressure"><span
 class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> What to do 
with Backpressure?</h3>
+
+  <p>Assuming that you identified where the source of backpressure — a 
bottleneck — is located, the next step is to analyse why this is happening. 
Below, we list some potential causes of backpressure from the more basic to the 
more complex ones. We recommend checking the basic causes first, before diving 
deeper into the more complex ones and potentially drawing false conclusions.</p>
+
+  <p>Please also recall that backpressure might be temporary and the result of 
a load spike, checkpointing, or a job restart with a data backlog waiting to be 
processed. If backpressure is temporary, you should simply ignore it. 
Alternatively, keep in mind that the process of analysing and solving the issue 
can be affected by the intermittent nature of your bottleneck. Having said 
that, here are a couple of things to check.</p>
+
+  <h4 id="system-resources">System Resources</h4>
+
+  <p>Firstly, you should check the incriminated machines’ basic resource usage 
like CPU, network, or disk I/O. If some resource is fully or heavily utilised 
you can do one of the following:</p>
+
+  <ol>
+    <li>Try to optimise your code. Code profilers are helpful in this 
case.</li>
+    <li>Tune Flink for that specific resource.</li>
+    <li>Scale out by increasing the parallelism and/or increasing the number 
of machines in the cluster.</li>
+  </ol>
+
+  <h4 id="garbage-collection">Garbage Collection</h4>
+
+  <p>Oftentimes, performance issues arise from long GC pauses. You can verify 
whether you are in such a situation by either printing debug GC logs (via 
<code>-XX:+PrintGCDetails</code>) or by using some memory/GC profilers. Since 
dealing with GC issues is highly application-dependent and independent of 
Flink, we will not go into details here (<a 
href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/index.html">Oracle’s
 Garbage Collection Tuning Guide</a> or <a href="ht [...]
+
+  <h4 id="cputhread-bottleneck">CPU/Thread Bottleneck</h4>
+
+  <p>Sometimes a CPU bottleneck might not be visible at first glance if one or 
a couple of threads are causing the CPU bottleneck while the CPU usage of the 
overall machine remains relatively low. For instance, a single CPU-bottlenecked 
thread on a 48-core machine would result in only 2% CPU use. Consider using 
code profilers for this as they can identify hot threads by showing each 
thread’s CPU usage, for example.</p>
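+  <p>The arithmetic behind the 48-core example can be checked with a short 
sketch. The function name <code>machine_cpu_percent</code> is hypothetical and 
only illustrates why a single saturated thread barely registers in machine-wide 
CPU usage.</p>

```python
def machine_cpu_percent(saturated_threads: int, cores: int) -> float:
    """Machine-wide CPU usage in percent when only some threads are saturated.

    Assumes every saturated thread fully occupies exactly one core and all
    other cores are idle (a simplification for illustration only).
    """
    if cores >= 1:
        return 100.0 * min(saturated_threads, cores) / cores
    raise ValueError("cores must be positive")


# One CPU-bottlenecked thread on a 48-core machine:
print(round(machine_cpu_percent(1, 48), 1))
```

+  <p>A machine-wide figure of roughly 2% therefore does not rule out a CPU 
bottleneck; per-thread usage is the number to look at.</p>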
+
+  <h4 id="thread-contention">Thread Contention</h4>
+
+  <p>Similarly to the CPU/thread bottleneck issue above, a subtask may be 
bottlenecked due to high thread contention on shared resources. Again, CPU 
profilers are your best friend here! Consider looking for synchronisation 
overhead / lock contention in user code — although adding synchronisation in 
user code should be avoided and may even be dangerous! Also consider 
investigating shared system resources. The default JVM’s SSL implementation, 
for example, can become contended around the s [...]
+
+  <h4 id="load-imbalance">Load Imbalance</h4>
+
+  <p>If your bottleneck is caused by data skew, you can try to remove it or 
mitigate its impact by changing the data partitioning to separate heavy keys or 
by implementing local/pre-aggregation.</p>
+
+  <p><br />
+This list is far from exhaustive. Generally, in order to reduce a bottleneck 
and thus backpressure, first analyse where it is happening and then find out 
why. The best place to start reasoning about the “why” is by checking what 
resources are fully utilised.</p>
+</div>
+
+<h3 id="latency-tracking">Latency Tracking</h3>
+
+<p>Tracking latencies at the various locations where they may occur is a topic of 
its own. In this section, we will focus on the time records wait inside Flink’s 
network stack — including the system’s network connections. In low throughput 
scenarios, these latencies are influenced directly by the output flusher via 
the buffer timeout parameter or indirectly by any application code latencies. 
When processing a record takes longer than expected or when (multiple) timers 
fire at the same time — a [...]
+
+<p>Flink offers some support for <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#latency-tracking">tracking
 the latency</a> of records passing through the system (outside of user code). 
However, this is disabled by default (see below why!) and must be enabled by 
setting a latency tracking interval either in Flink’s <a 
href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#metrics-latency-interval">configuration
 via < [...]
+
+<ul>
+  <li><code>single</code>: one histogram for each operator subtask</li>
+  <li><code>operator</code> (default): one histogram for each combination of 
source task and operator subtask</li>
+  <li><code>subtask</code>: one histogram for each combination of source 
subtask and operator subtask (quadratic in the parallelism!)</li>
+</ul>
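+<p>To get a feeling for how quickly the number of histograms grows with each 
granularity, the three definitions above can be turned into a small calculation. 
This is only a sketch: the function <code>latency_histograms</code> and its 
parameters are hypothetical, and it assumes that all sources run with the same 
parallelism.</p>

```python
def latency_histograms(granularity: str, sources: int,
                       source_parallelism: int, operator_subtasks: int) -> int:
    """Count latency histograms for a job, following the list above.

    sources            -- number of source tasks (assumed to share one parallelism)
    source_parallelism -- subtasks per source task
    operator_subtasks  -- total number of operator subtasks being measured
    """
    if granularity == "single":
        # one histogram per operator subtask
        return operator_subtasks
    if granularity == "operator":
        # one per (source task, operator subtask) combination
        return sources * operator_subtasks
    if granularity == "subtask":
        # one per (source subtask, operator subtask) combination
        return sources * source_parallelism * operator_subtasks
    raise ValueError("unknown granularity: " + granularity)


# Example: 2 sources with parallelism 100 and 500 operator subtasks in total.
for g in ("single", "operator", "subtask"):
    print(g, latency_histograms(g, 2, 100, 500))
```

+<p>In this made-up job, <code>subtask</code> granularity needs 100,000 
histograms where <code>single</code> needs 500.</p>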
+
+<p>These metrics are collected through special “latency markers”: each source 
subtask will periodically emit a special record containing the timestamp of its 
creation. The latency markers then flow alongside normal records while not 
overtaking them on the wire or inside a buffer queue. However, <em>a latency 
marker does not enter application logic</em> and overtakes records there. 
Latency markers therefore only measure the waiting time between the user code 
and not a full “end-to-end [...]
+
+<p>Since <code>LatencyMarkers</code> sit in network buffers just like normal 
records, they will also wait for the buffer to be full or flushed due to buffer 
timeouts. When a channel is under high load, buffering data in the network 
stack adds no latency. However, as soon as one channel is under low load, 
records and latency markers will experience an expected average delay of at 
most <code>buffer_timeout / 2</code>. This delay will add to each network 
connection towards a subtask and sh [...]
+
+<p>By looking at the exposed latency tracking metrics for each subtask, for 
example at the 95th percentile, you should nevertheless be able to identify 
subtasks which are adding substantially to the overall source-to-sink latency 
and continue with optimising there.</p>
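+<p>The effect of the buffer timeout under low load can be estimated with a 
back-of-the-envelope calculation. The sketch below assumes the bound of 
<code>buffer_timeout / 2</code> per network connection mentioned above and 
simply adds it up along a path; the function name and the 3-connection example 
are hypothetical, and 100 ms is used as the buffer timeout.</p>

```python
def max_expected_buffer_delay_ms(buffer_timeout_ms: float,
                                 network_connections: int) -> float:
    """Upper bound on the expected buffering delay along one path.

    Under low load, each network connection on the path is assumed to add
    an average delay of at most buffer_timeout / 2.
    """
    if network_connections >= 0:
        return network_connections * buffer_timeout_ms / 2.0
    raise ValueError("network_connections must not be negative")


# A path crossing 3 network connections with a 100 ms buffer timeout:
print(max_expected_buffer_delay_ms(100, 3))
```

+<p>This only covers the time spent waiting for buffers to be flushed; time 
spent in user code is not part of it.</p>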
+
+<div class="alert alert-info">
+  <p><span class="label label-info" style="display: inline-block"><span 
class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
+Flink’s latency markers assume that the clocks on all machines in the cluster 
are in sync. We recommend setting up an automated clock synchronisation service 
(like NTP) to avoid false latency results.</p>
+</div>
+
+<div class="alert alert-warning">
+  <p><span class="label label-warning" style="display: inline-block"><span 
class="glyphicon glyphicon-warning-sign" aria-hidden="true"></span> 
Warning</span>
+Enabling latency metrics can significantly impact the performance of the 
cluster (in particular for <code>subtask</code> granularity) due to the sheer 
amount of metrics being added as well as the use of histograms which are quite 
expensive to maintain. It is highly recommended to only use them for debugging 
purposes.</p>
+</div>
+
+<h2 id="conclusion">Conclusion</h2>
+
+<p>In the previous sections we discussed how to monitor Flink’s network stack 
which primarily involves identifying backpressure: where it occurs, where it 
originates from, and (potentially) why it occurs. This can be done in two 
ways: for simple cases and debugging sessions by using the backpressure 
monitor; for continuous monitoring, more in-depth analysis, and less runtime 
overhead by using Flink’s task and network stack metrics. Backpressure can be 
caused by the network layer itse [...]
+
+<p>Stay tuned for the third blog post in the series of network stack posts 
that will focus on tuning techniques and anti-patterns to avoid.</p>
+
+
+      </article>
+    </div>
+
+    <div class="row">
+      <div id="disqus_thread"></div>
+      <script type="text/javascript">
+        /* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE 
* * */
+        var disqus_shortname = 'stratosphere-eu'; // required: replace example 
with your forum shortname
+
+        /* * * DON'T EDIT BELOW THIS LINE * * */
+        (function() {
+            var dsq = document.createElement('script'); dsq.type = 
'text/javascript'; dsq.async = true;
+            dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
+             (document.getElementsByTagName('head')[0] || 
document.getElementsByTagName('body')[0]).appendChild(dsq);
+        })();
+      </script>
+    </div>
+  </div>
+</div>
+      </div>
+    </div>
+
+    <hr />
+
+    <div class="row">
+      <div class="footer text-center col-sm-12">
+        <p>Copyright © 2014-2019 <a href="http://apache.org">The Apache 
Software Foundation</a>. All Rights Reserved.</p>
+        <p>Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache 
feather logo are either registered trademarks or trademarks of The Apache 
Software Foundation.</p>
+        <p><a href="/privacy-policy.html">Privacy Policy</a> &middot; <a 
href="/blog/feed.xml">RSS feed</a></p>
+      </div>
+    </div>
+    </div><!-- /.container -->
+
+    <!-- Include all compiled plugins (below), or include individual files as 
needed -->
+    <script 
src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js"></script>
+    <script 
src="https://cdnjs.cloudflare.com/ajax/libs/jquery.matchHeight/0.7.0/jquery.matchHeight-min.js"></script>
+    <script src="/js/codetabs.js"></script>
+    <script src="/js/stickysidebar.js"></script>
+
+    <!-- Google Analytics -->
+    <script>
+      
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new 
Date();a=s.createElement(o),
+      
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+      
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+      ga('create', 'UA-52545728-1', 'auto');
+      ga('send', 'pageview');
+    </script>
+  </body>
+</html>
diff --git a/content/blog/feed.xml b/content/blog/feed.xml
index ebfe803..cc8dfe3 100644
--- a/content/blog/feed.xml
+++ b/content/blog/feed.xml
@@ -7,6 +7,366 @@
 <atom:link href="https://flink.apache.org/blog/feed.xml" rel="self" 
type="application/rss+xml" />
 
 <item>
+<title>Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure 
Thing</title>
+<description>&lt;style type=&quot;text/css&quot;&gt;
+.tg  {border-collapse:collapse;border-spacing:0;}
+.tg td{padding:10px 
10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
+.tg th{padding:10px 
10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;background-color:#eff0f1;}
+.tg .tg-wide{padding:10px 30px;}
+.tg .tg-top{vertical-align:top}
+.tg .tg-topcenter{text-align:center;vertical-align:top}
+.tg .tg-center{text-align:center;vertical-align:middle}
+&lt;/style&gt;
+
+&lt;p&gt;In a &lt;a 
href=&quot;/2019/06/05/flink-network-stack.html&quot;&gt;previous blog 
post&lt;/a&gt;, we presented how Flink’s network stack works from the 
high-level abstractions to the low-level details. This second blog post in the 
series of network stack posts extends on this knowledge and discusses 
monitoring network-related metrics to identify effects such as backpressure or 
bottlenecks in throughput and latency. Although this post briefly covers what 
to do with backpressure,  [...]
+
+&lt;div class=&quot;page-toc&quot;&gt;
+&lt;ul id=&quot;markdown-toc&quot;&gt;
+  &lt;li&gt;&lt;a href=&quot;#monitoring&quot; 
id=&quot;markdown-toc-monitoring&quot;&gt;Monitoring&lt;/a&gt;    &lt;ul&gt;
+      &lt;li&gt;&lt;a href=&quot;#backpressure-monitor&quot; 
id=&quot;markdown-toc-backpressure-monitor&quot;&gt;Backpressure 
Monitor&lt;/a&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;#network-metrics&quot; 
id=&quot;markdown-toc-network-metrics&quot;&gt;Network Metrics&lt;/a&gt;    
&lt;ul&gt;
+      &lt;li&gt;&lt;a href=&quot;#backpressure&quot; 
id=&quot;markdown-toc-backpressure&quot;&gt;Backpressure&lt;/a&gt;&lt;/li&gt;
+      &lt;li&gt;&lt;a href=&quot;#resource-usage--throughput&quot; 
id=&quot;markdown-toc-resource-usage--throughput&quot;&gt;Resource Usage / 
Throughput&lt;/a&gt;&lt;/li&gt;
+      &lt;li&gt;&lt;a href=&quot;#latency-tracking&quot; 
id=&quot;markdown-toc-latency-tracking&quot;&gt;Latency 
Tracking&lt;/a&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;#conclusion&quot; 
id=&quot;markdown-toc-conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;/div&gt;
+
+&lt;h2 id=&quot;monitoring&quot;&gt;Monitoring&lt;/h2&gt;
+
+&lt;p&gt;Probably the most important part of network monitoring is &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html&quot;&gt;monitoring
 backpressure&lt;/a&gt;, a situation where a system is receiving data at a 
higher rate than it can process¹. Such behaviour will result in the sender 
being backpressured and may be caused by two things:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;The receiver is slow.&lt;br /&gt;
+This can happen because the receiver is backpressured itself, is unable to 
keep processing at the same rate as the sender, or is temporarily blocked by 
garbage collection, lack of system resources, or I/O.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;The network channel is slow.&lt;br /&gt;
+  Even though in such case the receiver is not (directly) involved, we call 
the sender backpressured due to a potential oversubscription on network 
bandwidth shared by all subtasks running on the same machine. Beware that, in 
addition to Flink’s network stack, there may be more network users, such as 
sources and sinks, distributed file systems (checkpointing, network-attached 
storage), logging, and metrics. A previous &lt;a 
href=&quot;https://www.ververica.com/blog/how-to-size-your-apach [...]
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;sup&gt;1&lt;/sup&gt; In case you are unfamiliar with backpressure 
and how it interacts with Flink, we recommend reading through &lt;a 
href=&quot;https://www.ververica.com/blog/how-flink-handles-backpressure&quot;&gt;this
 blog post on backpressure&lt;/a&gt; from 2015.&lt;/p&gt;
+
+&lt;p&gt;&lt;br /&gt;
+If backpressure occurs, it will bubble upstream and eventually reach your 
sources and slow them down. This is not a bad thing per se and merely states 
that you lack resources for the current load. However, you may want to improve 
your job so that it can cope with higher loads without using more resources. In 
order to do so, you need to find (1) where (at which task/operator) the 
bottleneck is and (2) what is causing it. Flink offers two mechanisms for 
identifying where the bottleneck is: [...]
+
+&lt;ul&gt;
+  &lt;li&gt;directly via Flink’s web UI and its backpressure monitor, 
or&lt;/li&gt;
+  &lt;li&gt;indirectly through some of the network metrics.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Flink’s web UI is likely the first entry point for quick 
troubleshooting but has some disadvantages that we will explain below. On the 
other hand, Flink’s network metrics are better suited for continuous monitoring 
and reasoning about the exact nature of the bottleneck causing backpressure. We 
will cover both in the sections below. In both cases, you need to identify the 
origin of backpressure from the sources to the sinks. Your starting point for 
the current and future invest [...]
+
+&lt;h3 id=&quot;backpressure-monitor&quot;&gt;Backpressure Monitor&lt;/h3&gt;
+
+&lt;p&gt;The &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html&quot;&gt;backpressure
 monitor&lt;/a&gt; is only exposed via Flink’s web UI². Since it’s an active 
component that is only triggered on request, it is currently not available via 
metrics. The backpressure monitor samples the running tasks’ threads on all 
TaskManagers via &lt;code&gt;Thread.getStackTrace()&lt;/code&gt; and computes 
the number of samples where tasks were bl [...]
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;span style=&quot;color:green&quot;&gt;OK&lt;/span&gt; for 
&lt;code&gt;ratio ≤ 0.10&lt;/code&gt;,&lt;/li&gt;
+  &lt;li&gt;&lt;span style=&quot;color:orange&quot;&gt;LOW&lt;/span&gt; for 
&lt;code&gt;0.10 &amp;lt; ratio ≤ 0.5&lt;/code&gt;, and&lt;/li&gt;
+  &lt;li&gt;&lt;span style=&quot;color:red&quot;&gt;HIGH&lt;/span&gt; for 
&lt;code&gt;0.5 &amp;lt; ratio ≤ 1&lt;/code&gt;.&lt;/li&gt;
+&lt;/ul&gt;
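+&lt;p&gt;The thresholds above can be written down as a tiny classifier. This is 
only a sketch of the mapping from ratio to status as listed here, not Flink’s 
own code; the function name &lt;code&gt;backpressure_status&lt;/code&gt; is 
hypothetical.&lt;/p&gt;

```python
def backpressure_status(ratio: float) -> str:
    """Map a sampled backpressure ratio (0.0 to 1.0) to OK, LOW, or HIGH."""
    if ratio > 0.5:
        return "HIGH"
    if ratio > 0.10:
        return "LOW"
    return "OK"


for sample in (0.05, 0.10, 0.3, 0.5, 0.8):
    print(sample, backpressure_status(sample))
```

+&lt;p&gt;Note that the boundaries are inclusive on the lower status: a ratio 
of exactly 0.10 is still OK and a ratio of exactly 0.5 is still LOW.&lt;/p&gt;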
+
+&lt;p&gt;Although you can tune things like the refresh-interval, the number of 
samples, or the delay between samples, normally, you would not need to touch 
these since the defaults already give good-enough results.&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img 
src=&quot;/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png&quot;
 width=&quot;600px&quot; alt=&quot;Backpressure sampling:high&quot; /&gt;
+&lt;/center&gt;
+
+&lt;p&gt;&lt;sup&gt;2&lt;/sup&gt; You may also access the backpressure monitor 
via the REST API: 
&lt;code&gt;/jobs/:jobid/vertices/:vertexid/backpressure&lt;/code&gt;&lt;/p&gt;
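+&lt;p&gt;A script could poll this endpoint along the following lines. This is 
a minimal sketch under assumptions: the base URL 
&lt;code&gt;http://localhost:8081&lt;/code&gt;, the job and vertex IDs, and 
both function names are placeholders, and the structure of the JSON response 
is not interpreted here because it should be taken from the REST API 
documentation of your Flink version.&lt;/p&gt;

```python
import json
from urllib.request import urlopen


def backpressure_url(base_url: str, job_id: str, vertex_id: str) -> str:
    """Build the REST path /jobs/:jobid/vertices/:vertexid/backpressure."""
    return f"{base_url.rstrip('/')}/jobs/{job_id}/vertices/{vertex_id}/backpressure"


def fetch_backpressure(base_url: str, job_id: str, vertex_id: str) -> dict:
    """Request one backpressure sample and return the parsed JSON document.

    Needs a reachable JobManager; the response layout is left uninterpreted.
    """
    with urlopen(backpressure_url(base_url, job_id, vertex_id), timeout=10) as response:
        return json.load(response)


# Placeholder IDs, for illustration only:
print(backpressure_url("http://localhost:8081/", "my-job-id", "my-vertex-id"))
```

+&lt;p&gt;Since the monitor is only triggered on request, the first call may 
merely start the sampling, so a caller may have to repeat the request after a 
short wait.&lt;/p&gt;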
+
+&lt;p&gt;&lt;br /&gt;
+The backpressure monitor can help you find where (at which task/operator) 
backpressure originates. However, it does not support you in further 
reasoning about the causes of it. Additionally, for larger jobs or higher 
parallelism, the backpressure monitor becomes too crowded to use and may also 
take some time to gather all information from all TaskManagers. Please also 
note that sampling may affect your running job’s performance.&lt;/p&gt;
+
+&lt;h2 id=&quot;network-metrics&quot;&gt;Network Metrics&lt;/h2&gt;
+
+&lt;p&gt;&lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#network&quot;&gt;Network&lt;/a&gt;
 and &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#io&quot;&gt;task
 I/O&lt;/a&gt; metrics are more lightweight than the backpressure monitor and 
are continuously published for each running job. We can leverage those and get 
even more insights, not only for backpressure monitoring. The most re [...]
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:orange&quot;&gt;up to 
Flink 1.8:&lt;/span&gt;&lt;/strong&gt; &lt;code&gt;outPoolUsage&lt;/code&gt;, 
&lt;code&gt;inPoolUsage&lt;/code&gt;&lt;br /&gt;
+An estimate of the ratio of buffers used vs. buffers available in the 
respective local buffer pools.
+While interpreting &lt;code&gt;inPoolUsage&lt;/code&gt; in Flink 1.5 - 1.8 
with credit-based flow control, please note that this only relates to floating 
buffers (exclusive buffers are not part of the pool).&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 1.9 
and above:&lt;/span&gt;&lt;/strong&gt; &lt;code&gt;outPoolUsage&lt;/code&gt;, 
&lt;code&gt;inPoolUsage&lt;/code&gt;, 
&lt;code&gt;floatingBuffersUsage&lt;/code&gt;, 
&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;&lt;br /&gt;
+An estimate of the ratio of buffers used vs. buffers available in the 
respective local buffer pools.
+Starting with Flink 1.9, &lt;code&gt;inPoolUsage&lt;/code&gt; is the sum of 
&lt;code&gt;floatingBuffersUsage&lt;/code&gt; and 
&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;code&gt;numRecordsOut&lt;/code&gt;, 
&lt;code&gt;numRecordsIn&lt;/code&gt;&lt;br /&gt;
+Each metric comes with two scopes: one scoped to the operator and one scoped 
to the subtask. For network monitoring, the subtask-scoped metric is relevant 
and shows the total number of records it has sent/received. You may need to 
further look into these figures to extract the number of records within a 
certain time span or use the equivalent &lt;code&gt;…PerSecond&lt;/code&gt; 
metrics.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;code&gt;numBytesOut&lt;/code&gt;, 
&lt;code&gt;numBytesInLocal&lt;/code&gt;, 
&lt;code&gt;numBytesInRemote&lt;/code&gt;&lt;br /&gt;
+The total number of bytes this subtask has emitted or read from a local/remote 
source. These are also available as meters via 
&lt;code&gt;…PerSecond&lt;/code&gt; metrics.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;code&gt;numBuffersOut&lt;/code&gt;, 
&lt;code&gt;numBuffersInLocal&lt;/code&gt;, 
&lt;code&gt;numBuffersInRemote&lt;/code&gt;&lt;br /&gt;
+Similar to &lt;code&gt;numBytes…&lt;/code&gt; but counting the number of 
network buffers.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
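+&lt;p&gt;Turning a running total such as 
&lt;code&gt;numRecordsOut&lt;/code&gt; into a rate for a given time span only 
needs two samples of the counter. The sketch below assumes you collected the 
two values yourself, for example from your metrics reporter; the function name 
and the sample numbers are made up.&lt;/p&gt;

```python
def records_per_second(count_start: int, count_end: int, seconds: float) -> float:
    """Average rate between two samples of a monotonically growing counter."""
    if seconds > 0:
        return (count_end - count_start) / seconds
    raise ValueError("seconds must be positive")


# numRecordsOut sampled at 1,000 and again 60 seconds later at 4,000:
print(records_per_second(1000, 4000, 60))
```

+&lt;p&gt;The same calculation applies to the 
&lt;code&gt;numBytes…&lt;/code&gt; and &lt;code&gt;numBuffers…&lt;/code&gt; 
counters.&lt;/p&gt;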
+
+&lt;div class=&quot;alert alert-warning&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-warning&quot; style=&quot;display: 
inline-block&quot;&gt;&lt;span class=&quot;glyphicon 
glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; 
Warning&lt;/span&gt;
+For the sake of completeness and since they have been used in the past, we 
will briefly look at the &lt;code&gt;outputQueueLength&lt;/code&gt; and 
&lt;code&gt;inputQueueLength&lt;/code&gt; metrics. These are somewhat similar 
to the &lt;code&gt;[out,in]PoolUsage&lt;/code&gt; metrics but show the number 
of buffers sitting in a sender subtask’s output queues and in a receiver 
subtask’s input queues, respectively. Reasoning about absolute numbers of 
buffers, however, is difficult and there i [...]
+
+  &lt;p&gt;Overall, &lt;strong&gt;we discourage the use of&lt;/strong&gt; 
&lt;code&gt;outputQueueLength&lt;/code&gt; &lt;strong&gt;and&lt;/strong&gt; 
&lt;code&gt;inputQueueLength&lt;/code&gt; because their interpretation highly 
depends on the current parallelism of the operator and the configured numbers 
of exclusive and floating buffers. Instead, we recommend using the various 
&lt;code&gt;*PoolUsage&lt;/code&gt; metrics which even reveal more detailed 
insight.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-info&quot; style=&quot;display: 
inline-block&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-info-sign&quot; 
aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Note&lt;/span&gt;
+ If you reason about buffer usage, please keep the following in mind:&lt;/p&gt;
+
+  &lt;ul&gt;
+    &lt;li&gt;Any outgoing channel which has been used at least once will 
always occupy one buffer (since Flink 1.5).
+      &lt;ul&gt;
+        &lt;li&gt;&lt;strong&gt;&lt;span style=&quot;color:orange&quot;&gt;up 
to Flink 1.8:&lt;/span&gt;&lt;/strong&gt; This buffer (even if empty!) was 
always counted as a backlog of 1 and thus receivers tried to reserve a floating 
buffer for it.&lt;/li&gt;
+        &lt;li&gt;&lt;strong&gt;&lt;span 
style=&quot;color:green&quot;&gt;Flink 1.9 and 
above:&lt;/span&gt;&lt;/strong&gt; A buffer is only counted in the backlog if 
it is ready for consumption, i.e. it is full or was flushed (see 
FLINK-11082)&lt;/li&gt;
+      &lt;/ul&gt;
+    &lt;/li&gt;
+    &lt;li&gt;The receiver will only release a received buffer after 
deserialising the last record in it.&lt;/li&gt;
+  &lt;/ul&gt;
+&lt;/div&gt;
+
+&lt;p&gt;The following sections make use of and combine these metrics to 
reason about backpressure and resource usage / efficiency with respect to 
throughput. A separate section will detail latency related metrics.&lt;/p&gt;
+
+&lt;h3 id=&quot;backpressure&quot;&gt;Backpressure&lt;/h3&gt;
+
+&lt;p&gt;Backpressure may be indicated by two different sets of metrics: 
(local) buffer pool usages as well as input/output queue lengths. They provide 
a different level of granularity but, unfortunately, none of these are 
exhaustive and there is room for interpretation. Because of the inherent 
problems with interpreting these queue lengths we will focus on the usage of 
input and output pools below which also provides more detail.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;If a subtask’s&lt;/strong&gt; 
&lt;code&gt;outPoolUsage&lt;/code&gt; &lt;strong&gt;is 100%&lt;/strong&gt;, it 
is backpressured. Whether the subtask is already blocking or still writing 
records into network buffers depends on how full the buffers are that the 
&lt;code&gt;RecordWriters&lt;/code&gt; are currently writing into.&lt;br /&gt;
+&lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; 
aria-hidden=&quot;true&quot; style=&quot;color:orange;&quot;&gt;&lt;/span&gt; 
This is different to what the backpressure monitor is showing!&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;An &lt;code&gt;inPoolUsage&lt;/code&gt; of 100% means that all 
floating buffers are assigned to channels and eventually backpressure will be 
exercised upstream. These floating buffers are in either of the following 
conditions: they are reserved for future use on a channel due to an exclusive 
buffer being utilised (remote input channels always try to maintain 
&lt;code&gt;#exclusive buffers&lt;/code&gt; credits), they are reserved for a 
sender’s backlog and wait for data, they [...]
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:orange&quot;&gt;up to 
Flink 1.8:&lt;/span&gt;&lt;/strong&gt; Due to &lt;a 
href=&quot;https://issues.apache.org/jira/browse/FLINK-11082&quot;&gt;FLINK-11082&lt;/a&gt;,
 an &lt;code&gt;inPoolUsage&lt;/code&gt; of 100% is quite common even in normal 
situations.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 1.9 
and above:&lt;/span&gt;&lt;/strong&gt; If &lt;code&gt;inPoolUsage&lt;/code&gt; 
is constantly around 100%, this is a strong indicator for exercising 
backpressure upstream.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;The following table summarises all combinations and their 
interpretation. Bear in mind, though, that backpressure may be minor or 
temporary (no need to look into it), on particular channels only, or caused by 
other JVM processes on a particular TaskManager, such as GC, synchronisation, 
I/O, resource shortage, instead of a specific subtask.&lt;/p&gt;
+
+&lt;center&gt;
+&lt;table class=&quot;tg&quot;&gt;
+  &lt;tr&gt;
+    &lt;th&gt;&lt;/th&gt;
+    &lt;th 
class=&quot;tg-center&quot;&gt;&lt;code&gt;outPoolUsage&lt;/code&gt; 
low&lt;/th&gt;
+    &lt;th 
class=&quot;tg-center&quot;&gt;&lt;code&gt;outPoolUsage&lt;/code&gt; 
high&lt;/th&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot;&gt;&lt;code&gt;inPoolUsage&lt;/code&gt; 
low&lt;/th&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-ok-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:green;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;/td&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressured, temporary situation: upstream is not backpressured yet 
or not anymore)&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; rowspan=&quot;2&quot;&gt;
+      &lt;code&gt;inPoolUsage&lt;/code&gt; high&lt;br /&gt;
+      (&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 
1.9+&lt;/span&gt;&lt;/strong&gt;)&lt;/th&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;
+      if all upstream tasks’ &lt;code&gt;outPoolUsage&lt;/code&gt; are low: 
&lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (may eventually cause backpressure)&lt;/td&gt;
+    &lt;td class=&quot;tg-topcenter&quot; rowspan=&quot;2&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressured by downstream task(s) or network, probably forwarding 
backpressure upstream)&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;if any upstream 
task’s &lt;code&gt;outPoolUsage&lt;/code&gt; is high: &lt;span 
class=&quot;glyphicon glyphicon-remove-sign&quot; aria-hidden=&quot;true&quot; 
style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (may exercise backpressure upstream and may be the source of 
backpressure)&lt;/td&gt;
+  &lt;/tr&gt;
+&lt;/table&gt;
+&lt;/center&gt;
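+&lt;p&gt;For alerting or dashboards, the table above can be encoded as a small 
decision function. This is a sketch under assumptions: the function name is 
hypothetical, usages are given as values between 0.0 and 1.0, and the cut-off 
of 0.9 between “low” and “high” is an arbitrary choice because the table does 
not define a numeric threshold.&lt;/p&gt;

```python
def interpret_pool_usage(in_pool, out_pool, upstream_out_pools, high=0.9):
    """Return the table's interpretation for one subtask (Flink 1.9+ semantics).

    in_pool, out_pool   -- this subtask's inPoolUsage and outPoolUsage
    upstream_out_pools  -- outPoolUsage values of all upstream subtasks
    high                -- assumed cut-off between low and high usage
    """
    in_high = in_pool >= high
    out_high = out_pool >= high
    if in_high and out_high:
        return ("backpressured by downstream task(s) or network, "
                "probably forwarding backpressure upstream")
    if out_high:
        return ("backpressured, temporary situation: "
                "upstream is not backpressured yet or not anymore")
    if in_high:
        if any(usage >= high for usage in upstream_out_pools):
            return ("may exercise backpressure upstream "
                    "and may be the source of backpressure")
        return "may eventually cause backpressure"
    return "ok"


print(interpret_pool_usage(1.0, 0.2, [1.0, 0.3]))
```

+&lt;p&gt;The third case is the interesting one when hunting for the origin of 
backpressure: a full input pool, a relaxed output pool, and a backpressured 
upstream task.&lt;/p&gt;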
+
+&lt;p&gt;&lt;br /&gt;
+We may even reason more about the cause of backpressure by looking at the 
network metrics of the subtasks of two consecutive tasks:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;If all subtasks of the receiver task have low 
&lt;code&gt;inPoolUsage&lt;/code&gt; values and any upstream subtask’s 
&lt;code&gt;outPoolUsage&lt;/code&gt; is high, then there may be a network 
bottleneck causing backpressure.
+Since network is a shared resource among all subtasks of a TaskManager, this 
may not directly originate from this subtask, but rather from various 
concurrent operations, e.g. checkpoints, other streams, external connections, 
or other TaskManagers/processes on the same machine.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Backpressure can also be caused by all parallel instances of a task 
or by a single task instance. The first usually happens because the task is 
performing some time consuming operation that applies to all input partitions. 
The latter is usually the result of some kind of skew, either data skew or 
resource availability/allocation skew. In either case, you can find some hints 
on how to handle such situations in the &lt;a 
href=&quot;#span-classlabel-label-info-styledisplay-inline-b [...]
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;h3 class=&quot;no_toc&quot; 
id=&quot;span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-flink-19-and-above&quot;&gt;&lt;span
 class=&quot;glyphicon glyphicon-info-sign&quot; 
aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Flink 1.9 and above&lt;/h3&gt;
+
+  &lt;ul&gt;
+    &lt;li&gt;If &lt;code&gt;floatingBuffersUsage&lt;/code&gt; is not 100%, it 
is unlikely that there is backpressure. If it is 100% and any upstream task is 
backpressured, it suggests that this input is exercising backpressure on either 
a single, some or all input channels. To differentiate between those three 
situations you can use &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;:
+      &lt;ul&gt;
+        &lt;li&gt;Assuming that &lt;code&gt;floatingBuffersUsage&lt;/code&gt; 
is around 100%, the higher the &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; 
the more input channels are backpressured. In an extreme case of 
&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; being close to 100%, it means 
that all channels are backpressured.&lt;/li&gt;
+      &lt;/ul&gt;
+    &lt;/li&gt;
+  &lt;/ul&gt;
+
+  &lt;p&gt;&lt;br /&gt;
+The relation between &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;, 
&lt;code&gt;floatingBuffersUsage&lt;/code&gt;, and the upstream tasks’ 
&lt;code&gt;outPoolUsage&lt;/code&gt; is summarised in the following table and 
extends on the table above with &lt;code&gt;inPoolUsage = floatingBuffersUsage 
+ exclusiveBuffersUsage&lt;/code&gt;:&lt;/p&gt;
+
+  &lt;center&gt;
+&lt;table class=&quot;tg&quot;&gt;
+  &lt;tr&gt;
+    &lt;th&gt;&lt;/th&gt;
+    &lt;th&gt;&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; low&lt;/th&gt;
+    &lt;th&gt;&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; high&lt;/th&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; low +&lt;br /&gt;
+      &lt;em&gt;all&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; 
low&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;&lt;span class=&quot;glyphicon 
glyphicon-ok-sign&quot; aria-hidden=&quot;true&quot; 
style=&quot;color:green;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;-&lt;sup&gt;3&lt;/sup&gt;&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; low +&lt;br /&gt;
+      &lt;em&gt;any&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; 
high&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (potential network bottleneck)&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;-&lt;sup&gt;3&lt;/sup&gt;&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; high +&lt;br /&gt;
+      &lt;em&gt;all&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; 
low&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure eventually appears on only some of the input 
channels)&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure eventually appears on most or all of the input 
channels)&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; high +&lt;br /&gt;
+      &lt;em&gt;any&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; high&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure on only some of the input channels)&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; 
aria-hidden=&quot;true&quot; 
style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure on most or all of the input channels)&lt;/td&gt;
+  &lt;/tr&gt;
+&lt;/table&gt;
+&lt;/center&gt;
+
+  &lt;p&gt;&lt;sup&gt;3&lt;/sup&gt; this should not happen&lt;/p&gt;
+
+&lt;/div&gt;
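The table above can be read as a small decision function. The following Python sketch is only an illustration: `diagnose_input_gate` and its boolean parameters are hypothetical names, and what counts as "high" or "low" usage is left to the caller because the table does not fix any thresholds.

```python
def diagnose_input_gate(floating_high, exclusive_high, any_upstream_out_pool_high):
    """Return the table's verdict for one receiving subtask.

    floating_high / exclusive_high: whether floatingBuffersUsage and
    exclusiveBuffersUsage are considered high (threshold chosen by the caller).
    any_upstream_out_pool_high: whether any upstream outPoolUsage is high.
    """
    if not floating_high:
        if exclusive_high:
            # Footnote 3 in the table: this combination should not happen.
            return "unexpected state (should not happen)"
        if any_upstream_out_pool_high:
            return "potential network bottleneck"
        return "ok"

    scope = "most or all" if exclusive_high else "only some"
    if any_upstream_out_pool_high:
        return f"backpressure on {scope} of the input channels"
    return f"backpressure eventually appears on {scope} of the input channels"
```

Passing booleans rather than raw percentages keeps the threshold decision outside the function, so the same mapping can be reused with whatever cut-off suits a given job.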
+
+&lt;h3 id=&quot;resource-usage--throughput&quot;&gt;Resource Usage / 
Throughput&lt;/h3&gt;
+
+&lt;p&gt;Besides the obvious use of each individual metric mentioned above, 
there are also a few combinations providing useful insight into what is 
happening in the network stack:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Low throughput with frequent 
&lt;code&gt;outPoolUsage&lt;/code&gt; values around 100% but low 
&lt;code&gt;inPoolUsage&lt;/code&gt; on all receivers indicates that the 
round-trip time of the credit notification (which depends on your network’s latency) 
is too high for the default number of exclusive buffers to make full use of your 
bandwidth. Consider increasing the &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-network-mem
 [...]
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Combining &lt;code&gt;numRecordsOut&lt;/code&gt; and 
&lt;code&gt;numBytesOut&lt;/code&gt; helps identify the average serialised 
record size, which supports capacity planning for peak 
scenarios.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;If you want to reason about buffer fill rates and the influence 
of the output flusher, you may combine 
&lt;code&gt;numBytesInRemote&lt;/code&gt; with 
&lt;code&gt;numBuffersInRemote&lt;/code&gt;. When tuning for throughput (and 
not latency!), low buffer fill rates may indicate reduced network efficiency. 
In such cases, consider increasing the buffer timeout.
+Please note that, as of Flink 1.8 and 1.9, 
&lt;code&gt;numBuffersOut&lt;/code&gt; only increases for buffers getting full 
or for an event cutting off a buffer (e.g. a checkpoint barrier) and may lag 
behind. Please also note that reasoning about buffer fill rates on local 
channels is unnecessary since buffering is an optimisation technique for remote 
channels with limited effect on local channels.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;You may also separate local from remote traffic using 
&lt;code&gt;numBytesInLocal&lt;/code&gt; and &lt;code&gt;numBytesInRemote&lt;/code&gt;, but in most cases this is 
unnecessary.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
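The metric combinations in the list above boil down to two ratios. The sketch below is a minimal illustration, not part of Flink: the function names are hypothetical, the inputs are assumed to be deltas of the counters over the same time window, and the 32 KiB buffer size is an assumption that should be replaced with the segment size configured on your cluster.

```python
def average_record_size(num_bytes_out, num_records_out):
    """Average serialised record size in bytes (numBytesOut / numRecordsOut)."""
    if num_records_out == 0:
        return 0.0
    return num_bytes_out / num_records_out


def buffer_fill_rate(num_bytes_in_remote, num_buffers_in_remote,
                     buffer_size_bytes=32 * 1024):
    """Fraction of each remote buffer that carried data, between 0.0 and 1.0.

    buffer_size_bytes is an assumed value; use your configured segment size.
    """
    if num_buffers_in_remote == 0:
        return 0.0
    return num_bytes_in_remote / (num_buffers_in_remote * buffer_size_bytes)
```

A fill rate well below 1.0 while tuning for throughput is the situation in which the text suggests a larger buffer timeout.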
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;h3 class=&quot;no_toc&quot; 
id=&quot;span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-what-to-do-with-backpressure&quot;&gt;&lt;span
 class=&quot;glyphicon glyphicon-info-sign&quot; 
aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; What to do with 
Backpressure?&lt;/h3&gt;
+
+  &lt;p&gt;Assuming that you identified where the source of backpressure — a 
bottleneck — is located, the next step is to analyse why this is happening. 
Below, we list some potential causes of backpressure from the more basic to the 
more complex ones. We recommend checking the basic causes first, before diving 
deeper into the more complex ones and potentially drawing false 
conclusions.&lt;/p&gt;
+
+  &lt;p&gt;Please also recall that backpressure might be temporary and the 
result of a load spike, checkpointing, or a job restart with a data backlog 
waiting to be processed. If backpressure is temporary, you can usually ignore 
it. If you do investigate, keep in mind that the intermittent nature of the 
bottleneck can make analysing and solving the issue harder. Having 
said that, here are a couple of things to check.&lt;/p&gt;
+
+  &lt;h4 id=&quot;system-resources&quot;&gt;System Resources&lt;/h4&gt;
+
+  &lt;p&gt;Firstly, you should check the affected machines’ basic resource 
usage like CPU, network, or disk I/O. If some resource is fully or heavily 
utilised, you can do one of the following:&lt;/p&gt;
+
+  &lt;ol&gt;
+    &lt;li&gt;Try to optimise your code. Code profilers are helpful in this 
case.&lt;/li&gt;
+    &lt;li&gt;Tune Flink for that specific resource.&lt;/li&gt;
+    &lt;li&gt;Scale out by increasing the parallelism and/or increasing the 
number of machines in the cluster.&lt;/li&gt;
+  &lt;/ol&gt;
+
+  &lt;h4 id=&quot;garbage-collection&quot;&gt;Garbage Collection&lt;/h4&gt;
+
+  &lt;p&gt;Oftentimes, performance issues arise from long GC pauses. You can 
verify whether you are in such a situation by either printing debug GC logs 
(via &lt;code&gt;-XX:+PrintGCDetails&lt;/code&gt;) or by using a memory/GC 
profiler. Since dealing with GC issues is highly application-dependent and 
independent of Flink, we will not go into details here (&lt;a 
href=&quot;https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/index.html&quot;&gt;Oracle’s
 Garbage Collecti [...]
+
+  &lt;h4 id=&quot;cputhread-bottleneck&quot;&gt;CPU/Thread 
Bottleneck&lt;/h4&gt;
+
+  &lt;p&gt;Sometimes a CPU bottleneck might not be visible at first glance if 
one or a couple of threads are causing the CPU bottleneck while the CPU usage 
of the overall machine remains relatively low. For instance, a single 
CPU-bottlenecked thread on a 48-core machine would result in only about 2% CPU use. 
Consider using code profilers for this as they can identify hot threads by 
showing each thread’s CPU usage, for example.&lt;/p&gt;
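The 48-core example above is simple arithmetic, and the same idea can be used to flag hot threads from per-thread samples. This is a hedged sketch: the function names, the thread names, and the 0.9 threshold are all hypothetical, and per-thread load is assumed to be expressed as a fraction of one core.

```python
def overall_cpu(per_thread_cpu, cores):
    """Machine-wide CPU utilisation given per-thread load (1.0 = one full core)."""
    return sum(per_thread_cpu.values()) / cores


def hot_threads(per_thread_cpu, threshold=0.9):
    """Names of threads that use at least `threshold` of a single core."""
    return sorted(name for name, load in per_thread_cpu.items() if load >= threshold)


# Hypothetical sample: one saturated task thread on a 48-core machine.
sample = {"map-subtask-0": 1.0, "netty-io": 0.0}
print(f"{overall_cpu(sample, 48):.1%}")  # prints 2.1%
print(hot_threads(sample))               # prints ['map-subtask-0']
```

The machine-wide figure looks harmless even though one thread is fully saturated, which is why per-thread numbers from a profiler are more informative here.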
+
+  &lt;h4 id=&quot;thread-contention&quot;&gt;Thread Contention&lt;/h4&gt;
+
+  &lt;p&gt;Similarly to the CPU/thread bottleneck issue above, a subtask may 
be bottlenecked due to high thread contention on shared resources. Again, CPU 
profilers are your best friend here! Consider looking for synchronisation 
overhead / lock contention in user code — although adding synchronisation in 
user code should be avoided and may even be dangerous! Also consider 
investigating shared system resources. The default JVM’s SSL implementation, 
for example, can become contended around [...]
+
+  &lt;h4 id=&quot;load-imbalance&quot;&gt;Load Imbalance&lt;/h4&gt;
+
+  &lt;p&gt;If your bottleneck is caused by data skew, you can try to remove it 
or mitigate its impact by changing the data partitioning to separate heavy keys 
or by implementing local/pre-aggregation.&lt;/p&gt;
+
+  &lt;p&gt;&lt;br /&gt;
+This list is far from exhaustive. Generally, in order to reduce a bottleneck 
and thus backpressure, first analyse where it is happening and then find out 
why. The best place to start reasoning about the “why” is by checking what 
resources are fully utilised.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;h3 id=&quot;latency-tracking&quot;&gt;Latency Tracking&lt;/h3&gt;
+
+&lt;p&gt;Tracking latencies at the various locations where they may occur is a topic 
of its own. In this section, we will focus on the time records wait inside 
Flink’s network stack — including the system’s network connections. In low 
throughput scenarios, these latencies are influenced directly by the output 
flusher via the buffer timeout parameter or indirectly by any application code 
latencies. When processing a record takes longer than expected or when 
(multiple) timers fire at the same ti [...]
+
+&lt;p&gt;Flink offers some support for &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#latency-tracking&quot;&gt;tracking
 the latency&lt;/a&gt; of records passing through the system (outside of user 
code). However, this is disabled by default (see below for why!) and must be 
enabled by setting a latency tracking interval either in Flink’s &lt;a 
href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#metrics-l
 [...]
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;single&lt;/code&gt;: one histogram for each operator 
subtask&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;operator&lt;/code&gt; (default): one histogram for 
each combination of source task and operator subtask&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;subtask&lt;/code&gt;: one histogram for each 
combination of source subtask and operator subtask (quadratic in the 
parallelism!)&lt;/li&gt;
+&lt;/ul&gt;
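To get a feeling for how quickly the number of histograms grows with each of the three granularities listed above, consider the following sketch. It is an illustration under simplifying assumptions: `latency_histogram_count` is a hypothetical helper, and every source and operator is assumed to run with the same parallelism.

```python
def latency_histogram_count(granularity, num_sources, num_operators, parallelism):
    """Estimate the number of latency histograms for a job.

    Assumes num_sources source tasks and num_operators measuring operators,
    all running with the same parallelism.
    """
    operator_subtasks = num_operators * parallelism
    if granularity == "single":
        # One histogram per operator subtask.
        return operator_subtasks
    if granularity == "operator":
        # One per (source task, operator subtask) pair.
        return num_sources * operator_subtasks
    if granularity == "subtask":
        # One per (source subtask, operator subtask) pair: quadratic in parallelism.
        return num_sources * parallelism * operator_subtasks
    raise ValueError(f"unknown granularity: {granularity}")
```

With 2 sources, 5 operators, and a parallelism of 10, this yields 50, 100, and 1000 histograms respectively; doubling the parallelism doubles the first two figures but quadruples the last.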
+
+&lt;p&gt;These metrics are collected through special “latency markers”: each 
source subtask will periodically emit a special record containing the timestamp 
of its creation. The latency markers then flow alongside normal records while 
not overtaking them on the wire or inside a buffer queue. However, &lt;em&gt;a 
latency marker does not enter application logic&lt;/em&gt; and overtakes 
records there. Latency markers therefore only measure the waiting time between 
the user code and not  [...]
+
+&lt;p&gt;Since &lt;code&gt;LatencyMarkers&lt;/code&gt; sit in network buffers 
just like normal records, they will also wait for the buffer to be full or 
flushed due to buffer timeouts. When a channel is under high load, network 
buffering adds no latency. However, as soon as one channel is 
under low load, records and latency markers will experience an expected average 
delay of at most &lt;code&gt;buffer_timeout / 2&lt;/code&gt;. This delay will 
add to each network conne [...]
+
+&lt;p&gt;By looking at the exposed latency tracking metrics for each subtask, 
for example at the 95th percentile, you should nevertheless be able to identify 
subtasks which are adding substantially to the overall source-to-sink latency 
and continue with optimising there.&lt;/p&gt;
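As a rough sanity check when reading these latency numbers, the buffering delay discussed above can be bounded with a one-line calculation. This sketch assumes that the expected delay of at most `buffer_timeout / 2` accrues once per low-load network connection on the path from source to sink; the function name and parameters are hypothetical.

```python
def max_expected_buffering_delay_ms(buffer_timeout_ms, low_load_connections):
    """Upper bound on the expected delay added by the output flusher.

    Each low-load connection is assumed to contribute at most
    buffer_timeout_ms / 2 on average.
    """
    return low_load_connections * buffer_timeout_ms / 2
```

For example, with a buffer timeout of 100 ms and three low-load connections, up to 150 ms of the measured latency may be explained by buffering alone rather than by any slow subtask.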
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-info&quot; style=&quot;display: 
inline-block&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-info-sign&quot; 
aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Note&lt;/span&gt;
+Flink’s latency markers assume that the clocks on all machines in the cluster 
are in sync. We recommend setting up an automated clock synchronisation service 
(like NTP) to avoid false latency results.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;alert alert-warning&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-warning&quot; style=&quot;display: 
inline-block&quot;&gt;&lt;span class=&quot;glyphicon 
glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; 
Warning&lt;/span&gt;
+Enabling latency metrics can significantly impact the performance of the 
cluster (in particular for &lt;code&gt;subtask&lt;/code&gt; granularity) due to 
the sheer amount of metrics being added as well as the use of histograms which 
are quite expensive to maintain. It is highly recommended to only use them for 
debugging purposes.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
+
+&lt;p&gt;In the previous sections we discussed how to monitor Flink’s network 
stack which primarily involves identifying backpressure: where it occurs, where 
it originates from, and (potentially) why it occurs. This can be done in 
two ways: for simple cases and debugging sessions by using the backpressure 
monitor; for continuous monitoring, more in-depth analysis, and less runtime 
overhead by using Flink’s task and network stack metrics. Backpressure can be 
caused by the network laye [...]
+
+&lt;p&gt;Stay tuned for the third blog post in the series of network stack 
posts that will focus on tuning techniques and anti-patterns to avoid.&lt;/p&gt;
+
+</description>
+<pubDate>Tue, 23 Jul 2019 17:30:00 +0200</pubDate>
+<link>https://flink.apache.org/2019/07/23/flink-network-stack-2.html</link>
+<guid isPermaLink="true">/2019/07/23/flink-network-stack-2.html</guid>
+</item>
+
+<item>
 <title>Apache Flink 1.8.1 Released</title>
 <description>&lt;p&gt;The Apache Flink community released the first bugfix 
version of the Apache Flink 1.8 series.&lt;/p&gt;
 
diff --git a/content/blog/index.html b/content/blog/index.html
index 71c9647..1a5e52b 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -162,6 +162,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: 
Monitoring, Metrics, and that Backpressure Thing</a></h2>
+
+      <p>23 Jul 2019
+       Nico Kruber  &amp; Piotr Nowojski </p>
+
+      <p>In a previous blog post, we presented how Flink’s network stack works 
from the high-level abstractions to the low-level details. This second  post 
discusses monitoring network-related metrics to identify backpressure or 
bottlenecks in throughput and latency.</p>
+
+      <p><a href="/2019/07/23/flink-network-stack-2.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></h2>
 
       <p>02 Jul 2019
@@ -288,19 +301,6 @@ for more details.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2019/03/06/ffsf-preview.html">What 
to expect from Flink Forward San Francisco 2019</a></h2>
-
-      <p>06 Mar 2019
-       Fabian Hueske (<a href="https://twitter.com/fhueske">@fhueske</a>)</p>
-
-      <p>The third annual Flink Forward conference in San Francisco is just a 
few weeks away. Let's see what Flink Forward SF 2019 has in store for the 
Apache Flink and stream processing communities. This post covers some of its 
highlights!</p>
-
-      <p><a href="/news/2019/03/06/ffsf-preview.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -333,6 +333,16 @@ for more details.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page2/index.html b/content/blog/page2/index.html
index e639939..084a7f7 100644
--- a/content/blog/page2/index.html
+++ b/content/blog/page2/index.html
@@ -162,6 +162,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2019/03/06/ffsf-preview.html">What 
to expect from Flink Forward San Francisco 2019</a></h2>
+
+      <p>06 Mar 2019
+       Fabian Hueske (<a href="https://twitter.com/fhueske">@fhueske</a>)</p>
+
+      <p>The third annual Flink Forward conference in San Francisco is just a 
few weeks away. Let's see what Flink Forward SF 2019 has in store for the 
Apache Flink and stream processing communities. This post covers some of its 
highlights!</p>
+
+      <p><a href="/news/2019/03/06/ffsf-preview.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2019/02/25/monitoring-best-practices.html">Monitoring Apache Flink 
Applications 101</a></h2>
 
       <p>25 Feb 2019
@@ -294,21 +307,6 @@ Please check the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2018/10/29/release-1.5.5.html">Apache Flink 1.5.5 Released</a></h2>
-
-      <p>29 Oct 2018
-      </p>
-
-      <p><p>The Apache Flink community released the fifth bugfix version of 
the Apache Flink 1.5 series.</p>
-
-</p>
-
-      <p><a href="/news/2018/10/29/release-1.5.5.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -341,6 +339,16 @@ Please check the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page3/index.html b/content/blog/page3/index.html
index a81abee..dab64bf 100644
--- a/content/blog/page3/index.html
+++ b/content/blog/page3/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2018/10/29/release-1.5.5.html">Apache Flink 1.5.5 Released</a></h2>
+
+      <p>29 Oct 2018
+      </p>
+
+      <p><p>The Apache Flink community released the fifth bugfix version of 
the Apache Flink 1.5 series.</p>
+
+</p>
+
+      <p><a href="/news/2018/10/29/release-1.5.5.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2018/09/20/release-1.6.1.html">Apache Flink 1.6.1 Released</a></h2>
 
       <p>20 Sep 2018
@@ -296,19 +311,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">An 
Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache 
Kafka, too!)</a></h2>
-
-      <p>01 Mar 2018
-       Piotr Nowojski (<a 
href="https://twitter.com/PiotrNowojski">@PiotrNowojski</a>) &amp; Mike Winters 
(<a href="https://twitter.com/wints">@wints</a>)</p>
-
-      <p>Flink 1.4.0 introduced a new feature that makes it possible to build 
end-to-end exactly-once applications with Flink and data sources and sinks that 
support transactions.</p>
-
-      <p><a 
href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -341,6 +343,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page4/index.html b/content/blog/page4/index.html
index bd6aa7b..e67e9f0 100644
--- a/content/blog/page4/index.html
+++ b/content/blog/page4/index.html
@@ -162,6 +162,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">An 
Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache 
Kafka, too!)</a></h2>
+
+      <p>01 Mar 2018
+       Piotr Nowojski (<a 
href="https://twitter.com/PiotrNowojski">@PiotrNowojski</a>) &amp; Mike Winters 
(<a href="https://twitter.com/wints">@wints</a>)</p>
+
+      <p>Flink 1.4.0 introduced a new feature that makes it possible to build 
end-to-end exactly-once applications with Flink and data sources and sinks that 
support transactions.</p>
+
+      <p><a 
href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2018/02/15/release-1.4.1.html">Apache Flink 1.4.1 Released</a></h2>
 
       <p>15 Feb 2018
@@ -295,21 +308,6 @@ what’s coming in Flink 1.4.0 as well as a preview of what 
the Flink community
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2017/05/16/official-docker-image.html">Introducing Docker Images 
for Apache Flink</a></h2>
-
-      <p>16 May 2017 by Patrick Lucas (Data Artisans) and Ismaël Mejía 
(Talend) (<a href="https://twitter.com/">@iemejia</a>)
-      </p>
-
-      <p><p>For some time, the Apache Flink community has provided scripts to 
build a Docker image to run Flink. Now, starting with version 1.2.1, Flink will 
have a <a href="https://hub.docker.com/r/_/flink/">Docker image</a> on the 
Docker Hub. This image is maintained by the Flink community and curated by the 
<a href="https://github.com/docker-library/official-images">Docker</a> team to 
ensure it meets the quality standards for container images of the Docker 
community.</p>
-
-</p>
-
-      <p><a href="/news/2017/05/16/official-docker-image.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -342,6 +340,16 @@ what’s coming in Flink 1.4.0 as well as a preview of what 
the Flink community
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page5/index.html b/content/blog/page5/index.html
index 1243990..8463dfb 100644
--- a/content/blog/page5/index.html
+++ b/content/blog/page5/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2017/05/16/official-docker-image.html">Introducing Docker Images 
for Apache Flink</a></h2>
+
+      <p>16 May 2017 by Patrick Lucas (Data Artisans) and Ismaël Mejía 
(Talend) (<a href="https://twitter.com/">@iemejia</a>)
+      </p>
+
+      <p><p>For some time, the Apache Flink community has provided scripts to 
build a Docker image to run Flink. Now, starting with version 1.2.1, Flink will 
have a <a href="https://hub.docker.com/r/_/flink/">Docker image</a> on the 
Docker Hub. This image is maintained by the Flink community and curated by the 
<a href="https://github.com/docker-library/official-images">Docker</a> team to 
ensure it meets the quality standards for container images of the Docker 
community.</p>
+
+</p>
+
+      <p><a href="/news/2017/05/16/official-docker-image.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2017/04/26/release-1.2.1.html">Apache Flink 1.2.1 Released</a></h2>
 
       <p>26 Apr 2017
@@ -289,21 +304,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2016/08/24/ff16-keynotes-panels.html">Flink Forward 2016: 
Announcing Schedule, Keynotes, and Panel Discussion</a></h2>
-
-      <p>24 Aug 2016
-      </p>
-
-      <p><p>An update for the Flink community: the <a 
href="http://flink-forward.org/kb_day/day-1/">Flink Forward 2016 schedule</a> 
is now available online. This year's event will include 2 days of talks from 
stream processing experts at Google, MapR, Alibaba, Netflix, Cloudera, and 
more. Following the talks is a full day of hands-on Flink training.</p>
-
-</p>
-
-      <p><a href="/news/2016/08/24/ff16-keynotes-panels.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -336,6 +336,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page6/index.html b/content/blog/page6/index.html
index 2a78432..7d4d19a 100644
--- a/content/blog/page6/index.html
+++ b/content/blog/page6/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2016/08/24/ff16-keynotes-panels.html">Flink Forward 2016: 
Announcing Schedule, Keynotes, and Panel Discussion</a></h2>
+
+      <p>24 Aug 2016
+      </p>
+
+      <p><p>An update for the Flink community: the <a 
href="http://flink-forward.org/kb_day/day-1/">Flink Forward 2016 schedule</a> 
is now available online. This year's event will include 2 days of talks from 
stream processing experts at Google, MapR, Alibaba, Netflix, Cloudera, and 
more. Following the talks is a full day of hands-on Flink training.</p>
+
+</p>
+
+      <p><a href="/news/2016/08/24/ff16-keynotes-panels.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2016/08/11/release-1.1.1.html">Flink 1.1.1 Released</a></h2>
 
       <p>11 Aug 2016
@@ -293,21 +308,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2016/02/11/release-0.10.2.html">Flink 0.10.2 Released</a></h2>
-
-      <p>11 Feb 2016
-      </p>
-
-      <p><p>Today, the Flink community released Flink version 
<strong>0.10.2</strong>, the second bugfix release of the 0.10 series.</p>
-
-</p>
-
-      <p><a href="/news/2016/02/11/release-0.10.2.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -340,6 +340,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page7/index.html b/content/blog/page7/index.html
index cff4bad..bc908cb 100644
--- a/content/blog/page7/index.html
+++ b/content/blog/page7/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2016/02/11/release-0.10.2.html">Flink 0.10.2 Released</a></h2>
+
+      <p>11 Feb 2016
+      </p>
+
+      <p><p>Today, the Flink community released Flink version 
<strong>0.10.2</strong>, the second bugfix release of the 0.10 series.</p>
+
+</p>
+
+      <p><a href="/news/2016/02/11/release-0.10.2.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2015/12/18/a-year-in-review.html">Flink 2015: A year in review, and 
a lookout to 2016</a></h2>
 
       <p>18 Dec 2015 by Robert Metzger (<a 
href="https://twitter.com/">@rmetzger_</a>)
@@ -297,21 +312,6 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Announcing 
Apache Flink 0.9.0</a></h2>
-
-      <p>24 Jun 2015
-      </p>
-
-      <p><p>The Apache Flink community is pleased to announce the availability 
of the 0.9.0 release. The release is the result of many months of hard work 
within the Flink community. It contains many new features and improvements 
which were previewed in the 0.9.0-milestone1 release and have been polished 
since then. This is the largest Flink release so far.</p>
-
-</p>
-
-      <p><a 
href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Continue 
reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -344,6 +344,16 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page8/index.html b/content/blog/page8/index.html
index 4956e66..88f15bc 100644
--- a/content/blog/page8/index.html
+++ b/content/blog/page8/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Announcing 
Apache Flink 0.9.0</a></h2>
+
+      <p>24 Jun 2015
+      </p>
+
+      <p><p>The Apache Flink community is pleased to announce the availability 
of the 0.9.0 release. The release is the result of many months of hard work 
within the Flink community. It contains many new features and improvements 
which were previewed in the 0.9.0-milestone1 release and have been polished 
since then. This is the largest Flink release so far.</p>
+
+</p>
+
+      <p><a 
href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Continue 
reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2015/05/14/Community-update-April.html">April 2015 in the Flink 
community</a></h2>
 
       <p>14 May 2015 by Kostas Tzoumas (<a 
href="https://twitter.com/">@kostas_tzoumas</a>)
@@ -303,21 +318,6 @@ and offers a new API including definition of flexible 
windows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a 
href="/news/2015/01/06/december-in-flink.html">December 2014 in the Flink 
community</a></h2>
-
-      <p>06 Jan 2015
-      </p>
-
-      <p><p>This is the first blog post of a “newsletter” like series where we 
give a summary of the monthly activity in the Flink community. As the Flink 
project grows, this can serve as a “tl;dr” for people that are not following 
the Flink dev and user mailing lists, or those that are simply overwhelmed by 
the traffic.</p>
-
-</p>
-
-      <p><a href="/news/2015/01/06/december-in-flink.html">Continue reading 
&raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -350,6 +350,16 @@ and offers a new API including definition of flexible 
windows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/blog/page9/index.html b/content/blog/page9/index.html
index a3851f8..7772c90 100644
--- a/content/blog/page9/index.html
+++ b/content/blog/page9/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a 
href="/news/2015/01/06/december-in-flink.html">December 2014 in the Flink 
community</a></h2>
+
+      <p>06 Jan 2015
+      </p>
+
+      <p><p>This is the first blog post of a “newsletter” like series where we 
give a summary of the monthly activity in the Flink community. As the Flink 
project grows, this can serve as a “tl;dr” for people that are not following 
the Flink dev and user mailing lists, or those that are simply overwhelmed by 
the traffic.</p>
+
+</p>
+
+      <p><a href="/news/2015/01/06/december-in-flink.html">Continue reading 
&raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a 
href="/news/2014/11/18/hadoop-compatibility.html">Hadoop Compatibility in 
Flink</a></h2>
 
       <p>18 Nov 2014 by Fabian Hüske (<a 
href="https://twitter.com/">@fhueske</a>)
@@ -271,6 +286,16 @@ academic and open source project that Flink originates 
from.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack 
Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></li>
 
       
diff --git a/content/css/flink.css b/content/css/flink.css
index 9f13341..5c0f621 100755
--- a/content/css/flink.css
+++ b/content/css/flink.css
@@ -87,6 +87,11 @@ h1, h2, h3, h4, h5, h6 {
     margin-top: -60px;
 }
 
+/* fix conflict with bootstrap's alert */
+.alert h4 {
+       margin-top: 0;
+}
+
 h1 {
        font-size: 160%;
 }
diff --git 
a/content/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png 
b/content/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png
new file mode 100644
index 0000000..15372fd
Binary files /dev/null and 
b/content/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png 
differ
diff --git a/content/index.html b/content/index.html
index d8e8537..48f71fc 100644
--- a/content/index.html
+++ b/content/index.html
@@ -462,6 +462,9 @@
 
   <dl>
       
+        <dt> <a href="/2019/07/23/flink-network-stack-2.html">Flink Network 
Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></dt>
+        <dd>In a previous blog post, we presented how Flink’s network stack 
works from the high-level abstractions to the low-level details. This second 
post discusses monitoring network-related metrics to identify backpressure or 
bottlenecks in throughput and latency.</dd>
+      
         <dt> <a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></dt>
         <dd><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.8 series.</p>
 
@@ -475,9 +478,6 @@
       
         <dt> <a href="/2019/05/19/state-ttl.html">State TTL in Flink 1.8.0: 
How to Automatically Cleanup Application State in Apache Flink</a></dt>
         <dd>A common requirement for many stateful streaming applications is 
to automatically cleanup application state for effective management of your 
state size, or to control how long the application state can be accessed. State 
TTL enables application state cleanup and efficient state size management in 
Apache Flink</dd>
-      
-        <dt> <a href="/2019/05/14/temporal-tables.html">Flux capacitor, huh? 
Temporal Tables and Joins in Streaming SQL</a></dt>
-        <dd>Apache Flink natively supports temporal table joins since the 1.7 
release for straightforward temporal data handling. In this blog post, we 
provide an overview of how this new concept can be leveraged for effective 
point-in-time analysis in streaming scenarios.</dd>
     
   </dl>
 
diff --git a/content/roadmap.html b/content/roadmap.html
index aed804b..780a53b 100644
--- a/content/roadmap.html
+++ b/content/roadmap.html
@@ -180,7 +180,7 @@ under the License.
 
 <div class="page-toc">
 <ul id="markdown-toc">
-  <li><a 
href="#analytics-applications-an-the-roles-of-datastream-dataset-and-table-api" 
id="markdown-toc-analytics-applications-an-the-roles-of-datastream-dataset-and-table-api">Analytics,
 Applications, an the roles of DataStream, DataSet, and Table API</a></li>
+  <li><a 
href="#analytics-applications-and-the-roles-of-datastream-dataset-and-table-api"
 
id="markdown-toc-analytics-applications-and-the-roles-of-datastream-dataset-and-table-api">Analytics,
 Applications, and the roles of DataStream, DataSet, and Table API</a></li>
   <li><a href="#batch-and-streaming-unification" 
id="markdown-toc-batch-and-streaming-unification">Batch and Streaming 
Unification</a></li>
   <li><a href="#fast-batch-bounded-streams" 
id="markdown-toc-fast-batch-bounded-streams">Fast Batch (Bounded 
Streams)</a></li>
   <li><a href="#stream-processing-use-cases" 
id="markdown-toc-stream-processing-use-cases">Stream Processing Use 
Cases</a></li>
@@ -202,7 +202,7 @@ there is consensus that they will happen and what they will 
roughly look like fo
 
 <p><strong>Last Update:</strong> 2019-05-08</p>
 
-<h1 
id="analytics-applications-an-the-roles-of-datastream-dataset-and-table-api">Analytics,
 Applications, an the roles of DataStream, DataSet, and Table API</h1>
+<h1 
id="analytics-applications-and-the-roles-of-datastream-dataset-and-table-api">Analytics,
 Applications, and the roles of DataStream, DataSet, and Table API</h1>
 
 <p>Flink views stream processing as a <a 
href="/flink-architecture.html">unifying paradigm for data processing</a>
 (batch and real-time) and event-driven applications. The APIs are evolving to 
reflect that view:</p>
diff --git a/content/zh/community.html b/content/zh/community.html
index 19133f5..e6136e1 100644
--- a/content/zh/community.html
+++ b/content/zh/community.html
@@ -580,6 +580,12 @@
     <td class="text-center">shaoxuan</td>
   </tr>
   <tr>
+    <td class="text-center"><img 
src="https://avatars3.githubusercontent.com/u/12387855?s=50" 
class="committer-avatar" /></td>
+    <td class="text-center">Zhijiang Wang</td>
+    <td class="text-center">Committer</td>
+    <td class="text-center">zhijiang</td>
+  </tr>
+  <tr>
     <td class="text-center"><img 
src="https://avatars1.githubusercontent.com/u/1826769?s=50" 
class="committer-avatar" /></td>
     <td class="text-center">Daniel Warneke</td>
     <td class="text-center">PMC, Committer</td>
diff --git a/content/zh/index.html b/content/zh/index.html
index b0be24d..05ee8d0 100644
--- a/content/zh/index.html
+++ b/content/zh/index.html
@@ -460,6 +460,9 @@
 
   <dl>
       
+        <dt> <a href="/2019/07/23/flink-network-stack-2.html">Flink Network 
Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></dt>
+        <dd>In a previous blog post, we presented how Flink’s network stack 
works from the high-level abstractions to the low-level details. This second 
post discusses monitoring network-related metrics to identify backpressure or 
bottlenecks in throughput and latency.</dd>
+      
         <dt> <a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 
Released</a></dt>
         <dd><p>The Apache Flink community released the first bugfix version of 
the Apache Flink 1.8 series.</p>
 
@@ -473,9 +476,6 @@
       
         <dt> <a href="/2019/05/19/state-ttl.html">State TTL in Flink 1.8.0: 
How to Automatically Cleanup Application State in Apache Flink</a></dt>
         <dd>A common requirement for many stateful streaming applications is 
to automatically cleanup application state for effective management of your 
state size, or to control how long the application state can be accessed. State 
TTL enables application state cleanup and efficient state size management in 
Apache Flink</dd>
-      
-        <dt> <a href="/2019/05/14/temporal-tables.html">Flux capacitor, huh? 
Temporal Tables and Joins in Streaming SQL</a></dt>
-        <dd>Apache Flink natively supports temporal table joins since the 1.7 
release for straightforward temporal data handling. In this blog post, we 
provide an overview of how this new concept can be leveraged for effective 
point-in-time analysis in streaming scenarios.</dd>
     
   </dl>
