Repository: flink
Updated Branches:
  refs/heads/master 21742b2d7 -> 392b2e9a0


[FLINK-5454] [docs] Add documentation how to tune streaming applications for 
large state


Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/392b2e9a
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/392b2e9a
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/392b2e9a

Branch: refs/heads/master
Commit: 392b2e9a0595a947a24e1504cbfd5f9cabc1c86a
Parents: 21742b2
Author: Stephan Ewen <[email protected]>
Authored: Sun Jan 22 20:51:45 2017 +0100
Committer: Stephan Ewen <[email protected]>
Committed: Sun Jan 22 21:05:16 2017 +0100

----------------------------------------------------------------------
 docs/fig/checkpoint_tuning.svg        | 507 +++++++++++++++++++++++++++++
 docs/monitoring/large_state_tuning.md | 187 ++++++++++-
 2 files changed, 678 insertions(+), 16 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/392b2e9a/docs/fig/checkpoint_tuning.svg
----------------------------------------------------------------------
diff --git a/docs/fig/checkpoint_tuning.svg b/docs/fig/checkpoint_tuning.svg
new file mode 100644
index 0000000..2639f6a
--- /dev/null
+++ b/docs/fig/checkpoint_tuning.svg
@@ -0,0 +1,507 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<svg
+   xmlns:dc="http://purl.org/dc/elements/1.1/"
+   xmlns:cc="http://creativecommons.org/ns#"
+   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+   xmlns:svg="http://www.w3.org/2000/svg"
+   xmlns="http://www.w3.org/2000/svg"
+   version="1.1"
+   width="1116.3304"
+   height="896.67767"
+   id="svg2">
+  <defs
+     id="defs4" />
+  <metadata
+     id="metadata7">
+    <rdf:RDF>
+      <cc:Work
+         rdf:about="">
+        <dc:format>image/svg+xml</dc:format>
+        <dc:type
+           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+        <dc:title></dc:title>
+      </cc:Work>
+    </rdf:RDF>
+  </metadata>
+  <g
+     transform="translate(255.30809,4.5480758)"
+     id="layer1">
+    <g
+       transform="translate(-274.66659,45.499311)"
+       id="g2989">
+      <path
+         d="m 106.07643,701.83735 1003.26247,0 0,1.87546 -1003.26247,0 
0,-1.87546 z m 1002.02467,-2.81321 7.5018,3.75094 -7.5018,3.75093 0,-7.50187 z"
+         id="path2991"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="1094.0844"
+         y="738.86139"
+         id="text2993"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">time</text>
+      <path
+         d="m 106.07643,664.17796 0,28.91971 210.05233,0 0,-28.91971 
-210.05233,0 z"
+         id="path2995"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 316.12876,664.17796 0,28.91971 106.11393,0 0,-28.91971 
-106.11393,0 z"
+         id="path2997"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 422.24269,664.17796 0,28.91971 210.05233,0 0,-28.91971 
-210.05233,0 z"
+         id="path2999"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 632.29502,664.17796 0,28.91971 106.11394,0 0,-28.91971 
-106.11394,0 z"
+         id="path3001"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 738.40896,664.17796 0,28.91971 210.05233,0 0,-28.91971 
-210.05233,0 z"
+         id="path3003"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 948.46129,664.17796 0,28.91971 105.96391,0 0,-28.91971 
-105.96391,0 z"
+         id="path3005"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 106.07643,388.78436 1003.26247,0 0,1.87546 -1003.26247,0 
0,-1.87546 z m 1002.02467,-2.8132 7.5018,3.75093 -7.5018,3.75093 0,-7.50186 z"
+         id="path3007"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="1094.0844"
+         y="425.83395"
+         id="text3009"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">time</text>
+      <path
+         d="m 106.05767,351.10622 0,28.9197 210.05233,0 0,-28.9197 
-210.05233,0 z"
+         id="path3011"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 316.11,351.10622 0,28.9197 4.51988,0 0,-28.9197 -4.51988,0 z"
+         id="path3013"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 319.22328,351.10622 0,28.9197 210.05233,0 0,-28.9197 
-210.05233,0 z"
+         id="path3015"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 106.07643,145.61128 1003.26247,0 0,1.87546 -1003.26247,0 
0,-1.87546 z m 1002.02467,-2.81321 7.5018,3.75094 -7.5018,3.75093 0,-7.50187 z"
+         id="path3017"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="1094.0844"
+         y="182.71945"
+         id="text3019"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">time</text>
+      <path
+         d="m 106.04829,108.08318 0,28.91032 28.91033,0 0,-28.91032 
-28.91033,0 z"
+         id="path3021"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 134.95862,108.08318 0,28.91032 106.12331,0 0,-28.91032 
-106.12331,0 z"
+         id="path3023"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 106.04829,161.53399 0,22.73066"
+         id="path3025"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 239.98479,161.53399 0,22.73066"
+         id="path3027"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 241.08193,172.93683 -135.03364,0"
+         id="path3029"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="452.19522"
+         y="-31.805758"
+         id="text3031"
+         xml:space="preserve"
+         
style="font-size:25.05624199px;font-style:normal;font-weight:bold;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Regular
 Operation</text>
+      <text
+         x="407.03397"
+         y="-4.1988807"
+         id="text3033"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">(some
 </text>
+      <text
+         x="475.7511"
+         y="-4.1988807"
+         id="text3035"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">checkpointing</text>
+      <text
+         x="612.88525"
+         y="-4.1988807"
+         id="text3037"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">-</text>
+      <text
+         x="620.38715"
+         y="-4.1988807"
+         id="text3039"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">free
 time)</text>
+      <path
+         d="m 241.08193,108.08318 0,28.91032 28.91033,0 0,-28.91032 
-28.91033,0 z"
+         id="path3041"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 269.99226,108.08318 0,28.9197 106.13269,0 0,-28.9197 
-106.13269,0 z"
+         id="path3043"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 376.12495,108.08318 0,28.9197 28.90095,0 0,-28.9197 -28.90095,0 
z"
+         id="path3045"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 405.0259,108.08318 0,28.9197 106.13269,0 0,-28.9197 -106.13269,0 
z"
+         id="path3047"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 106.05767,406.75133 0,22.73066"
+         id="path3049"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 239.98479,406.75133 0,22.73066"
+         id="path3051"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 241.09131,418.15417 -135.03364,0"
+         id="path3053"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="88.081291"
+         y="213.86913"
+         id="text3055"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Checkpoint
 Interval</text>
+      <path
+         d="m 241.08193,161.53399 0,22.73066"
+         id="path3057"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 375.01843,161.53399 0,22.73066"
+         id="path3059"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 376.12495,172.93683 -135.03364,0"
+         id="path3061"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 373.93066,161.53399 0,22.73066"
+         id="path3063"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 507.87652,161.53399 0,22.73066"
+         id="path3065"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 508.9643,172.93683 -135.03364,0"
+         id="path3067"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 508.9643,161.53399 0,22.73066"
+         id="path3069"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 642.91017,161.55275 0,22.73066"
+         id="path3071"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 643.99794,172.95559 -135.03364,0"
+         id="path3073"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 507.87652,108.08318 0,28.9197 28.90095,0 0,-28.9197 -28.90095,0 
z"
+         id="path3075"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 536.79623,108.10193 0,28.91971 106.11394,0 0,-28.91971 
-106.11394,0 z"
+         id="path3077"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <text
+         x="17.819006"
+         y="47.55584"
+         id="text3079"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Duration
 of Checkpoint</text>
+      <text
+         x="266.45242"
+         y="59.093952"
+         id="text3081"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Processing
 without</text>
+      <text
+         x="266.45242"
+         y="83.099937"
+         id="text3083"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">a
 checkpoint in progress </text>
+      <path
+         d="m 106.07643,719.16666 0,22.76817"
+         id="path3085"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 239.98479,719.16666 0,22.76817"
+         id="path3087"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 241.11007,730.60701 -135.03364,0"
+         id="path3089"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 318.60437,646.96117 0,11.25281 -2.8132,0 0,-11.25281 2.8132,0 z 
m 0,19.69241 0,11.2528 -2.8132,0 0,-11.2528 2.8132,0 z m 0,19.69241 0,11.2528 
-2.8132,0 0,-11.2528 2.8132,0 z m 0,19.6924 0,11.25281 -2.8132,0 0,-11.25281 
2.8132,0 z m 0,19.69241 0,11.2528 -2.8132,0 0,-11.2528 2.8132,0 z m 0,19.6924 
0,11.25281 -2.8132,0 0,-11.25281 2.8132,0 z"
+         id="path3091"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="88.37674"
+         y="456.98352"
+         id="text3093"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Checkpoint
 Interval</text>
+      <text
+         x="285.33713"
+         y="475.22153"
+         id="text3095"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Checkpoint
 completes late,</text>
+      <text
+         x="285.33713"
+         y="499.22751"
+         id="text3097"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Immediately
 triggers next</text>
+      <path
+         d="m 321.60512,453.07537 -4.36984,-21.17402 2.75694,-0.56264 
4.36984,21.15527 -2.75694,0.58139 z m -6.8267,-19.22354 2.41935,-9.13352 
5.8327,7.42685 -8.25205,1.70667 z"
+         id="path3099"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.01875467px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="102.75316"
+         y="767.70654"
+         id="text3101"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Base
 Checkpoint</text>
+      <text
+         x="145.66385"
+         y="791.71252"
+         id="text3103"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Interval</text>
+      <path
+         d="m 107.38925,55.288774 12.26556,37.20927 -2.67254,0.881469 
-12.26556,-37.218647 2.67254,-0.872092 z m 14.49737,34.986841 -1.3691,9.339827 
-6.64853,-6.695418 8.01763,-2.644409 z"
+         id="path3105"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.00937734px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 255.01666,76.303384 -50.33754,20.948969 1.08777,2.597522 
50.32816,-20.948969 -1.07839,-2.597522 z m -50.12187,17.816939 
-6.17028,7.136157 9.41484,0.65641 -3.24456,-7.792567 z"
+         id="path3107"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.00937734px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 642.91017,161.55275 0,22.73066"
+         id="path3109"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 776.85603,161.55275 0,22.73066"
+         id="path3111"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 777.94381,172.95559 -135.03364,0"
+         id="path3113"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 641.82239,108.10193 0,28.91971 29.06975,0 0,-28.91971 
-29.06975,0 z"
+         id="path3115"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 670.89214,108.10193 0,28.91971 105.96389,0 0,-28.91971 
-105.96389,0 z"
+         id="path3117"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 777.94381,161.55275 0,22.73066"
+         id="path3119"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 911.88967,161.55275 0,22.73066"
+         id="path3121"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 912.97745,172.95559 -135.03364,0"
+         id="path3123"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 776.85603,108.10193 0,28.91971 29.06975,0 0,-28.91971 
-29.06975,0 z"
+         id="path3125"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 805.92578,108.10193 0,28.91971 105.96389,0 0,-28.91971 
-105.96389,0 z"
+         id="path3127"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 911.88967,161.55275 0,22.73066"
+         id="path3129"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 1045.9856,161.55275 0,22.73066"
+         id="path3131"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 1046.9233,172.95559 -135.03363,0"
+         id="path3133"
+         
style="fill:none;stroke:#000000;stroke-width:2.81320095px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 910.95194,108.10193 0,28.91971 28.91971,0 0,-28.91971 
-28.91971,0 z"
+         id="path3135"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 939.87165,108.10193 0,28.91971 106.11395,0 0,-28.91971 
-106.11395,0 z"
+         id="path3137"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 527.40014,351.10622 0,28.9197 4.53863,0 0,-28.9197 -4.53863,0 z"
+         id="path3139"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 530.68221,351.12497 0,28.91971 210.05233,0 0,-28.91971 
-210.05233,0 z"
+         id="path3141"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 738.85907,351.12497 0,28.91971 4.38859,0 0,-28.91971 -4.38859,0 
z"
+         id="path3143"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 742.00985,351.12497 0,28.91971 210.05233,0 0,-28.91971 
-210.05233,0 z"
+         id="path3145"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 950.18672,351.12497 0,28.91971 4.53863,0 0,-28.91971 -4.53863,0 
z"
+         id="path3147"
+         style="fill:#70ad47;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 953.29999,351.12497 0,28.91971 92.68561,0 0,-28.91971 
-92.68561,0 z"
+         id="path3149"
+         style="fill:#ffc000;fill-opacity:1;fill-rule:evenodd;stroke:none" />
+      <path
+         d="m 318.60437,329.70714 0,11.2528 -2.8132,0 0,-11.2528 2.8132,0 z m 
0,19.6924 0,11.25281 -2.8132,0 0,-11.25281 2.8132,0 z m 0,19.69241 0,11.2528 
-2.8132,0 0,-11.2528 2.8132,0 z m 0,19.69241 0,11.2528 -2.8132,0 0,-11.2528 
2.8132,0 z m 0,19.6924 0,7.14553 -2.8132,0 0,-7.14553 2.8132,0 z"
+         id="path3151"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.01875467px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 530.0633,329.70714 0,11.2528 -2.8132,0 0,-11.2528 2.8132,0 z m 
0,19.6924 0,11.25281 -2.8132,0 0,-11.25281 2.8132,0 z m 0,19.69241 0,11.2528 
-2.8132,0 0,-11.2528 2.8132,0 z m 0,19.69241 0,11.2528 -2.8132,0 0,-11.2528 
2.8132,0 z m 0,19.6924 0,7.14553 -2.8132,0 0,-7.14553 2.8132,0 z"
+         id="path3153"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.01875467px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 741.37219,329.70714 0,11.2528 -2.8132,0 0,-11.2528 2.8132,0 z m 
0,19.6924 0,11.25281 -2.8132,0 0,-11.25281 2.8132,0 z m 0,19.69241 0,11.2528 
-2.8132,0 0,-11.2528 2.8132,0 z m 0,19.69241 0,11.2528 -2.8132,0 0,-11.2528 
2.8132,0 z m 0,19.6924 0,7.16429 -2.8132,0 0,-7.16429 2.8132,0 z"
+         id="path3155"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <path
+         d="m 952.66233,329.70714 0,11.2528 -2.8132,0 0,-11.2528 2.8132,0 z m 
0,19.6924 0,11.25281 -2.8132,0 0,-11.25281 2.8132,0 z m 0,19.69241 0,11.2528 
-2.8132,0 0,-11.2528 2.8132,0 z m 0,19.69241 0,11.2528 -2.8132,0 0,-11.2528 
2.8132,0 z m 0,19.6924 0,7.16429 -2.8132,0 0,-7.16429 2.8132,0 z"
+         id="path3157"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="148.77711"
+         y="842.66461"
+         id="text3159"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Checkpoint
 completes late</text>
+      <path
+         d="m 287.35909,819.16657 26.78167,-53.8259 -2.51313,-1.27532 
-26.78167,53.82591 2.51313,1.27531 z m 28.69465,-51.31278 -0.0375,-9.45235 
-7.53938,5.70142 7.57689,3.75093 z"
+         id="path3161"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="395.0498"
+         y="800.99463"
+         id="text3163"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Min.
 </text>
+      <text
+         x="438.56064"
+         y="800.99463"
+         id="text3165"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">t</text>
+      <text
+         x="444.11203"
+         y="800.99463"
+         id="text3167"
+         xml:space="preserve"
+         
style="font-size:19.95497131px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">ime
 between checkpoints</text>
+      <path
+         d="m 387.20896,784.77051 -9.48986,-10.09002 2.06301,-1.95048 
9.48987,10.12752 -2.06302,1.91298 z m -10.57763,-7.16429 -2.70067,-9.03975 
8.8522,3.26331 -6.15153,5.77644 z"
+         id="path3169"
+         
style="fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:#000000;stroke-width:0.03750934px;stroke-linecap:butt;stroke-linejoin:round;stroke-opacity:1;stroke-dasharray:none"
 />
+      <text
+         x="441.10461"
+         y="276.30859"
+         id="text3171"
+         xml:space="preserve"
+         
style="font-size:25.05624199px;font-style:normal;font-weight:bold;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Degraded
 Operation</text>
+      <text
+         x="431.65225"
+         y="303.91547"
+         id="text3173"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">(constantly
 </text>
+      <text
+         x="546.73096"
+         y="303.91547"
+         id="text3175"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">checkpointing</text>
+      <text
+         x="683.86511"
+         y="303.91547"
+         id="text3177"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">)</text>
+      <text
+         x="423.17554"
+         y="599.56708"
+         id="text3179"
+         xml:space="preserve"
+         
style="font-size:25.05624199px;font-style:normal;font-weight:bold;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">Guaranteeing
 Progress</text>
+      <text
+         x="375.76373"
+         y="627.17395"
+         id="text3181"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">(ensuring
 some checkpoint</text>
+      <text
+         x="644.33063"
+         y="627.17395"
+         id="text3183"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">-</text>
+      <text
+         x="651.83246"
+         y="627.17395"
+         id="text3185"
+         xml:space="preserve"
+         
style="font-size:22.5056076px;font-style:normal;font-weight:normal;text-align:start;text-anchor:start;fill:#000000;font-family:Arial">free
 time)</text>
+      <path
+         d="m 422.24269,722.16741 c 0,9.18979 -11.36533,16.61664 
-25.35632,16.61664 l 0,0 c -13.99098,0 -25.35631,7.46436 -25.35631,16.65415 
0,-9.18979 -11.32782,-16.65415 -25.31881,-16.65415 l -4.76369,0 c -13.99098,0 
-25.3188,-7.42685 -25.3188,-16.61664"
+         id="path3187"
+         
style="fill:none;stroke:#000000;stroke-width:1.87546718px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
 />
+    </g>
+  </g>
+</svg>

http://git-wip-us.apache.org/repos/asf/flink/blob/392b2e9a/docs/monitoring/large_state_tuning.md
----------------------------------------------------------------------
diff --git a/docs/monitoring/large_state_tuning.md 
b/docs/monitoring/large_state_tuning.md
index c49c106..053efd6 100644
--- a/docs/monitoring/large_state_tuning.md
+++ b/docs/monitoring/large_state_tuning.md
@@ -22,41 +22,196 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-This page gives a guide how to improve and tune applications that use large state.
+This page gives a guide on how to configure and tune applications that use large state.
 
 * ToC
 {:toc}
 
+## Overview
+
+For Flink applications to run reliably at large scale, two conditions must be fulfilled:
+
+  - The application needs to be able to take checkpoints reliably
+
+  - The resources need to be sufficient to catch up with the input data streams after a failure
+
+The first sections discuss how to get well-performing checkpoints at scale.
+The last section explains some best practices for planning how many resources to use.
+
+
 ## Monitoring State and Checkpoints
 
-  - Checkpoint statistics overview
-  - Interpret time until checkpoints
-  - Synchronous vs. asynchronous checkpoint time
+The easiest way to monitor checkpoint behavior is via the UI's checkpoint 
section. The documentation
+for [checkpoint monitoring](checkpoint_monitoring.html) shows how to access 
the available checkpoint
+metrics.
+
+The two numbers that are of particular interest when scaling up checkpoints 
are:
+
+  - The time until operators start their checkpoint: This time is currently 
not exposed directly, but corresponds
+    to:
+    
+    `checkpoint_start_delay = end_to_end_duration - synchronous_duration - 
asynchronous_duration`
+
+    When the time to trigger the checkpoint is constantly very high, it means that the *checkpoint barriers* need a long
+    time to travel from the source to the operators. That typically indicates that the system is operating under
+    constant backpressure.
+
+  - The amount of data buffered during alignments. For exactly-once semantics, 
Flink *aligns* the streams at
+    operators that receive multiple input streams, buffering some data for 
that alignment.
+    The buffered data volume is ideally low. Higher amounts mean that checkpoint barriers are received at
+    very different times from the different input streams.
+
+Note that the numbers indicated here can occasionally be high in the presence of transient backpressure, data skew,
+or network issues. However, if the numbers are constantly very high, it means that Flink puts many resources into checkpointing.
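
As a small worked example of the `checkpoint_start_delay` formula above, the following sketch derives the delay from the three durations shown in the UI. The class name, method name, and the sample numbers are illustrative assumptions, not part of Flink.

```java
public class CheckpointStartDelay {

    /**
     * Applies the formula from the text (all values in milliseconds):
     * checkpoint_start_delay = end_to_end_duration - synchronous_duration - asynchronous_duration
     */
    static long checkpointStartDelay(long endToEndDuration,
                                     long synchronousDuration,
                                     long asynchronousDuration) {
        return endToEndDuration - synchronousDuration - asynchronousDuration;
    }

    public static void main(String[] args) {
        // Hypothetical values, as they might be read off the checkpoint UI for one subtask.
        long delay = checkpointStartDelay(12_000, 500, 3_500);
        System.out.println("checkpoint_start_delay = " + delay + " ms");
    }
}
```

A delay that stays this large across many checkpoints would point towards backpressure rather than slow state snapshots.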
+
 
 ## Tuning Checkpointing
 
-  - Checkpoint interval
-  - Getting work done between checkpoints (min time between checkpoints)
+Checkpoints are triggered at regular intervals that applications can 
configure. When a checkpoint takes longer
+to complete than the checkpoint interval, the next checkpoint is not triggered 
before the in-progress checkpoint
+completes. By default the next checkpoint will then be triggered immediately 
once the ongoing checkpoint completes.
+
+When checkpoints end up frequently taking longer than the base interval (for 
example because state
+grew larger than planned, or the storage where checkpoints are stored is 
temporarily slow),
+the system is constantly taking checkpoints (new ones are started immediately once ongoing ones finish).
+That can mean that too many resources are constantly tied up in checkpointing 
and that the operators make too
+little progress. This behavior has less impact on streaming applications that 
use asynchronously checkpointed state,
+but may still have an impact on overall application performance.
+
+To prevent such a situation, applications can define a *minimum duration 
between checkpoints*:
+
+`StreamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds)`
+
+This duration is the minimum time interval that must pass between the end of 
the latest checkpoint and the beginning
+of the next. The figure below illustrates how this impacts checkpointing.
+
+<img src="../fig/checkpoint_tuning.svg" class="center" width="80%" alt="Illustration of how the minimum-time-between-checkpoints parameter affects checkpointing behavior."/>
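
The effect of the minimum pause can be sketched with a simplified model. It is an assumption made for illustration, not Flink's actual scheduler: the next checkpoint starts at the later of "previous start plus interval" and "previous end plus minimum pause". The class `MinPauseModel` and the durations are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MinPauseModel {

    /**
     * Computes checkpoint start times (ms) under a simplified model:
     * nextStart = max(previousStart + interval, previousEnd + minPause).
     */
    static List<Long> startTimes(long interval, long minPause, long[] durations) {
        List<Long> starts = new ArrayList<>();
        long start = 0;
        for (long duration : durations) {
            starts.add(start);
            long end = start + duration;
            start = Math.max(start + interval, end + minPause);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical checkpoint durations; two of them exceed the 10 s interval.
        long[] durations = {4_000, 12_000, 15_000, 5_000};

        // Without a minimum pause, slow checkpoints are followed immediately by the next one.
        System.out.println(startTimes(10_000, 0, durations));      // [0, 10000, 22000, 37000]

        // With a 5 s minimum pause, operators get at least 5 s without checkpointing.
        System.out.println(startTimes(10_000, 5_000, durations));  // [0, 10000, 27000, 47000]
    }
}
```

In this model the pause only changes the schedule when a checkpoint runs long; fast checkpoints still follow the base interval.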
+
+*Note:* Applications can be configured (via the `CheckpointConfig`) to allow multiple checkpoints to be in progress at
+the same time. For applications with large state in Flink, this often ties up too many resources in checkpointing.
+When a savepoint is manually triggered, it may be in progress concurrently with an ongoing checkpoint.
+
 
 ## Tuning Network Buffers
 
-  - getting a good number of buffers to use
-  - monitoring if too many buffers cause too much inflight data
+The number of network buffers is a parameter that can currently have an effect 
on checkpointing at large scale.
+The Flink community is working on eliminating that parameter in the next 
versions of Flink.
+
+The number of network buffers defines how much data a TaskManager can hold 
in-flight before back-pressure kicks in.
+A very high number of network buffers means that a lot of data may be in the 
stream network channels when a checkpoint
+is started. Because the checkpoint barriers travel with that data (see 
[description of how checkpointing 
works](../internals/stream_checkpointing.html)),
+a lot of in-flight data means that the barriers have to wait for that data to 
be transported/processed before arriving
+at the target operator.
+
+Having a lot of data in-flight also does not speed up the data processing as a 
whole. It only means that data is picked up faster
+from the data source (log, files, message queue) and buffered longer in Flink. Having fewer network buffers means that
+data is picked up from the source closer to the time it is actually processed, which is generally desirable.
+The number of network buffers should hence not be set arbitrarily large, but 
to a low multiple (such as 2x) of the
+minimum number of required buffers.
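
To get a feeling for how the buffer count relates to in-flight data, the following back-of-the-envelope sketch computes an upper bound on the buffered volume and the time a barrier might wait behind it. The 32 KB buffer size, the buffer counts, and the throughput are assumptions chosen for illustration; check the actual values of your setup.

```java
public class InFlightData {

    /** Upper bound on in-flight bytes if every network buffer is full. */
    static long maxInFlightBytes(int numBuffers, int bufferSizeBytes) {
        return (long) numBuffers * bufferSizeBytes;
    }

    /** Seconds needed to process the given volume at a constant throughput. */
    static double drainSeconds(long inFlightBytes, long bytesPerSecond) {
        return (double) inFlightBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        int bufferSize = 32 * 1024;              // assumed buffer size: 32 KB
        long throughput = 16L * 1024 * 1024;     // assumed processing rate: 16 MB/s

        long moderate = maxInFlightBytes(2_048, bufferSize);   // 64 MB
        long veryHigh = maxInFlightBytes(65_536, bufferSize);  // 2 GB

        // A barrier queued behind full buffers waits roughly this long in the worst case.
        System.out.println(drainSeconds(moderate, throughput) + " s");
        System.out.println(drainSeconds(veryHigh, throughput) + " s");
    }
}
```

Under these assumptions the larger setting lets barriers queue behind up to 32 times more data without making the job process records any faster.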
+
+
+## Make State Checkpointing Asynchronous Where Possible
 
-## Make checkpointing asynchronous where possible
+When state is *asynchronously* snapshotted, the checkpoints scale better than 
when the state is *synchronously* snapshotted.
+Especially in more complex streaming applications with multiple joins, 
Co-functions, or windows, this may have a profound
+impact.
 
-  - large state should be on keyed state, not operator state, because keyed 
state is managed, operator state not (subject to change in future versions)
+To get state to be snapshotted asynchronously, applications have to do two 
things:
+
+  1. Use state that is [managed by Flink](../dev/stream/state.html): Managed 
state means that Flink provides the data
+     structure in which the state is stored. Currently, this is true for 
*keyed state*, which is abstracted behind the
+     interfaces like `ValueState`, `ListState`, `ReducingState`, ...
+
+  2. Use a state backend that supports asynchronous snapshots. In Flink 1.2, 
only the RocksDB state backend uses
+     fully asynchronous snapshots.
+
+The above two points imply that (in Flink 1.2) large state should generally be 
kept as keyed state, not as operator state.
+This is subject to change with the planned introduction of *managed operator 
state*.
 
-  - asynchronous snapshots preferrable. long synchronous snapshot times can 
cause problems on large state and complex topogies. move to RocksDB for that
 
 ## Tuning RocksDB
 
-  - Predefined options
-  - Custom Options
+The state storage workhorse of many large scale Flink streaming applications 
is the *RocksDB State Backend*.
+The backend scales well beyond main memory and reliably stores large [keyed 
state](../dev/stream/state.html).
+
+Unfortunately, RocksDB's performance can vary with configuration, and there is 
little documentation on how to tune
+RocksDB properly. For example, the default configuration is tailored towards SSDs and performs suboptimally
+on spinning disks.
+
+**Passing Options to RocksDB**
+
+{% highlight java %}
+RocksDBStateBackend.setOptions(new MyOptions());
+
+public class MyOptions implements OptionsFactory {
+
+    @Override
+    public DBOptions createDBOptions() {
+        return new DBOptions()
+            .setIncreaseParallelism(4)
+            .setUseFsync(false)
+            .setDisableDataSync(true);
+    }
+
+    @Override
+    public ColumnFamilyOptions createColumnOptions() {
+
+        return new ColumnFamilyOptions()
+            .setTableFormatConfig(
+                new BlockBasedTableConfig()
+                    .setBlockCacheSize(256 * 1024 * 1024)  // 256 MB
+                    .setBlockSize(128 * 1024));            // 128 KB
+    }
+}
+{% endhighlight %}
+
+**Predefined Options**
+
+Flink provides some predefined collections of options for RocksDB for different settings, which can be set for example via
+`RocksDBStateBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)`.
+
+We expect to accumulate more such profiles over time. Feel free to contribute such predefined option profiles when you
+find a set of options that works well and seems representative for certain workloads.
+
+**Important:** RocksDB is a native library that allocates memory not from the JVM, but directly from the process's
+native memory. Any memory you assign to RocksDB has to be accounted for, typically by decreasing the JVM heap size
+of the TaskManagers by the same amount. Not doing so may result in YARN/Mesos/etc. terminating the JVM processes for
+allocating more memory than configured.
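
The accounting described above can be sketched as simple arithmetic. The per-instance estimate (block cache plus write buffers) and all numbers below are assumptions for illustration. The real native memory footprint depends on the RocksDB options in use and on how many RocksDB instances run in one TaskManager.

```java
public class RocksDbMemoryBudget {

    /** Rough native memory estimate in MB: instances * (block cache + write buffers). */
    static long rocksDbMb(int instances, long blockCacheMb, long writeBuffersMb) {
        return instances * (blockCacheMb + writeBuffersMb);
    }

    /** JVM heap size in MB after subtracting the RocksDB budget from the container size. */
    static long heapSizeMb(long containerMb, long rocksDbMb) {
        if (rocksDbMb >= containerMb) {
            throw new IllegalArgumentException("RocksDB budget does not fit into the container");
        }
        return containerMb - rocksDbMb;
    }

    public static void main(String[] args) {
        // Hypothetical setup: 8 GB container, 4 RocksDB instances,
        // each with a 256 MB block cache and 128 MB of write buffers.
        long nativeMb = rocksDbMb(4, 256, 128);
        long heapMb = heapSizeMb(8_192, nativeMb);
        System.out.println("RocksDB: " + nativeMb + " MB, JVM heap: " + heapMb + " MB");
    }
}
```

In practice some additional headroom for other native allocations is usually sensible; the sketch leaves that out to stay close to the rule stated above.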
+
+
+## Capacity Planning
+
+This section discusses how to decide how many resources should be used for a 
Flink job to run reliably.
+The basic rules of thumb for capacity planning are:
+
+  - Normal operation should have enough capacity to not operate under constant 
*back pressure*.
+    See [back pressure monitoring](back_pressure.html) for details on how to 
check whether the application runs under back pressure.
+
+  - Provision some extra resources on top of the resources needed to run the 
program back-pressure-free during failure-free time.
+    These resources are needed to "catch up" with the input data that 
accumulated during the time the application
+    was recovering.
+    How much that should be depends on how long recovery operations usually take (which depends on the size of the state
+    that needs to be loaded into the new TaskManagers on a failover) and how quickly the scenario requires the application to recover from failures.
+
+    *Important*: The baseline should be established with checkpointing activated, because checkpointing ties up
+    some amount of resources (such as network bandwidth).
+
+  - Temporary back pressure is usually okay, and an essential part of 
execution flow control during load spikes,
+    during catch-up phases, or when external systems (that are written to in a 
sink) exhibit temporary slowdown.
+
+  - Certain operations (like large windows) result in a spiky load for their 
downstream operators: 
+    In the case of windows, the downstream operators may have little to do while the window is being built,
+    and have a lot to do when the windows are emitted.
+    The planning for the downstream parallelism needs to take into account how 
much the windows emit and how
+    fast such a spike needs to be processed.
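
The "catch up" rule of thumb from the list above can be made concrete with a small calculation. The sketch assumes a constant input rate and a constant maximum processing rate: the backlog that builds up during the outage is `inputRate * downtime`, and after recovery it shrinks at `capacity - inputRate`. All names and numbers are illustrative.

```java
public class CatchUpEstimate {

    /**
     * Seconds needed to work off the backlog accumulated during an outage,
     * assuming constant rates: backlog / (capacity - inputRate).
     */
    static double catchUpSeconds(double inputRate, double capacity, double downtimeSeconds) {
        if (capacity <= inputRate) {
            throw new IllegalArgumentException("No spare capacity: the job can never catch up");
        }
        double backlog = inputRate * downtimeSeconds;
        return backlog / (capacity - inputRate);
    }

    public static void main(String[] args) {
        // Hypothetical job: 100k records/s input, 125k records/s capacity (25% headroom),
        // 120 s during which no input was processed (recovery plus reprocessing since
        // the last checkpoint).
        System.out.println(catchUpSeconds(100_000, 125_000, 120) + " s");
    }
}
```

With 25% headroom the job needs four times the outage duration to catch up; doubling the headroom to 50% would cut that to twice the outage duration.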
+
+**Important:** In order to allow for adding resources later, make sure to set the *maximum parallelism* of the
+data stream program to a reasonable number. The maximum parallelism defines how high you can set the program's
+parallelism when re-scaling the program (via a savepoint).
 
-## Capacity planning
+Flink's internal bookkeeping tracks parallel state in the granularity of 
max-parallelism-many *key groups*.
+Flink's design strives to make it efficient to have a very high value for the 
maximum parallelism, even if
+executing the program with a low parallelism. 
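
To illustrate why the maximum parallelism bounds re-scaling, here is a minimal sketch of how keys might map to key groups and key groups to parallel operator instances. It is a simplification under stated assumptions: the key's plain `hashCode()` is used here, whereas the actual implementation may hash keys differently, and the class and method names are hypothetical.

```java
public class KeyGroupSketch {

    /** Maps a key to one of maxParallelism key groups (simplified: plain hashCode). */
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Assigns a key group to a parallel instance; each instance owns a contiguous range. */
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int keyGroup = keyGroupFor("user-42", maxParallelism);

        // The key group of a key never changes; only its owner changes when re-scaling.
        System.out.println("key group " + keyGroup
            + " -> instance " + operatorIndexFor(keyGroup, maxParallelism, 2) + " at parallelism 2"
            + ", instance " + operatorIndexFor(keyGroup, maxParallelism, 4) + " at parallelism 4");
    }
}
```

Because state is tracked per key group, re-scaling moves whole key groups between instances, and the parallelism cannot exceed the number of key groups.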
 
-  - Normal operation should not be constantly back pressured (link to back 
pressure monitor)
-  - Allow for some excess capacity to support catch-up in case of failures and 
checkpoint alignment skew (due to data skew or bad nodes)
 
 
