infoverload commented on a change in pull request #17595:
URL: https://github.com/apache/flink/pull/17595#discussion_r738520254



##########
File path: docs/content/docs/ops/state/checkpoints_backpressure.md
##########
@@ -23,7 +24,40 @@ KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
-# Unaligned checkpoints
+# Checkpointing under backpressure
+
+Normally aligned checkpointing time is dominated by the synchronous and asynchronous parts of the
+checkpointing process. However, when Flink job is running under a heavy backpressure, the dominant
+factor in the end to end time of a checkpoint can be the time to propagate checkpoint barriers to
+all operators/subtasks (why this is the case is explained in the overview of the
+[checkpointing process]({{< ref "docs/concepts/stateful-stream-processing" >}}#checkpointing)).
+This can be observed by high
+[alignment time and start delay metrics]({{< ref "docs/ops/monitoring/checkpoint_monitoring" >}}#history-tab).
+When this happens and becomes an issue there are basically three ways to address this problem:
+1. Remove the source of the backpressure, by either optimising the Flink job, adjusting Flink or JVM configuration or simply by scaling up.
+2. Reduce an amount of the buffered in-flight data in the Flink job.
+3. Enable unaligned checkpoints.
+
+Note that those options are not mutually exclusive, and you can combine them together. This document
+focuses on the latter two options.

Review comment:
       ```suggestion
   Normally, aligned checkpointing time is dominated by the synchronous and asynchronous parts of the
   checkpointing process. However, when a Flink job is running under heavy backpressure, the dominant
   factor in the end-to-end time of a checkpoint can be the time to propagate checkpoint barriers to
   all operators/subtasks. This is explained in the overview of the
   [checkpointing process]({{< ref "docs/concepts/stateful-stream-processing" >}}#checkpointing)
   and can be observed by high
   [alignment time and start delay metrics]({{< ref "docs/ops/monitoring/checkpoint_monitoring" >}}#history-tab).
   When this happens and becomes an issue, there are three ways to address the problem:
   1. Remove the backpressure source by optimizing the Flink job, by adjusting Flink or JVM configurations, or by scaling up.
   2. Reduce the amount of buffered in-flight data in the Flink job.
   3. Enable unaligned checkpoints.
   
   These options are not mutually exclusive and can be combined. This document
   focuses on the latter two options.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

