infoverload commented on a change in pull request #17595:
URL: https://github.com/apache/flink/pull/17595#discussion_r738520254
##########
File path: docs/content/docs/ops/state/checkpoints_backpressure.md
##########
@@ -23,7 +24,40 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
-# Unaligned checkpoints
+# Checkpointing under backpressure
+
+Normally aligned checkpointing time is dominated by the synchronous and
asynchronous parts of the
+checkpointing process. However, when Flink job is running under a heavy
backpressure, the dominant
+factor in the end to end time of a checkpoint can be the time to propagate
checkpoint barriers to
+all operators/subtasks (why this is the case is explained in the overview of
the
+[checkpointing process]({{< ref "docs/concepts/stateful-stream-processing"
>}}#checkpointing)).
+This can be observed by high
+[alignment time and start delay metrics]({{< ref
"docs/ops/monitoring/checkpoint_monitoring" >}}#history-tab).
+When this happens and becomes an issue there are basically three ways to
address this problem:
+1. Remove the source of the backpressure, by either optimising the Flink job,
adjusting Flink or JVM configuration or simply by scaling up.
+2. Reduce an amount of the buffered in-flight data in the Flink job.
+3. Enable unaligned checkpoints.
+
+Note that those options are not mutually exclusive, and you can combine them
together. This document
+focuses on the latter two options.
Review comment:
```suggestion
Normally, aligned checkpointing time is dominated by the synchronous and
asynchronous parts of the
checkpointing process. However, when a Flink job is running under heavy
backpressure, the dominant
factor in the end-to-end time of a checkpoint can be the time to propagate
checkpoint barriers to
all operators/subtasks. This is explained in the overview of the
[checkpointing process]({{< ref "docs/concepts/stateful-stream-processing"
>}}#checkpointing)
and can be observed by high
[alignment time and start delay metrics]({{< ref
"docs/ops/monitoring/checkpoint_monitoring" >}}#history-tab).
When this happens and becomes an issue, there are three ways to address the
problem:
1. Remove the backpressure source by optimizing the Flink job, by adjusting
Flink or JVM configurations, or by scaling up.
2. Reduce the amount of buffered in-flight data in the Flink job.
3. Enable unaligned checkpoints.
These options are not mutually exclusive and can be combined. This
document
focuses on the latter two options.
```
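For reference, options 2 and 3 can each be switched on from the Flink configuration. The fragment below is a minimal sketch, assuming the option names used around Flink 1.14; they are worth double-checking against the configuration reference for the targeted release.

```yaml
# flink-conf.yaml (sketch; option names assumed from Flink 1.14)

# Option 2: reduce buffered in-flight data by enabling buffer debloating
taskmanager.network.memory.buffer-debloat.enabled: true

# Option 3: enable unaligned checkpoints
execution.checkpointing.unaligned: true
```

As the text says, the two settings are not mutually exclusive, so both can be set in the same configuration.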