rkhachatryan commented on a change in pull request #12722:
URL: https://github.com/apache/flink/pull/12722#discussion_r443390884
##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
See [Restart Strategies]({% link dev/task_failure_recovery.md
%}#restart-strategies) for more information.
+### Unaligned Checkpointing
+
+Starting with Flink 1.11, checkpointing can also be performed unaligned.
+The basic idea is that checkpoints can overtake all in-flight data as long as
+the in-flight data becomes part of the operator state.
+
+<div style="text-align: center">
+ <img src="{% link fig/stream_unaligning.svg %}" alt="Unaligned
checkpointing" style="width:100%; padding-top:10px; padding-bottom:10px;" />
+</div>
+
+The figure depicts how an operator handle unaligned checkpoint barriers:
+
+- The operator reacts on the first barrier that is stored in input buffers.
+- It immediately forwards the barrier to the downstream operator by adding it
+ to the end of the output buffers.
+- The operator marks all overtaken buffers, which are asynchronously stored in
+ the state backend together with the other operator state.
Review comment:
```suggestion
- The operator marks all overtaken records to be stored asynchronously and
creates a snapshot of its own state.
```
##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -123,7 +123,7 @@ provided by Flink's connectors.
snapshots, we use the words *snapshot* and *checkpoint* interchangeably. Often
we also use the term *snapshot* to mean either *checkpoint* or *savepoint*.
-### Checkpointing
+### Aligned Checkpointing
The central part of Flink's fault tolerance mechanism is drawing consistent
Review comment:
To my understanding, there are were papers: one by Lamport and one by
[DataArtisans](https://arxiv.org/pdf/1506.08603.pdf).
One of the differences is that the former proposed to persist channel state,
while the latter proposed alignment (I might be wrong).
So it's probably better to move reference to the paper by Lamport to
Unaligned section and mention "Flink" paper in this Aligned section.
##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
See [Restart Strategies]({% link dev/task_failure_recovery.md
%}#restart-strategies) for more information.
+### Unaligned Checkpointing
+
+Starting with Flink 1.11, checkpointing can also be performed unaligned.
+The basic idea is that checkpoints can overtake all in-flight data as long as
+the in-flight data becomes part of the operator state.
+
+<div style="text-align: center">
+ <img src="{% link fig/stream_unaligning.svg %}" alt="Unaligned
checkpointing" style="width:100%; padding-top:10px; padding-bottom:10px;" />
+</div>
+
+The figure depicts how an operator handle unaligned checkpoint barriers:
+
+- The operator reacts on the first barrier that is stored in input buffers.
Review comment:
```suggestion
- The operator reacts on the first barrier that is stored in its input
buffers.
```
##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
See [Restart Strategies]({% link dev/task_failure_recovery.md
%}#restart-strategies) for more information.
+### Unaligned Checkpointing
+
+Starting with Flink 1.11, checkpointing can also be performed unaligned.
+The basic idea is that checkpoints can overtake all in-flight data as long as
+the in-flight data becomes part of the operator state.
+
+<div style="text-align: center">
+ <img src="{% link fig/stream_unaligning.svg %}" alt="Unaligned
checkpointing" style="width:100%; padding-top:10px; padding-bottom:10px;" />
+</div>
+
+The figure depicts how an operator handle unaligned checkpoint barriers:
+
+- The operator reacts on the first barrier that is stored in input buffers.
+- It immediately forwards the barrier to the downstream operator by adding it
+ to the end of the output buffers.
+- The operator marks all overtaken buffers, which are asynchronously stored in
+ the state backend together with the other operator state.
+- It only briefly stops the processing of input to mark the buffers, forward
+ the barrier and create the snapshot of the other state.
Review comment:
It reads like a separate step (it's not, right?)
```suggestion
Compared to aligned checkpoints, the operator doesn't need to suspend any of
its inputs.
```
##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
See [Restart Strategies]({% link dev/task_failure_recovery.md
%}#restart-strategies) for more information.
+### Unaligned Checkpointing
Review comment:
Should we mention that this is an experimental feature?
I think it should be a separate statement in the end of the section.
##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
See [Restart Strategies]({% link dev/task_failure_recovery.md
%}#restart-strategies) for more information.
+### Unaligned Checkpointing
+
+Starting with Flink 1.11, checkpointing can also be performed unaligned.
+The basic idea is that checkpoints can overtake all in-flight data as long as
+the in-flight data becomes part of the operator state.
+
+<div style="text-align: center">
+ <img src="{% link fig/stream_unaligning.svg %}" alt="Unaligned
checkpointing" style="width:100%; padding-top:10px; padding-bottom:10px;" />
+</div>
+
+The figure depicts how an operator handle unaligned checkpoint barriers:
+
+- The operator reacts on the first barrier that is stored in input buffers.
+- It immediately forwards the barrier to the downstream operator by adding it
+ to the end of the output buffers.
+- The operator marks all overtaken buffers, which are asynchronously stored in
+ the state backend together with the other operator state.
+- It only briefly stops the processing of input to mark the buffers, forward
+ the barrier and create the snapshot of the other state.
+
+Unaligned checkpointing ensures that barriers are arriving at the sink as fast
+as possible. It's especially suited for application with at least one slow
+moving data path, where alignment times can reach hours. However, since it's
+adding additional I/O pressure to state backends, it doesn't help when the I/O
+to the state backends is the bottleneck. See the more in-depth discussion in
+[ops]({% link ops/state/checkpoints.md %}#unaligned-checkpoints)
+for other limitations.
+
+Note that savepoints will always be aligned.
Review comment:
:+1:
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]