[GitHub] [flink] AHeise commented on a change in pull request #12722: [FLINK-18064][docs] Added unaligned checkpointing to docs.

GitBox Mon, 22 Jun 2020 04:56:24 -0700


AHeise commented on a change in pull request #12722:
URL: https://github.com/apache/flink/pull/12722#discussion_r443499800




##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -123,7 +123,7 @@ provided by Flink's connectors.
 snapshots, we use the words *snapshot* and *checkpoint* interchangeably. Often
 we also use the term *snapshot* to mean either *checkpoint* or *savepoint*.
 
-### Checkpointing
+### Aligned Checkpointing
 
 The central part of Flink's fault tolerance mechanism is drawing consistent

Review comment:
       Good question. I briefly looked at it and it seems as if you are right. 
I pinged @StephanEwen for clarification and he responded that it is indeed 
closer, but there are still some difference:
   
   > Flink checkpoints start at sources, propagate through a DAG and persist 
in-flight as needed
   Chandy-Lamport assumes no DAG, starts everywhere at the same time (think: 
RPC goes to every operators) and all operators log until they saw all markers.
   
   I'd probably leave the reference as is and add a clarifying line to 
unaligned section.

##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
 See [Restart Strategies]({% link dev/task_failure_recovery.md
 %}#restart-strategies) for more information.
 
+### Unaligned Checkpointing
+
+Starting with Flink 1.11, checkpointing can also be performed unaligned.
+The basic idea is that checkpoints can overtake all in-flight data as long as 
+the in-flight data becomes part of the operator state.
+
+<div style="text-align: center">
+  <img src="{% link fig/stream_unaligning.svg %}" alt="Unaligned 
checkpointing" style="width:100%; padding-top:10px; padding-bottom:10px;" />
+</div>
+
+The figure depicts how an operator handle unaligned checkpoint barriers:
+
+- The operator reacts on the first barrier that is stored in input buffers.
+- It immediately forwards the barrier to the downstream operator by adding it 
+  to the end of the output buffers.
+- The operator marks all overtaken buffers, which are asynchronously stored in
+  the state backend together with the other operator state.
+- It only briefly stops the processing of input to mark the buffers, forward 
+  the barrier and create the snapshot of the other state.

Review comment:
       Yes, I have removed the bullet and made it a summarizing sentence.

##########
File path: docs/concepts/stateful-stream-processing.md
##########
@@ -242,6 +246,42 @@ updates to that state.
 See [Restart Strategies]({% link dev/task_failure_recovery.md
 %}#restart-strategies) for more information.
 
+### Unaligned Checkpointing

Review comment:
       My understanding was that this document is rather an expanded glossary 
and talks about the concept and not the implementation. Thus, I'd leave the 
implementation state out of this place. The ops link will directly say that 
it's experimental in 1.11. 

##########
File path: docs/ops/state/checkpoints.md
##########
@@ -113,4 +113,50 @@ above).
 $ bin/flink run -s :checkpointMetaDataPath [:runArgs]
 {% endhighlight %}
 
+### Unaligned checkpoints
+
+Starting with Flink 1.11, checkpoints can be unaligned (experimental). 
+[Unaligned checkpoints]({% link concepts/stateful-stream-processing.md
+%}#unaligned-checkpointing) contain in-flight data (i.e., data stored in
+buffers) as part of the checkpoint state, which allows checkpoint barriers to
+overtake these buffers. Thus, the checkpoint duration becomes independent of 
the
+current throughput as checkpoint barriers are effectively not embedded into 
+the stream of data anymore.
+
+You should use unaligned checkpoints if your checkpointing durations are very
+high due to back-pressure. Then, checkpointing time becomes mostly
+independent of the end-to-end latency. Be aware unaligned checkpointing
+adds to I/O to the state backends, so you shouldn't use it when the I/O to
+the state backend is actually the bottleneck during checkpointing.
+
+We flagged unaligned checkpoints as experimental as it currently has the
+following shortcomings:
+
+- You cannot rescale from unaligned checkpoints. You have to take a savepoint 
+before rescaling. Savepoints are always aligned independent of the alignment

Review comment:
       Can you change the job graph with current checkpoints? I was always 
assuming that you need savepoints.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] AHeise commented on a change in pull request #12722: [FLINK-18064][docs] Added unaligned checkpointing to docs.

Reply via email to