[GitHub] [flink] AHeise commented on a change in pull request #12722: [FLINK-18064][docs] Added unaligned checkpointing to docs.

GitBox Mon, 22 Jun 2020 13:45:03 -0700


AHeise commented on a change in pull request #12722:
URL: https://github.com/apache/flink/pull/12722#discussion_r443815132




##########
File path: docs/ops/state/checkpoints.md
##########
@@ -113,4 +113,50 @@ above).
 $ bin/flink run -s :checkpointMetaDataPath [:runArgs]
 {% endhighlight %}
 
+### Unaligned checkpoints
+
+Starting with Flink 1.11, checkpoints can be unaligned (experimental). 
+[Unaligned checkpoints]({% link concepts/stateful-stream-processing.md
+%}#unaligned-checkpointing) contain in-flight data (i.e., data stored in
+buffers) as part of the checkpoint state, which allows checkpoint barriers to
+overtake these buffers. Thus, the checkpoint duration becomes independent of the
+current throughput as checkpoint barriers are effectively not embedded into 
+the stream of data anymore.
+
+You should use unaligned checkpoints if your checkpointing durations are very
+high due to backpressure. Then, checkpointing time becomes mostly
+independent of the end-to-end latency. Be aware unaligned checkpointing
+adds to I/O to the state backends, so you shouldn't use it when the I/O to
+the state backend is actually the bottleneck during checkpointing.
+
+We flagged unaligned checkpoints as experimental as it currently has the
+following limitations:
+
+- You cannot rescale from unaligned checkpoints. You have to take a savepoint 
+before rescaling. Savepoints are always aligned independent of the alignment
+setting of checkpoints.
+- Flink currently does not support concurrent unaligned checkpoints. However, 
+due to the more predictable and shorter checkpointing times, concurrent 
+checkpoints might not be needed at all.
+- Unaligned checkpoints may produce incorrect results for the following reasons:
+
+Currently, Flink generates the watermark as a first step of recovery instead of
+storing the latest watermark in the operators to ease rescaling. In unaligned 
+checkpoints, that means on recovery, **Flink generates watermarks after it 
+restores in-flight data**. If your pipeline uses an **operator that applies the
+latest watermark on each record**, it will produce **incorrect results** during
+recovery if the watermark is not directly or indirectly part of the operator 
+state. Thus, **SQL OVER operator should not be used with unaligned
+checkpoints**, while window operators are safe to use. The workaround is to
+store the watermark in the operator state. If rescaling may occur, watermarks
+should be stored per key-group in a union-state. We mostly likely will
+implement this approach as a general solution (didn't make it into Flink 
+1.11.0).

Review comment:
I can tone it down, but basically we are breaking with the old assumption that watermarks don't need to be stored at the operator because they are sent first.

I'm especially referring to the OverITCases, which use a weird way to inject watermarks and logically should persist them. But now that I'm thinking about it, it's more a matter of the test setup itself.
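
To make the broken assumption concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: `StampingOperator` and `recover` are hypothetical names, nothing here is Flink API, and the operator is a stand-in for anything that applies the latest watermark to each record. The recovery order follows the doc text above (in-flight data is restored first, watermarks are generated afterwards).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: none of these types are Flink classes. */
class WatermarkRecoveryDemo {

    /** Hypothetical operator that tags every record with the latest watermark it has seen. */
    static final class StampingOperator {
        private long currentWatermark = Long.MIN_VALUE;

        void processWatermark(long watermark) {
            // Watermarks never move backwards.
            currentWatermark = Math.max(currentWatermark, watermark);
        }

        String processRecord(String record) {
            return record + "@" + currentWatermark;
        }

        /** Workaround: restore a watermark that was persisted as part of the operator state. */
        void restoreWatermark(long watermark) {
            currentWatermark = watermark;
        }
    }

    /**
     * Simulates recovery from an unaligned checkpoint: the in-flight records are
     * replayed first, and only afterwards is a watermark generated again.
     *
     * @param persistedWatermark watermark stored in operator state, or null if none was stored
     */
    static List<String> recover(List<String> inFlight, Long persistedWatermark, long regeneratedWatermark) {
        StampingOperator operator = new StampingOperator();
        if (persistedWatermark != null) {
            operator.restoreWatermark(persistedWatermark);
        }
        List<String> output = new ArrayList<>();
        for (String record : inFlight) {
            output.add(operator.processRecord(record));
        }
        // The regenerated watermark arrives too late for the replayed records.
        operator.processWatermark(regeneratedWatermark);
        return output;
    }

    public static void main(String[] args) {
        List<String> inFlight = Arrays.asList("a", "b");
        // No watermark in state: replayed records see the initial watermark.
        System.out.println(recover(inFlight, null, 100L));
        // Watermark stored in state: replayed records see the pre-failure watermark.
        System.out.println(recover(inFlight, 100L, 100L));
    }
}
```

Without a persisted watermark the replayed records are stamped with `Long.MIN_VALUE`, whereas before the failure they would have been stamped with 100. With the watermark in state they come out the same as before the failure. The sketch ignores rescaling, so it says nothing about the per-key-group union-state part.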




