infoverload commented on code in PR #516:
URL: https://github.com/apache/flink-web/pull/516#discussion_r864938905
##########
_posts/2022-04-01-tidying-snapshots-up.md:
##########
@@ -0,0 +1,213 @@
+---
+layout: post
+title: "Tidying up snapshots"
+date: 2022-04-01T00:00:00.000Z
+authors:
+- dwysakowicz:
+ name: "Dawid Wysakowicz"
+ twitter: "dwysakowicz"
+
+excerpt: TODO
+
+---
+
+{% toc %}
+
+Over the years, Flink has become a well-established project in the data
streaming domain, and a
+mature project requires a slight shift of priorities from thinking purely
about new features
+towards caring more about stability and operational simplicity. The Flink
community has tried to address
+some known friction points over the last couple of releases, which includes
improvements to the
+snapshotting process.
+
+Flink 1.13 was the first release in which we announced [unaligned
checkpoints]({{site.DOCS_BASE_URL}}flink-docs-release-1.15/docs/concepts/stateful-stream-processing/#unaligned-checkpointing)
to be production-ready and
+encouraged people to use them if their jobs are backpressured to a point where
it causes issues for
+checkpoints. It was also the release where we [unified the binary format of
savepoints](/news/2021/05/03/release-1.13.0.html#switching-state-backend-with-savepoints)
across all
+different state backends, which allows for stateful switching of those. More
on that a bit later.
+
+The next release, 1.14, also brought additional improvements. As an alternative
and as a complement
+to unaligned checkpoints, we introduced a feature we called ["buffer
debloating"](/news/2021/09/29/release-1.14.0.html#buffer-debloating). It is
built
+around the concept of automatically adjusting the amount of in-flight data
that needs to be aligned
+while snapshotting. We also fixed a long-standing problem: from
1.14 onwards it is
+possible to [continue checkpointing even if there are finished
tasks](/news/2021/09/29/release-1.14.0.html#checkpointing-and-bounded-streams)
in one's job graph.
+
+The latest 1.15 release is no different: we still pay attention
to what makes it hard
+to operate a Flink cluster. In that release we tackled the problem that
savepoints can be expensive
+to take and restore from if taken for a very large state stored in the RocksDB
state backend. In
+order to circumvent the issue, we had seen users leverage externalized,
incremental checkpoints
+instead of savepoints in order to benefit from the native RocksDB format. To
make it more
+straightforward, we incorporated that approach and made it possible to take
savepoints in that
+native, state-backend-specific format, while still maintaining some savepoint
characteristics, which
+makes it possible to relocate such a savepoint.
+
+Another issue we've seen with externalized checkpoints is that it has not been
clear who owns the
+checkpoint files. This is especially problematic when it comes to incremental
RocksDB checkpoints,
+where you can easily end up in a situation where you do not know which checkpoints
depend on which files
+and are thus unable to clean those files up. To solve this issue we added
explicit restore
+modes:
+CLAIM, NO_CLAIM and LEGACY (for backwards compatibility), which clearly define
whether Flink should take
+care of cleaning up the snapshot or whether that remains the user's responsibility.
+
+# Restore mode
+
+The `Restore Mode` determines who takes ownership of the files that make up a
savepoint or an
+externalized checkpoint after restoring it. Snapshots, which are either
checkpoints or savepoints
+in this context, can be owned either by a user or Flink itself. If a snapshot
is owned by a user,
+Flink will not delete its files. What is even more important, Flink cannot
depend on the existence
+of the files from such a snapshot, as they might be deleted outside of Flink's
control.
+
+To begin with, let's see how it has looked so far and what problems that may pose.
We left the old
+behaviour, which is available if you pick the `LEGACY` mode.
Review Comment:
```suggestion
## The new restore modes
The `Restore Mode` determines who takes ownership of the files that make up
savepoints or
externalized checkpoints after they are restored. Snapshots, which are
either checkpoints or savepoints
in this context, can be owned either by a user or Flink itself. If a
snapshot is owned by a user,
Flink will not delete its files and will not depend on the existence
of such files, since they might be deleted outside of Flink's control.
Let's dig further into each of the modes.
```
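
It might also help readers to see the ownership rule from this paragraph written out. Below is a minimal sketch in Java; `RestoreModeSketch`, `Owner`, `ownerAfterRestore` and `flinkMayDelete` are hypothetical names made up for illustration and are not Flink's actual classes or API. It assumes CLAIM hands the snapshot to Flink, NO_CLAIM leaves it with the user, and LEGACY keeps the old, ambiguous behaviour.

```java
public class RestoreModeSketch {

    // Hypothetical model of the three modes named in the post; not Flink's own enum.
    enum RestoreMode { CLAIM, NO_CLAIM, LEGACY }

    // Who is responsible for the snapshot files after the restore.
    enum Owner { FLINK, USER, UNCLEAR }

    // Assumption: CLAIM transfers ownership to Flink, NO_CLAIM leaves it with
    // the user, and LEGACY keeps the old behaviour where ownership is unclear.
    static Owner ownerAfterRestore(RestoreMode mode) {
        switch (mode) {
            case CLAIM:
                return Owner.FLINK;
            case NO_CLAIM:
                return Owner.USER;
            default:
                return Owner.UNCLEAR;
        }
    }

    // Flink only cleans up snapshot files it owns.
    static boolean flinkMayDelete(RestoreMode mode) {
        return ownerAfterRestore(mode) == Owner.FLINK;
    }

    public static void main(String[] args) {
        for (RestoreMode mode : RestoreMode.values()) {
            System.out.println(mode + ": owner=" + ownerAfterRestore(mode)
                    + ", flinkMayDelete=" + flinkMayDelete(mode));
        }
    }
}
```

Just an idea to make the distinction concrete; it deliberately avoids real Flink classes so it cannot go stale with the API.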