Steven Zhen Wu created FLINK-27101:
--------------------------------------
Summary: Periodically break the chain of incremental checkpoint
Key: FLINK-27101
URL: https://issues.apache.org/jira/browse/FLINK-27101
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing
Reporter: Steven Zhen Wu
Incremental checkpointing is almost a must for large-state jobs. It greatly
reduces the bytes uploaded to DFS per checkpoint. However, incremental
checkpointing has a few implications that are problematic for production
operations. I will use S3 as the example DFS in the rest of this description.
1. Because there is no deterministic way to know how far back an incremental
checkpoint can refer to files uploaded to S3, it is very difficult to set an
S3 bucket/object TTL. In one application, we have observed a Flink checkpoint
referring to files uploaded over 6 months ago. An S3 TTL can therefore corrupt
Flink checkpoints by deleting files that they still reference.
An S3 TTL is important for a few reasons:
- Purging orphaned files (like external checkpoints from previous deployments)
to keep the storage cost in check. This problem can be addressed by
implementing proper garbage collection (similar to the JVM's) that traverses
the retained checkpoints of all jobs and follows their file references. But
that is an expensive solution in terms of engineering cost.
- Security and privacy. E.g., there may be a requirement that Flink state can't
keep data for more than some duration threshold (hours/days/weeks). The
application is expected to purge keys to satisfy the requirement. However, with
incremental checkpoints and the way deletion works in RocksDB, it is hard to
set an S3 TTL to purge S3 files. Even though those old S3 files don't contain
live keys, they may still be referenced by retained Flink checkpoints.
2. Occasionally, corrupted checkpoint files (on S3) are observed. As a result,
restoring from the checkpoint fails. With incremental checkpoints, it usually
doesn't help to try older checkpoints, because they may refer to the same
corrupted file. It is unclear whether the corruption happened before or during
the S3 upload. This risk can be mitigated with periodic savepoints.
It all boils down to taking a periodic full snapshot (checkpoint or savepoint)
to deterministically break the chain of incremental checkpoints. Searching the
Jira history, the behavior that FLINK-23949 [1] is trying to fix is actually
close to what we would need here.
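To make the TTL argument concrete, here is a rough sketch of the arithmetic.
It assumes that a successful full snapshot re-uploads all state, so that no
later checkpoint refers to files older than that snapshot. The class and
parameter names are hypothetical and not part of Flink.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of Flink. */
final class S3TtlBound {

    /**
     * Lower bound for an S3 object TTL, assuming every retained checkpoint
     * only references files uploaded at or after the preceding full snapshot.
     *
     * @param fullInterval  time between successful full snapshots
     * @param maxRestoreAge age of the oldest checkpoint we may still restore
     * @param slack         safety margin, e.g. for delayed or retried fulls
     */
    static Duration minimumSafeTtl(
            Duration fullInterval, Duration maxRestoreAge, Duration slack) {
        if (fullInterval.isNegative()
                || maxRestoreAge.isNegative()
                || slack.isNegative()) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        // The oldest restorable checkpoint is at most maxRestoreAge old, and
        // the files it references are at most fullInterval older than it.
        return fullInterval.plus(maxRestoreAge).plus(slack);
    }
}
```

For example, with a daily full snapshot, a 7-day restore window and 6 hours of
slack, the TTL would need to be at least 198 hours. Without a periodic full
snapshot there is no such bound, which is the problem described above.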
There are a few options:
1. Periodically trigger savepoints (via the control plane). This is actually
not a bad practice and might be appealing to some people. The problem is that
it requires a job deployment to break the chain of incremental checkpoints.
Periodic job deployment may sound hacky. If we make the behavior of taking a
full checkpoint after a savepoint (fixed in FLINK-23949) configurable, it might
be an acceptable compromise. The benefit is that no job deployment is required
after savepoints.
2. Build the feature into Flink incremental checkpointing. Periodically (with
some cron-style config) trigger a full checkpoint to break the incremental
chain. If the full checkpoint fails (for whatever reason), the following
checkpoints should attempt a full checkpoint as well until one full checkpoint
completes successfully.
3. For the security/privacy requirement, the main thing is to apply compaction
to the deleted keys. That could probably avoid references to the old files. Is
there any RocksDB compaction that can achieve a full compaction that removes
old delete markers? Recent delete markers are fine.
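As a rough sketch of the retry behavior described in option 2, the decision
logic could look like the following. This is only an illustration under
stated assumptions: the class and method names are hypothetical and not part
of Flink, a fixed interval stands in for the cron-style config, and the start
time is treated as the baseline for the first full checkpoint.

```java
import java.time.Duration;

/** Hypothetical policy object for illustration only; not part of Flink. */
final class PeriodicFullCheckpointPolicy {

    private final long fullIntervalMillis;
    // Time of the last full checkpoint that completed successfully.
    private long lastSuccessfulFullMillis;

    PeriodicFullCheckpointPolicy(Duration fullInterval, long startMillis) {
        if (fullInterval.isZero() || fullInterval.isNegative()) {
            throw new IllegalArgumentException("fullInterval must be positive");
        }
        this.fullIntervalMillis = fullInterval.toMillis();
        // Assumption: the start time counts as the baseline.
        this.lastSuccessfulFullMillis = startMillis;
    }

    /** True once a full checkpoint is due and until one succeeds. */
    boolean shouldTakeFullCheckpoint(long nowMillis) {
        return nowMillis - lastSuccessfulFullMillis >= fullIntervalMillis;
    }

    /** Only a successful full checkpoint resets the schedule. */
    void onCheckpointCompleted(boolean wasFull, long completedAtMillis) {
        if (wasFull) {
            lastSuccessfulFullMillis = completedAtMillis;
        }
    }

    /** A failure changes nothing, so the next checkpoint retries a full one. */
    void onCheckpointFailed() {
        // Intentionally empty.
    }
}
```

Because only a successful full checkpoint advances the schedule, a failed
attempt or an intervening incremental checkpoint leaves the policy asking for
a full checkpoint on every subsequent trigger.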
[1] https://issues.apache.org/jira/browse/FLINK-23949