[
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103518#comment-17103518
]
Steven Zhen Wu edited comment on FLINK-17571 at 5/9/20, 10:59 PM:
------------------------------------------------------------------
[~pnowojski] what is the intended usage of the remove command?
Please correct my understanding of incremental checkpoints.
* Flink removes S3 files when the reference count reaches zero. Normally,
there shouldn't be orphaned checkpoint files lingering around, but in some
rare cases the reference-count-based cleanup may not happen or may fail, so
there is a small chance of orphaned files here.
* We don't always restore from an external checkpoint and continue the same
checkpoint lineage (with incremental checkpointing and reference counting).
E.g., we can restore from a savepoint or from empty state; those abandoned
checkpoint lineages can then leave significant garbage behind.
Here is what I am thinking for the GC:
# Trace from the roots of the retained external checkpoints to find all live files.
# Find all files in the S3 bucket/prefix. I heard S3 can produce a daily
inventory report, so we don't have to list objects ourselves.
# Compute the diff and remove the non-live files (with some safety threshold,
e.g. only files older than 30 days).
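The three GC steps above reduce to set arithmetic over two file listings. A minimal sketch in Python, assuming the live set would come from tracing retained checkpoint metadata and the full listing from an S3 inventory report; the inputs here are hypothetical in-memory listings, not real S3 calls, and all names are illustrative:

```python
import time

# Safety threshold from step 3: only delete files older than 30 days.
SAFETY_THRESHOLD_SECONDS = 30 * 24 * 3600


def find_deletable(live_files, all_files, now=None):
    """Return object keys that are not referenced by any retained
    checkpoint AND are older than the safety threshold.

    live_files: set of keys reachable from retained checkpoints (step 1).
    all_files:  dict mapping key -> last-modified epoch seconds (step 2,
                e.g. parsed from an S3 inventory report).
    """
    now = time.time() if now is None else now
    non_live = set(all_files) - set(live_files)  # step 3: the diff
    return {
        key for key in non_live
        if now - all_files[key] > SAFETY_THRESHOLD_SECONDS
    }


# Hypothetical example: one old orphan, one live file, one recent non-live file.
now = 1_700_000_000
listing = {
    "shared/old-orphan": now - 40 * 24 * 3600,  # old and non-live -> delete
    "shared/live-sst":   now - 40 * 24 * 3600,  # old but live      -> keep
    "shared/recent":     now - 1 * 24 * 3600,   # non-live, recent  -> keep
}
live = {"shared/live-sst"}
print(sorted(find_deletable(live, listing, now=now)))  # ['shared/old-orphan']
```

The age threshold protects against races where a file was just uploaded by an in-flight checkpoint but is not yet referenced by any completed checkpoint metadata.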
> A better way to show the files used in currently checkpoints
> ------------------------------------------------------------
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> Inspired by the
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
> for a checkpoint, the files in the TASKOWNED and EXCLUSIVE directories can be
> deleted safely, but users can't safely delete the files in the SHARED
> directory (those files may have been created a long time ago).
> I think it would be better to give users a way to know which files are
> currently in use (so that, by elimination, the rest are not).
> Maybe a command-line command such as the one below would be enough to
> support such a feature.
> {{./bin/flink checkpoint list $checkpointDir # list all the files used in
> checkpoint}}