[
https://issues.apache.org/jira/browse/FLINK-35178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841029#comment-17841029
]
Yanfei Lei commented on FLINK-35178:
------------------------------------
[~elon]
> the result was the same?
Does that mean the empty shared/ and taskowned/ directories still exist? In our
practice, these folders are generally managed with the help of an external
platform.
> {{state.checkpoints.num-retained=3}}, the older job's checkpoint
> versions are not being discarded even if they are not referenced.
The chk-x information is recorded in the job manager. When the job is restored,
that information is lost. The job manager only knows the chk-x you specified,
not chk-(x-1) or chk-(x-2), so those older checkpoints are not deleted.
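
To illustrate the bookkeeping described above, here is a minimal toy model in
Python. It is only a sketch under stated assumptions, not Flink's actual
implementation: the class ToyCheckpointStore and the function simulate are
hypothetical names, checkpoints are modeled as plain integers, and shared
incremental state files are ignored entirely.

```python
from collections import deque


class ToyCheckpointStore:
    """Hypothetical stand-in for a job manager's retained-checkpoint bookkeeping."""

    def __init__(self, max_retained, on_disk):
        self.max_retained = max_retained  # plays the role of num-retained
        self.on_disk = on_disk            # set of chk ids present in storage
        self.known = deque()              # chk ids this job manager knows about

    def register(self, chk_id):
        # A completed (or restored) checkpoint becomes known to this job manager.
        self.on_disk.add(chk_id)
        self.known.append(chk_id)
        # Only checkpoints this store knows about can ever be discarded.
        while len(self.known) > self.max_retained:
            self.on_disk.discard(self.known.popleft())


def simulate():
    disk = set()

    # Old job: chk-1 .. chk-5 with retention 3 leaves chk-3, chk-4, chk-5.
    old_job = ToyCheckpointStore(3, disk)
    for chk in range(1, 6):
        old_job.register(chk)

    # New job restored from chk-5: its store starts out knowing only chk-5.
    new_job = ToyCheckpointStore(3, disk)
    new_job.register(5)
    for chk in range(6, 9):
        new_job.register(chk)

    return sorted(disk)


print(simulate())
```

In this model the new job eventually discards chk-5 because it was registered
on restore, while chk-3 and chk-4 stay on disk because no running job manager
knows about them.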
> Checkpoint CLAIM mode does not fully control snapshot ownership
> ---------------------------------------------------------------
>
> Key: FLINK-35178
> URL: https://issues.apache.org/jira/browse/FLINK-35178
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.18.0
> Reporter: elon_X
> Priority: Major
> Attachments: image-2024-04-20-14-51-21-062.png,
> image-2024-04-22-15-16-02-381.png
>
>
> When I enable incremental checkpointing, and the task fails or is canceled
> for some reason, restarting the task from {{-s checkpoint_path}} with
> {{restoreMode CLAIM}} allows the Flink job to recover from the last
> checkpoint; it just discards the previous checkpoint.
> Then I found that this leads to the following two cases:
> 1. If the new checkpoint_x meta file does not reference files in the shared
> directory under the previous jobID:
> the shared and taskowned directories from the previous Job will be left as
> empty directories, and these two directories will persist without being
> deleted by Flink. !image-2024-04-20-14-51-21-062.png!
> 2. If the new checkpoint_x meta file references files in the shared directory
> under the previous jobID:
> the chk-(x-1) from the previous job will be discarded, but there will still
> be state data in the shared directory under that job, which might persist for
> a relatively long time. Here arises the question: the previous job is no
> longer running, and it's unclear whether users should delete the state data.
> Deleting it could lead to errors when the task is restarted, as the meta
> might reference files that can no longer be found; this could be confusing
> for users.
>
> A potential solution might be to reuse the previous job's jobID when
> restoring from {{-s checkpoint_path}}, or to add a new parameter that
> allows users to specify the jobID they want to recover from.
>
> Please correct me if there's anything I've misunderstood.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)