[ https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133854#comment-17133854 ]
Yun Tang commented on FLINK-18263:
----------------------------------

I think this feature depends on how we define the {{FINISHED}} job status. If all tasks have finished, why would we still need to keep that checkpoint, given that the job has already completed its life cycle? CC [~zjwang], [~zhuzh], as they might have more thoughts on the job status definition.

As you mentioned, we could rewind a job (that reached the FINISHED state) to a previous checkpoint if checkpoints were retained on FINISHED status. However, the time of the last checkpoint would not be very precise, so I am not sure how much this would help. A manual savepoint might be more useful in your scenario.

> Allow external checkpoints to be persisted even when the job is in "Finished"
> state.
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-18263
>                 URL: https://issues.apache.org/jira/browse/FLINK-18263
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Mark Cho
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, the `execution.checkpointing.externalized-checkpoint-retention`
> configuration supports two options:
> - `DELETE_ON_CANCELLATION`, which keeps the externalized checkpoints in the
> FAILED and SUSPENDED states.
> - `RETAIN_ON_CANCELLATION`, which keeps the externalized checkpoints in the
> FAILED, SUSPENDED, and CANCELED states.
> This gives us control over the retention of externalized checkpoints in all
> terminal states of a job, except for the FINISHED state.
> If the job ends up in the "FINISHED" state, externalized checkpoints are
> automatically cleaned up, and there is currently no config that ensures
> these externalized checkpoints are persisted.
> I found an old Jira ticket, FLINK-4512, where this was discussed. I think it
> would be helpful to have a config that can control the retention policy for
> the FINISHED state as well.
> - This can be useful for cases where we want to rewind a job (that reached
> the FINISHED state) to a previous checkpoint.
> - When we use externalized checkpoints, we want to fully delegate the
> checkpoint clean-up to an external process in all job states (without
> cherrypicking FINISHED state to be cleaned up by Flink).
> We have a quick fix working in our fork where we've changed
> `ExternalizedCheckpointCleanup` enum:
> {code:java}
> RETAIN_ON_FAILURE (renamed from DELETE_ON_CANCELLATION; retains on FAILED)
> RETAIN_ON_CANCELLATION (kept the same; retains on FAILED, CANCELED)
> RETAIN_ON_SUCCESS (added; retains on FAILED, CANCELED, FINISHED)
> {code}
> Since this change requires changes to multiple components (e.g. config
> values, REST API, Web UI, etc), I wanted to get the community's thoughts
> before I invest more time in my quick fix PR (which currently only contains
> minimal change to get this working).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
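
For reference, the retention semantics proposed in the quoted description could be sketched roughly as follows. This is only an illustration and not the code from the fork or from Flink: `TerminalJobStatus`, `ProposedCheckpointRetention`, and `retainsCheckpointOn` are hypothetical names, and retaining checkpoints on SUSPENDED under every option is an assumption carried over from the description of the current behavior (the proposal itself does not list SUSPENDED).

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified stand-in for the terminal job states named in the ticket.
enum TerminalJobStatus { FAILED, SUSPENDED, CANCELED, FINISHED }

// Sketch of the three proposed options; NOT Flink's actual ExternalizedCheckpointCleanup enum.
enum ProposedCheckpointRetention {
    // Renamed from DELETE_ON_CANCELLATION in the proposal.
    // Assumption: SUSPENDED keeps checkpoints under every option, as it does today.
    RETAIN_ON_FAILURE(EnumSet.of(
            TerminalJobStatus.FAILED, TerminalJobStatus.SUSPENDED)),

    // Unchanged in the proposal.
    RETAIN_ON_CANCELLATION(EnumSet.of(
            TerminalJobStatus.FAILED, TerminalJobStatus.SUSPENDED,
            TerminalJobStatus.CANCELED)),

    // Added in the proposal: also keeps checkpoints when the job finishes.
    RETAIN_ON_SUCCESS(EnumSet.allOf(TerminalJobStatus.class));

    private final Set<TerminalJobStatus> retainedStates;

    ProposedCheckpointRetention(Set<TerminalJobStatus> retainedStates) {
        this.retainedStates = retainedStates;
    }

    // Returns true if externalized checkpoints should be kept when the job ends in `status`.
    boolean retainsCheckpointOn(TerminalJobStatus status) {
        return retainedStates.contains(status);
    }
}
```

With this shape, a cleanup hook would only need to call `retainsCheckpointOn(status)` once the job reaches a terminal state, and `RETAIN_ON_SUCCESS` would be the only option that leaves clean-up entirely to an external process.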