Mark Cho created FLINK-18263:
--------------------------------
Summary: Allow external checkpoints to be persisted even when the
job is in "Finished" state.
Key: FLINK-18263
URL: https://issues.apache.org/jira/browse/FLINK-18263
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Reporter: Mark Cho
Currently, `execution.checkpointing.externalized-checkpoint-retention`
configuration supports two options:
- `DELETE_ON_CANCELLATION`, which keeps the externalized checkpoints in the FAILED
and SUSPENDED states.
- `RETAIN_ON_CANCELLATION`, which keeps the externalized checkpoints in the FAILED,
SUSPENDED, and CANCELED states.
This gives us control over the retention of externalized checkpoints in all
terminal states of a job except the FINISHED state.
If the job ends up in the "FINISHED" state, externalized checkpoints are
automatically cleaned up, and there is currently no config option that ensures
these externalized checkpoints are persisted.
I found an old Jira ticket, FLINK-4512, where this was discussed. I think it
would be helpful to have a config that can control the retention policy for
the FINISHED state as well.
- This can be useful for cases where we want to rewind a job (that reached the
FINISHED state) to a previous checkpoint.
- When we use externalized checkpoints, we want to fully delegate the
checkpoint clean-up to an external process in all job states (without
cherry-picking the FINISHED state to be cleaned up by Flink).
We have a quick fix working in our fork where we've changed the
`ExternalizedCheckpointCleanup` enum:
{code:java}
enum ExternalizedCheckpointCleanup {
    RETAIN_ON_FAILURE,      // renamed from DELETE_ON_CANCELLATION; retains on FAILED
    RETAIN_ON_CANCELLATION, // kept the same; retains on FAILED, CANCELED
    RETAIN_ON_SUCCESS       // added; retains on FAILED, CANCELED, FINISHED
}
{code}
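To make the intended semantics concrete, here is a minimal, self-contained sketch of how each proposed value could map to the job states in which checkpoints are retained. It is not the code from our fork and does not use Flink's classes: `TerminalState`, `RetentionPolicy`, `retains`, and `RetentionDemo` are hypothetical names introduced only for illustration. SUSPENDED is left out because the listing above does not say how the new values treat it.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the job states named in the listing above.
// This is NOT Flink's JobStatus; SUSPENDED is omitted on purpose.
enum TerminalState { FAILED, CANCELED, FINISHED }

// Hypothetical model of the proposed enum values (not Flink's actual
// ExternalizedCheckpointCleanup). Each value carries the set of states
// in which externalized checkpoints would be kept.
enum RetentionPolicy {
    RETAIN_ON_FAILURE(EnumSet.of(TerminalState.FAILED)),
    RETAIN_ON_CANCELLATION(EnumSet.of(TerminalState.FAILED, TerminalState.CANCELED)),
    RETAIN_ON_SUCCESS(EnumSet.allOf(TerminalState.class));

    private final Set<TerminalState> retainedStates;

    RetentionPolicy(Set<TerminalState> retainedStates) {
        this.retainedStates = retainedStates;
    }

    // True if checkpoints should be kept when the job ends in the given state.
    boolean retains(TerminalState state) {
        return retainedStates.contains(state);
    }
}

class RetentionDemo {
    public static void main(String[] args) {
        // Print, for every policy, whether a FINISHED job keeps its checkpoints.
        for (RetentionPolicy policy : RetentionPolicy.values()) {
            System.out.println(policy + " retains on FINISHED: "
                    + policy.retains(TerminalState.FINISHED));
        }
    }
}
```

Under this model, only `RETAIN_ON_SUCCESS` keeps checkpoints for a FINISHED job, and each value retains a superset of the states covered by the one before it.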
Since this change requires changes to multiple components (e.g. config values,
REST API, Web UI, etc.), I wanted to get the community's thoughts before I
invest more time in my quick fix PR (which currently contains only the minimal
change needed to get this working).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)