[ https://issues.apache.org/jira/browse/FLINK-26606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated FLINK-26606:
----------------------------------
Priority: Critical (was: Major)
> CompletedCheckpoints that failed to be discarded are not stored in the
> CompletedCheckpointStore
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-26606
> URL: https://issues.apache.org/jira/browse/FLINK-26606
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Critical
>
> We introduced a repeatable per-job cleanup that runs after the job reaches a
> globally-terminated state. It also tries to clean up the
> {{CompletedCheckpointStore}}. But we missed one code path, in which the
> {{CheckpointsCleaner}} tries to discard {{CompletedCheckpoints}}. If that
> discard fails, the {{CompletedCheckpointStore}} no longer holds any
> references to these {{CompletedCheckpoints}}, so the shutdown at the end is
> not able to clean them up.
> We should not remove the {{CompletedCheckpoints}} from the
> {{CompletedCheckpointStore}} if the deletion failed. This would enable us to
> retry deleting these artifacts at the end of the job and consider them in the
> retryable cleanup as well.
> The documentation was updated to cover this issue. Fixing this issue should
> also include removing the corresponding paragraph from the documentation (see
> [related flink-docs PR|https://github.com/apache/flink/pull/19058]).
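
The behavior proposed in the description (keep a checkpoint referenced when its deletion fails, then retry at the end of the job) could look roughly like the sketch below. This is only an illustration under stated assumptions: `Checkpoint`, `Store`, `FlakyCheckpoint`, `retryFailedDiscards` and `pendingDiscards` are hypothetical names invented for this example and are not Flink's actual classes or methods.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; none of these types are Flink's real classes. */
final class RetainingStoreSketch {

    /** Stand-in for a completed checkpoint whose artifacts can be deleted. */
    interface Checkpoint {
        long id();

        /** Deletes the checkpoint's artifacts; may fail, e.g. on storage errors. */
        void discard() throws Exception;
    }

    /** Store that never loses track of a checkpoint whose discard failed. */
    static final class Store {
        private final int maxRetained;
        private final Deque<Checkpoint> retained = new ArrayDeque<>();
        private final List<Checkpoint> failedDiscards = new ArrayList<>();

        Store(int maxRetained) {
            this.maxRetained = maxRetained;
        }

        /** Adds a checkpoint and tries to discard the ones that are subsumed. */
        void add(Checkpoint checkpoint) {
            retained.addLast(checkpoint);
            while (retained.size() > maxRetained) {
                Checkpoint oldest = retained.removeFirst();
                try {
                    oldest.discard();
                } catch (Exception e) {
                    // Deletion failed: keep a reference so a later cleanup can retry.
                    failedDiscards.add(oldest);
                }
            }
        }

        /** Retries all failed discards; returns true once none are left. */
        boolean retryFailedDiscards() {
            Iterator<Checkpoint> it = failedDiscards.iterator();
            while (it.hasNext()) {
                Checkpoint checkpoint = it.next();
                try {
                    checkpoint.discard();
                    it.remove();
                } catch (Exception e) {
                    // Still failing: leave it in the list for the next retry.
                }
            }
            return failedDiscards.isEmpty();
        }

        int pendingDiscards() {
            return failedDiscards.size();
        }
    }

    /** Test double whose discard fails a fixed number of times before succeeding. */
    static final class FlakyCheckpoint implements Checkpoint {
        private final long id;
        private int failuresLeft;
        private boolean discarded;

        FlakyCheckpoint(long id, int failuresLeft) {
            this.id = id;
            this.failuresLeft = failuresLeft;
        }

        @Override
        public long id() {
            return id;
        }

        @Override
        public void discard() throws Exception {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("simulated deletion failure for checkpoint " + id);
            }
            discarded = true;
        }

        boolean isDiscarded() {
            return discarded;
        }
    }
}
```

In this sketch a retryable cleanup would call `retryFailedDiscards()` repeatedly until it returns `true`. Whether the failed checkpoints are tracked in a separate list, as here, or simply left in the store's main collection is a design choice the issue does not specify.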
--
This message was sent by Atlassian Jira
(v8.20.1#820001)