[ https://issues.apache.org/jira/browse/FLINK-26606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514688#comment-17514688 ]
Matthias Pohl commented on FLINK-26606:
---------------------------------------
There's a problem with this approach: {{StateHandleStore.getAllAndLock}} tries
to retrieve all Checkpoints currently stored in the {{StateHandleStore}}. This
is used during recovery. The {{CompletedCheckpoints}} are then used as an input
parameter for the recovered {{CompletedCheckpointStore}}. The ZK implementation
of this method will enter a retry loop if it detects a checkpoint that is
marked for deletion (because of the way
[ZooKeeperStateHandleStore.getAndLock(String)|https://github.com/apache/flink/blob/c3df4c3f1f868d40e1e70404bea41b7a007e8b08/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L411]
is implemented).
In contrast, the k8s implementation just ignores those entries.
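
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Flink's actual classes: {{RecoverySketch}}, {{Entry}}, {{recoverSkipping}} and {{recoverWithRetry}} are hypothetical names, and the bounded {{maxAttempts}} is an assumption added only so the example terminates. It models the two behaviours as described above, not the real implementations.

```java
import java.util.ArrayList;
import java.util.List;

class RecoverySketch {

    /** Hypothetical stand-in for a stored checkpoint handle. */
    static final class Entry {
        final String name;
        final boolean markedForDeletion;

        Entry(String name, boolean markedForDeletion) {
            this.name = name;
            this.markedForDeletion = markedForDeletion;
        }
    }

    /** k8s-like behaviour as described above: marked entries are ignored. */
    static List<String> recoverSkipping(List<Entry> entries) {
        List<String> recovered = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.markedForDeletion) {
                recovered.add(e.name);
            }
        }
        return recovered;
    }

    /**
     * ZK-like behaviour as described above: hitting a marked entry restarts
     * the whole pass. maxAttempts is an assumption of this sketch.
     */
    static List<String> recoverWithRetry(List<Entry> entries, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<String> recovered = new ArrayList<>();
            boolean sawMarked = false;
            for (Entry e : entries) {
                if (e.markedForDeletion) {
                    sawMarked = true;
                    break; // abandon this pass and retry from the start
                }
                recovered.add(e.name);
            }
            if (!sawMarked) {
                return recovered;
            }
        }
        throw new IllegalStateException(
                "entry still marked for deletion after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("chk-1", false),
                new Entry("chk-2", true), // deletion failed, marker left behind
                new Entry("chk-3", false));

        System.out.println("skipping: " + recoverSkipping(entries));
        try {
            recoverWithRetry(entries, 3);
        } catch (IllegalStateException e) {
            System.out.println("retrying: " + e.getMessage());
        }
    }
}
```

If the marker is never removed, the retrying variant cannot make progress, whereas the skipping variant silently drops the entry from the recovered set.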
> CompletedCheckpoints that failed to be discarded are not stored in the
> CompletedCheckpointStore
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-26606
> URL: https://issues.apache.org/jira/browse/FLINK-26606
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Critical
>
> We introduced a repeatable per-job cleanup after the job reached a
> globally-terminated state. It also tries to clean up the
> {{CompletedCheckpointStore}}. But we missed one code path, in which the
> {{CheckpointsCleaner}} tries to discard {{CompletedCheckpoints}}. The
> {{CompletedCheckpointStore}} no longer holds any references to these
> {{CompletedCheckpoints}}, so the shutdown at the end of the job is not able
> to clean them up.
> We should not remove the {{CompletedCheckpoints}} from the
> {{CompletedCheckpointStore}} if the deletion failed. This would enable us to
> retry deleting these artifacts at the end of the job and consider them in the
> retryable cleanup as well.
> The documentation was updated to cover this issue. Fixing this issue should
> also include removing the corresponding paragraph from the documentation (see
> [related flink-docs PR|https://github.com/apache/flink/pull/19058]).
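
The fix proposed in the description (keep a checkpoint in the store until its deletion has actually succeeded) could look roughly like the following sketch. This is plain Java under stated assumptions, not Flink code: {{RetainOnFailedDiscardSketch}}, {{Checkpoint}}, {{FlakyCheckpoint}} and {{discardAll}} are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class RetainOnFailedDiscardSketch {

    /** Hypothetical checkpoint whose discard may fail. */
    interface Checkpoint {
        String id();

        void discard() throws Exception;
    }

    /** Hypothetical checkpoint that fails to discard a fixed number of times. */
    static final class FlakyCheckpoint implements Checkpoint {
        private final String id;
        private int failuresLeft;

        FlakyCheckpoint(String id, int failuresLeft) {
            this.id = id;
            this.failuresLeft = failuresLeft;
        }

        @Override
        public String id() {
            return id;
        }

        @Override
        public void discard() throws Exception {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new Exception("could not delete artifacts of " + id);
            }
        }
    }

    private final List<Checkpoint> checkpoints = new ArrayList<>();

    void add(Checkpoint checkpoint) {
        checkpoints.add(checkpoint);
    }

    /**
     * Tries to discard every stored checkpoint. Only checkpoints whose
     * discard succeeded are removed; the rest stay for a later retry.
     * Returns the number of checkpoints still held.
     */
    int discardAll() {
        Iterator<Checkpoint> it = checkpoints.iterator();
        while (it.hasNext()) {
            Checkpoint checkpoint = it.next();
            try {
                checkpoint.discard();
                it.remove();
            } catch (Exception e) {
                // Deletion failed: keep the reference so cleanup can retry.
            }
        }
        return checkpoints.size();
    }

    List<String> remainingIds() {
        List<String> ids = new ArrayList<>();
        for (Checkpoint checkpoint : checkpoints) {
            ids.add(checkpoint.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        RetainOnFailedDiscardSketch store = new RetainOnFailedDiscardSketch();
        store.add(new FlakyCheckpoint("chk-1", 0));
        store.add(new FlakyCheckpoint("chk-2", 1)); // first discard fails

        System.out.println("left after first pass: " + store.discardAll());
        System.out.println("left after retry: " + store.discardAll());
    }
}
```

Because the failed checkpoint is still referenced, a retryable cleanup step can simply call the discard pass again until nothing is left.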
--
This message was sent by Atlassian Jira
(v8.20.1#820001)