[ https://issues.apache.org/jira/browse/FLINK-26606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514688#comment-17514688 ]

Matthias Pohl commented on FLINK-26606:
---------------------------------------

There's a problem with this approach: {{StateHandleStore.getAllAndLock}} tries 
to retrieve all checkpoints currently stored in the {{StateHandleStore}}. It is 
used during recovery, and the retrieved {{CompletedCheckpoints}} are then passed 
as input to the recovered {{CompletedCheckpointStore}}. The ZK implementation 
of this method enters a retry loop if it detects a checkpoint that is marked 
for deletion (because of the way 
[ZooKeeperStateHandleStore.getAndLock(String)|https://github.com/apache/flink/blob/c3df4c3f1f868d40e1e70404bea41b7a007e8b08/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L411] 
is implemented).

In contrast, the k8s implementation just ignores those entries.
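
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It is only a model, not Flink code: {{GetAllAndLockSketch}}, {{Entry}}, {{getAllSkipping}}, {{getAllRetrying}} and the {{maxAttempts}} bound are hypothetical names introduced for illustration, and the real implementations work against ZooKeeper nodes and Kubernetes ConfigMaps rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two getAllAndLock behaviours; not Flink code.
final class GetAllAndLockSketch {

    /** Stand-in for one stored checkpoint entry (assumption, not a Flink class). */
    static final class Entry {
        final String name;
        boolean markedForDeletion;

        Entry(String name, boolean markedForDeletion) {
            this.name = name;
            this.markedForDeletion = markedForDeletion;
        }
    }

    /** k8s-like behaviour as described above: entries marked for deletion are skipped. */
    static List<String> getAllSkipping(List<Entry> entries) {
        List<String> result = new ArrayList<>();
        for (Entry entry : entries) {
            if (!entry.markedForDeletion) {
                result.add(entry.name);
            }
        }
        return result;
    }

    /**
     * ZK-like behaviour as described above: seeing an entry that is marked for
     * deletion restarts the whole retrieval. The maxAttempts bound is an
     * assumption added so that this sketch always terminates.
     */
    static List<String> getAllRetrying(List<Entry> entries, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<String> result = new ArrayList<>();
            boolean sawMarkedEntry = false;
            for (Entry entry : entries) {
                if (entry.markedForDeletion) {
                    sawMarkedEntry = true;
                    break; // abandon this pass and retry from the start
                }
                result.add(entry.name);
            }
            if (!sawMarkedEntry) {
                return result;
            }
        }
        throw new IllegalStateException(
                "entry marked for deletion still present after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("chk-1", false),
                new Entry("chk-2", true), // discard failed, entry stays marked
                new Entry("chk-3", false));

        System.out.println(getAllSkipping(entries)); // prints [chk-1, chk-3]
        try {
            getAllRetrying(entries, 3);
        } catch (IllegalStateException e) {
            System.out.println("retrying variant gave up: " + e.getMessage());
        }
    }
}
```

In this model, an entry that stays marked for deletion (for example because its discard keeps failing) never lets the retrying variant finish, while the skipping variant silently drops it from the recovered set.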

> CompletedCheckpoints that failed to be discarded are not stored in the 
> CompletedCheckpointStore
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26606
>                 URL: https://issues.apache.org/jira/browse/FLINK-26606
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>
> We introduced a repeatable per-job cleanup that runs after the job has 
> reached a globally-terminated state. It also tries to clean up the 
> {{CompletedCheckpointStore}}. But we missed one code path in which the 
> {{CheckpointsCleaner}} attempts to discard {{CompletedCheckpoints}}. The 
> {{CompletedCheckpointStore}} no longer holds any references to these 
> {{CompletedCheckpoints}}, so the shutdown at the end is not able to clean 
> these checkpoints up.
> We should not remove the {{CompletedCheckpoints}} from the 
> {{CompletedCheckpointStore}} if the deletion failed. This would enable us to 
> retry deleting these artifacts at the end of the job and consider them in the 
> retryable cleanup as well.
> The documentation was updated to cover this issue. Fixing this issue should 
> also include removing the corresponding paragraph from the documentation (see 
> [related flink-docs PR|https://github.com/apache/flink/pull/19058]).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
