[ 
https://issues.apache.org/jira/browse/FLINK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514663#comment-17514663
 ] 

Matthias Pohl edited comment on FLINK-26742 at 3/30/22, 12:27 PM:
------------------------------------------------------------------

FLINK-26606 and FLINK-26742 essentially have the same cause: we don't mark 
checkpoints for deletion before actually discarding the state. Instead, we just 
remove the entry from the {{StateHandleStore}}. If discarding the state then 
fails, the deletion cannot be retried because the metadata is already gone. I'm 
closing this issue in favor of FLINK-26606 because both can be fixed in the 
same way:

We have to make it possible to mark the checkpoint for deletion and only remove 
the entry in the {{StateHandleStore}} after the actual deletion was successful.

For checkpoint subsumption while the job is still running, we just try to 
delete the checkpoint. If the actual deletion fails, the checkpoint metadata 
stays in the {{StateHandleStore}}, marked for deletion. The shutdown logic of 
the {{CompletedCheckpointStore}} will then try to clean up the checkpoints once 
more and will retry in case of failure.
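
The ordering proposed above (mark first, discard second, remove the entry only 
after a successful discard) could look roughly like the sketch below. It is a 
hypothetical in-memory model, not Flink code: {{MarkForDeletionSketch}}, 
{{Discarder}}, {{tryDelete}} and {{retryMarked}} are names made up for 
illustration, and the real {{StateHandleStore}} and {{CompletedCheckpointStore}} 
interfaces look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model; none of these types are Flink's actual classes. */
public class MarkForDeletionSketch {

    /** Stand-in for the code that discards the checkpoint state; it may fail. */
    public interface Discarder {
        void discard(long checkpointId) throws Exception;
    }

    /** Stand-in for the store: checkpoint ID -> "marked for deletion" flag. */
    private final Map<Long, Boolean> entries = new LinkedHashMap<>();

    public void add(long checkpointId) {
        entries.put(checkpointId, false);
    }

    public boolean contains(long checkpointId) {
        return entries.containsKey(checkpointId);
    }

    public boolean isMarkedForDeletion(long checkpointId) {
        return Boolean.TRUE.equals(entries.get(checkpointId));
    }

    /**
     * Marks the entry, then discards the state, and removes the entry only
     * if the discard succeeded. Returns false if the entry had to be kept.
     */
    public boolean tryDelete(long checkpointId, Discarder discarder) {
        if (!entries.containsKey(checkpointId)) {
            return true; // nothing left to delete
        }
        entries.put(checkpointId, true); // 1. mark for deletion
        try {
            discarder.discard(checkpointId); // 2. actual deletion
        } catch (Exception e) {
            return false; // entry stays in the store, still marked
        }
        entries.remove(checkpointId); // 3. drop the metadata last
        return true;
    }

    /**
     * Shutdown-style pass (assumed behavior): retries every entry that is
     * still marked and returns the IDs whose deletion failed again.
     */
    public List<Long> retryMarked(Discarder discarder) {
        List<Long> failed = new ArrayList<>();
        for (Long id : new ArrayList<>(entries.keySet())) {
            if (isMarkedForDeletion(id) && !tryDelete(id, discarder)) {
                failed.add(id);
            }
        }
        return failed;
    }
}
```

In this model a failed {{tryDelete}} leaves the entry visible and marked, so a 
later {{retryMarked}} call still finds it. That is the property the current 
remove-then-discard order lacks.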


was (Author: mapohl):
FLINK-26606 and FLINK-26742 essentially have the same cause. We don't mark 
checkpoints as deleted before actually discarding the state but just remove the 
entry from the {{StateHandleStore}}. Any subsequent failure won't be detected 
because the metadata is gone. I'm closing this issue in favor of FLINK-26606 
because both are fixed in the same way:

We have to make it possible to mark the checkpoint for deletion and only remove 
the entry in the {{StateHandleStore}} after the actual deletion was successful.

For the checkpoint subsumption while the job is still running, we just try to 
delete it and keep the Checkpoint metadata in the {{StateHandleStore}} marked 
for deletion if the actual deletion failed. The shutdown logic of the 
{{CompletedCheckpointStore}} will then try to clean the checkpoints once more 
and will retry it in case of failure.

> DefaultCompletedCheckpointStore.shutdown does not clean the checkpoints 
> atomically
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-26742
>                 URL: https://issues.apache.org/jira/browse/FLINK-26742
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>
> {{DefaultCompletedCheckpointStore.shutdown}} removes the checkpoint entry 
> from the {{StateHandleStore}} first and only then runs the actual cleanup of 
> the checkpoint. If discarding the {{CompletedCheckpoint}} fails, its entry is 
> already gone, so the metadata is lost and a retry no longer picks the 
> checkpoint up.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
