[
https://issues.apache.org/jira/browse/FLINK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514663#comment-17514663
]
Matthias Pohl edited comment on FLINK-26742 at 3/30/22, 12:27 PM:
------------------------------------------------------------------
FLINK-26606 and FLINK-26742 essentially have the same cause. We don't mark
checkpoints as deleted before actually discarding the state. Instead, we just
remove the entry from the {{StateHandleStore}}. If discarding the state then
fails, the cleanup can't be retried because the metadata is gone. I'm closing
this issue in favor of
FLINK-26606 because both can be fixed in the same way:
We have to make it possible to mark the checkpoint for deletion and only remove
the entry in the {{StateHandleStore}} after the actual deletion was successful.
For the checkpoint subsumption while the job is still running, we just try to
delete it and keep the checkpoint metadata in the {{StateHandleStore}} marked
for deletion if the actual deletion failed. The shutdown logic of the
{{CompletedCheckpointStore}} will then try to clean the checkpoints once more
and will retry in case of failure.
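
The ordering described above (mark first, discard, remove the entry only on success) can be sketched roughly as follows. This is a minimal illustration, not Flink's actual code: {{MarkThenDeleteStore}}, {{Discarder}}, {{tryDelete}} and {{cleanAll}} are hypothetical names, and an in-memory map stands in for the real {{StateHandleStore}}.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "mark, discard, then remove"; not Flink's real API. */
class MarkThenDeleteStore {

    /** Stand-in for the code that discards a checkpoint's state; it may fail. */
    interface Discarder {
        void discard(long checkpointId) throws Exception;
    }

    // checkpoint id -> true if the entry has been marked for deletion
    private final Map<Long, Boolean> entries = new LinkedHashMap<>();

    void add(long checkpointId) {
        entries.put(checkpointId, false);
    }

    boolean contains(long checkpointId) {
        return entries.containsKey(checkpointId);
    }

    boolean isMarkedForDeletion(long checkpointId) {
        return entries.getOrDefault(checkpointId, false);
    }

    /**
     * Marks the entry, discards the state, and removes the entry only after
     * the discard succeeded. Returns true if the checkpoint is fully cleaned up.
     */
    boolean tryDelete(long checkpointId, Discarder discarder) {
        if (!entries.containsKey(checkpointId)) {
            return true; // nothing left to clean
        }
        entries.put(checkpointId, true); // mark before touching the state
        try {
            discarder.discard(checkpointId);
        } catch (Exception e) {
            return false; // entry stays in the store, still marked, so a retry can find it
        }
        entries.remove(checkpointId); // safe now: the state is gone
        return true;
    }

    /** Shutdown-style pass over all entries; returns the ids that still failed. */
    List<Long> cleanAll(Discarder discarder) {
        List<Long> failed = new ArrayList<>();
        for (Long id : new ArrayList<>(entries.keySet())) {
            if (!tryDelete(id, discarder)) {
                failed.add(id);
            }
        }
        return failed;
    }
}
```

With this ordering, a failed discard during subsumption leaves a marked entry behind, and a later {{cleanAll}} call (the shutdown path in this sketch) picks it up again instead of losing track of it.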
was (Author: mapohl):
FLINK-26606 and FLINK-26742 essentially have the same cause. We don't mark
checkpoints as deleted before actually discarding the state but just remove the
entry from the {{StateHandleStore}}. Any subsequent failure won't be detected
because the metadata is gone. I'm closing this issue in favor of FLINK-26606
because both are fixed in the same way:
We have to make it possible to mark the checkpoint for deletion and only remove
the entry in the {{StateHandleStore}} after the actual deletion was successful.
For the checkpoint subsumption while the job is still running, we just try to
delete it and keep the checkpoint metadata in the {{StateHandleStore}} marked
for deletion if the actual deletion failed. The shutdown logic of the
{{CompletedCheckpointStore}} will then try to clean the checkpoints once more
and will retry in case of failure.
> DefaultCompletedCheckpointStore.shutdown does not clean the checkpoints
> atomically
> ----------------------------------------------------------------------------------
>
> Key: FLINK-26742
> URL: https://issues.apache.org/jira/browse/FLINK-26742
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Critical
>
> {{DefaultCompletedCheckpointStore.shutdown}} removes the checkpoint entry
> from the {{StateHandleStore}} and only runs the actual cleanup of the
> checkpoint after the entry has been removed. That means the reference to the
> checkpoint is lost if there's an error while discarding the
> {{CompletedCheckpoint}}; as a consequence, the checkpoint is not picked up
> again during a retry.