[ https://issues.apache.org/jira/browse/FLINK-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711205#comment-16711205 ]
vinoyang commented on FLINK-10855:
----------------------------------
[~yunta] Many factors can prevent a checkpoint from becoming a
CompletedCheckpoint. The PendingCheckpoint failure that Till describes is
just one of them.
I wonder whether we could introduce a periodically executed
CheckpointTrashCleaner that inspects the metadata of each checkpoint and
uses the checkpoint timeout to decide whether it has failed permanently.
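
To make the idea more concrete, here is a rough sketch of one possible shape for such a cleaner. Everything in it is hypothetical and not existing Flink code: the class and method names are made up, it works on a local path via java.nio rather than Flink's FileSystem abstraction, and it assumes that checkpoint directories are named chk-<id> and that a completed checkpoint contains a _metadata file. A directory without that file that has not been modified for longer than the checkpoint timeout is treated as permanently failed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

/** Hypothetical sketch of the proposed cleaner; not an existing Flink class. */
final class CheckpointTrashCleaner {

    private final Path checkpointRoot;
    private final Duration checkpointTimeout;

    CheckpointTrashCleaner(Path checkpointRoot, Duration checkpointTimeout) {
        this.checkpointRoot = checkpointRoot;
        this.checkpointTimeout = checkpointTimeout;
    }

    /** Runs cleanOnce repeatedly with a fixed delay between runs. */
    ScheduledFuture<?> schedule(ScheduledExecutorService executor, Duration interval) {
        return executor.scheduleWithFixedDelay(() -> {
            try {
                cleanOnce(Instant.now());
            } catch (IOException | UncheckedIOException e) {
                // Assumption: a failed pass is simply retried on the next run.
            }
        }, interval.toMillis(), interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** One cleanup pass; returns the checkpoint directories that were removed. */
    List<Path> cleanOnce(Instant now) throws IOException {
        // Collect candidates first so that nothing is deleted while listing.
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(checkpointRoot, "chk-*")) {
            for (Path dir : dirs) {
                if (Files.isDirectory(dir) && isPermanentlyFailed(dir, now)) {
                    candidates.add(dir);
                }
            }
        }
        for (Path dir : candidates) {
            deleteRecursively(dir);
        }
        return candidates;
    }

    /** Assumed rule: no _metadata file and untouched for longer than the timeout. */
    boolean isPermanentlyFailed(Path dir, Instant now) throws IOException {
        if (Files.exists(dir.resolve("_metadata"))) {
            return false; // looks like a completed checkpoint
        }
        Instant lastModified = Files.getLastModifiedTime(dir).toInstant();
        return lastModified.plus(checkpointTimeout).isBefore(now);
    }

    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

The directory's modification time is only a stand-in for "the checkpoint was triggered more than one timeout ago". A real implementation would probably need a more reliable signal, and would have to make sure it never races with a checkpoint that is still pending.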
What do you think? [~till.rohrmann]
> CheckpointCoordinator does not delete checkpoint directory of late/failed
> checkpoints
> -------------------------------------------------------------------------------------
>
> Key: FLINK-10855
> URL: https://issues.apache.org/jira/browse/FLINK-10855
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Reporter: Till Rohrmann
> Assignee: vinoyang
> Priority: Major
>
> If an acknowledge checkpoint message arrives late or a checkpoint cannot
> be acknowledged, we discard the subtask state in the
> {{CheckpointCoordinator}}. What does not happen in this case is that we
> delete the parent directory of the checkpoint. This only happens when we
> dispose a {{PendingCheckpoint}} ({{PendingCheckpoint#dispose}}).
> Due to this behaviour it can happen that a checkpoint fails (e.g. because a
> task is not ready) and we delete the checkpoint directory. Next, another task
> writes its checkpoint data to the checkpoint directory (thereby creating it
> again) and sends an acknowledge message back to the
> {{CheckpointCoordinator}}. The {{CheckpointCoordinator}} will realize that
> there is no longer a {{PendingCheckpoint}} and will discard the subtask
> state. This removes the state files from the checkpoint directory but leaves
> the checkpoint directory itself in place.