[ https://issues.apache.org/jira/browse/FLINK-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842420#comment-15842420 ]
Ufuk Celebi commented on FLINK-5667:
------------------------------------

Very good catch! :-)

> Possible state data loss when task fails while checkpointing
> ------------------------------------------------------------
>
>                 Key: FLINK-5667
>                 URL: https://issues.apache.org/jira/browse/FLINK-5667
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>             Fix For: 1.2.0, 1.3.0
>
>
> It is possible that Flink loses state data when a {{Task}} fails while a
> checkpoint is being drawn. The scenario is the following:
> Flink has finished the synchronous checkpointing part and starts the
> asynchronous part by creating and submitting an {{AsyncCheckpointRunnable}} to
> an {{Executor}}. This runnable is also registered at the closeable registry.
> If the {{Task}} now fails before the {{AsyncCheckpointRunnable}} has
> completed, the runnable will be closed because it is registered in the
> closeable registry. The closing operation will discard all state handles and
> cancel all runnable state futures. However, it will not stop the runnable from
> sending an acknowledgement message to the {{CheckpointCoordinator}}.
> If this message completes the pending checkpoint, then this checkpoint will
> be transformed into a {{CompletedCheckpoint}} which is faulty (some of the
> data has already been deleted). Depending on Flink's configuration, this will
> discard older completed checkpoints and thus we will have state data loss.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
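
The race described in the issue can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Flink's code: `CheckpointRaceSketch`, `Coordinator`, `AsyncCheckpoint`, `Status` and the `guardAck` flag are all hypothetical names, and the compare-and-set guard is one possible way to close the gap, not necessarily the fix that was applied for FLINK-5667. The sketch replays the problematic ordering deterministically (close first, then run) instead of using real threads.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CheckpointRaceSketch {

    /** Hypothetical stand-in for the coordinator; it only counts acknowledgements. */
    static final class Coordinator {
        int acks = 0;

        void acknowledge(long checkpointId) {
            acks++;
        }
    }

    enum Status { RUNNING, COMPLETED, DISCARDED }

    /** Hypothetical asynchronous checkpoint part (not Flink's AsyncCheckpointRunnable). */
    static final class AsyncCheckpoint implements Runnable, AutoCloseable {
        private final AtomicReference<Status> status =
                new AtomicReference<>(Status.RUNNING);
        private final Coordinator coordinator;
        private final long checkpointId;
        private final boolean guardAck;
        boolean stateDiscarded = false;

        AsyncCheckpoint(Coordinator coordinator, long checkpointId, boolean guardAck) {
            this.coordinator = coordinator;
            this.checkpointId = checkpointId;
            this.guardAck = guardAck;
        }

        @Override
        public void run() {
            // The state would be written here. Without the guard, the
            // acknowledgement is sent even if close() already discarded it.
            if (!guardAck || status.compareAndSet(Status.RUNNING, Status.COMPLETED)) {
                coordinator.acknowledge(checkpointId);
            }
        }

        @Override
        public void close() {
            // Called via the closeable registry when the task fails. State is
            // only discarded if the checkpoint has not completed yet.
            if (status.compareAndSet(Status.RUNNING, Status.DISCARDED)) {
                stateDiscarded = true;
            }
        }
    }

    /** Replays "task fails, then the runnable finishes" and returns the ack count. */
    static int acksAfterCloseThenRun(boolean guardAck) {
        Coordinator coordinator = new Coordinator();
        AsyncCheckpoint checkpoint = new AsyncCheckpoint(coordinator, 1L, guardAck);
        checkpoint.close(); // task failure closes the registered runnable
        checkpoint.run();   // the runnable still reaches its final step
        return coordinator.acks;
    }

    /** Replays the normal ordering and reports whether close() discarded state. */
    static boolean discardedAfterRunThenClose() {
        AsyncCheckpoint checkpoint = new AsyncCheckpoint(new Coordinator(), 1L, true);
        checkpoint.run();
        checkpoint.close();
        return checkpoint.stateDiscarded;
    }

    public static void main(String[] args) {
        System.out.println("unguarded acks: " + acksAfterCloseThenRun(false));
        System.out.println("guarded acks:   " + acksAfterCloseThenRun(true));
    }
}
```

In the unguarded variant the coordinator receives an acknowledgement for a checkpoint whose state has already been discarded, which is the faulty {{CompletedCheckpoint}} case from the description. With the guard, whichever of `run()` and `close()` wins the compare-and-set decides the outcome, so a checkpoint is either acknowledged or discarded, never both.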