[
https://issues.apache.org/jira/browse/FLINK-13497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898775#comment-16898775
]
Biao Liu commented on FLINK-13497:
----------------------------------
The threading model of {{CheckpointCoordinator}} seems to be a bit messy. Two
necessary synchronizations are missing here.
# Synchronization between different checkpoints. This is why other checkpoints
can still succeed after {{CheckpointFailureManager}} has already decided to
{{failGlobal}}. We might need to rethink the threading model here. [~yunta]
proposed a workaround.
# Synchronization between {{CheckpointCoordinator}} and {{ExecutionGraph}}.
This is caused by {{failGlobal}} being asynchronous. So I suggest a workaround:
cancel tasks by {{ExecutionAttemptID}} instead. That is a kind of weak
synchronization.
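
To make the two points concrete, here is a rough, self-contained sketch. It is
not Flink code: {{CheckpointSyncSketch}}, {{Coordinator}}, {{Task}} and all of
their methods are hypothetical stand-ins, and a {{UUID}} plays the role of
{{ExecutionAttemptID}}. It only illustrates the idea of (1) making the failure
decision and checkpoint completion mutually exclusive, and (2) guarding a
cancel call with the attempt ID so that a stale request cannot hit a newer
attempt.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch only; none of these classes are Flink's real implementation.
final class CheckpointSyncSketch {

    /** Stand-in for a coordinator that serializes all checkpoint state changes. */
    static final class Coordinator {
        private final Object lock = new Object();
        private final Map<Long, String> pending = new HashMap<>();
        private boolean jobFailed = false;

        /** Registers a pending checkpoint unless a failure was already decided. */
        void trigger(long checkpointId) {
            synchronized (lock) {
                if (!jobFailed) {
                    pending.put(checkpointId, "PENDING");
                }
            }
        }

        /** Returns true only if the checkpoint was pending and no failure was decided. */
        boolean complete(long checkpointId) {
            synchronized (lock) {
                if (jobFailed) {
                    return false;
                }
                return pending.remove(checkpointId) != null;
            }
        }

        /** Failure decision: aborts every pending checkpoint under the same lock. */
        int failJob() {
            synchronized (lock) {
                jobFailed = true;
                int aborted = pending.size();
                pending.clear();
                return aborted;
            }
        }
    }

    /** Stand-in for a task whose current attempt is identified by an attempt ID. */
    static final class Task {
        private UUID currentAttempt = UUID.randomUUID();
        private boolean canceled = false;

        synchronized UUID currentAttempt() {
            return currentAttempt;
        }

        /** Simulates a restart: a new attempt replaces the old one. */
        synchronized void restart() {
            currentAttempt = UUID.randomUUID();
            canceled = false;
        }

        /** Cancels only if the caller's attempt ID still matches the current attempt. */
        synchronized boolean cancelIfAttempt(UUID expected) {
            if (!currentAttempt.equals(expected)) {
                return false; // stale request, aimed at an older attempt
            }
            canceled = true;
            return true;
        }

        synchronized boolean isCanceled() {
            return canceled;
        }
    }
}
```

In this sketch a checkpoint that races with {{failJob()}} either completes
before the decision or is aborted by it, never after. The attempt-ID check is
"weak" in the sense that it does not block anything; it only makes a late
cancel request harmless.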
> Checkpoints can complete after CheckpointFailureManager fails job
> -----------------------------------------------------------------
>
> Key: FLINK-13497
> URL: https://issues.apache.org/jira/browse/FLINK-13497
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Till Rohrmann
> Priority: Critical
> Fix For: 1.9.0
>
>
> I think that with FLINK-12364 we introduced an inconsistency with respect to
> job termination and checkpointing. In FLINK-9900 it was discovered that checkpoints
> can complete even after the {{CheckpointFailureManager}} decided to fail a
> job. I think the expected behaviour should be that we fail all pending
> checkpoints once the {{CheckpointFailureManager}} decides to fail the job.