[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889447#comment-15889447 ]
ASF GitHub Bot commented on FLINK-4810:
---------------------------------------

Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103605788

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
         catch (Throwable t) {
             int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
             LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
    +        if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    --- End diff --

    Here the counter records the total number of failed attempts. Since a streaming job is intended to run for a long time, the number of failed attempts will eventually exceed the limit. We should use a different counter here, one that is reset once a pending checkpoint completes successfully.

> Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-4810
>                 URL: https://issues.apache.org/jira/browse/FLINK-4810
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>            Reporter: Stephan Ewen
>
> The Checkpoint coordinator should track the number of consecutive
> unsuccessful checkpoints.
> If more than {{n}} (configured value) checkpoints fail in a row, it should
> call {{fail()}} on the execution graph to trigger a recovery.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
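
A minimal sketch of the counter suggested in the review comment above: it counts consecutive failures and is reset whenever a checkpoint completes. The class and method names (ConsecutiveFailureTracker, onCheckpointFailure, onCheckpointSuccess) are hypothetical and are not part of Flink or of the pull request; the sketch only illustrates the reset-on-success idea.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not Flink API): tracks failures "in a row"
 * rather than the total number of failures over the job's lifetime.
 */
final class ConsecutiveFailureTracker {

    // Assumed to correspond to the configured "n" from the issue description.
    private final int maxConsecutiveFailures;

    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    ConsecutiveFailureTracker(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /**
     * Records one failed checkpoint.
     *
     * @return true if more than the configured number of checkpoints
     *         have now failed in a row
     */
    boolean onCheckpointFailure() {
        return consecutiveFailures.incrementAndGet() > maxConsecutiveFailures;
    }

    /** Resets the streak once a pending checkpoint completes successfully. */
    void onCheckpointSuccess() {
        consecutiveFailures.set(0);
    }

    int getConsecutiveFailures() {
        return consecutiveFailures.get();
    }
}
```

With a tracker like this, the coordinator would call onCheckpointSuccess() where a pending checkpoint is acknowledged as complete, and would fail the execution graph only when onCheckpointFailure() returns true. A long-running job with occasional, isolated failures then never reaches the limit.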