[GitHub] [flink] StefanRRichter commented on a change in pull request #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

GitBox Fri, 17 May 2019 04:36:49 -0700

StefanRRichter commented on a change in pull request #8322: [FLINK-12364] 
Introduce a CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#discussion_r285086847


 ##########
 File path: flink-end-to-end-tests/flink-streaming-kafka-test-base/src/main/java/org/apache/flink/streaming/kafka/test/base/KafkaExampleUtil.java
 ##########
 @@ -45,6 +45,7 @@ public static StreamExecutionEnvironment prepareExecutionEnv(ParameterTool param
                env.getConfig().disableSysoutLogging();
                env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
                env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
+               env.getCheckpointConfig().setTolerableCheckpointFailureNumber(Integer.MAX_VALUE);
 
 Review comment:
   After a quick comparison with master and 1.7, I think there is at least
one problem, with the `DECLINED` case. Previously it was treated like, e.g.,
the subsumed case and never led to a job failure. That also makes sense,
because this cause only exists because the JM can already start triggering
checkpoints before all tasks are running. This is currently (unfortunately)
to be expected and should not lead to a failover, because it can happen
regularly at the beginning of a job. Wdyt? If you agree, let's also
double-check the other cases one more time.
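
   To make the suggestion concrete, here is a minimal sketch of a failure
counter that skips declined and subsumed checkpoints. It is not the code in
this PR: the class name, the `Reason` enum (including the `EXPIRED` and
`IO_ERROR` values) and both method names are hypothetical and only
illustrate the idea that some causes should never count toward the
tolerable failure number.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; none of these names are taken from Flink or from this PR.
class CheckpointFailureCounterSketch {

    // Illustrative causes. Only DECLINED and SUBSUMED are mentioned in the
    // review comment; EXPIRED and IO_ERROR are assumed for the example.
    enum Reason { DECLINED, SUBSUMED, EXPIRED, IO_ERROR }

    // Causes that are expected during normal operation and must not be counted.
    private static final Set<Reason> IGNORED =
            EnumSet.of(Reason.DECLINED, Reason.SUBSUMED);

    private final int tolerableFailures;
    private int continuousFailures;

    CheckpointFailureCounterSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    // Returns true if the job should fail over because of this failure.
    boolean onCheckpointFailure(Reason reason) {
        if (IGNORED.contains(reason)) {
            // E.g. the JM triggered a checkpoint before all tasks were running.
            return false;
        }
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    // A completed checkpoint resets the count of continuous failures.
    void onCheckpointSuccess() {
        continuousFailures = 0;
    }
}
```

   With such a rule, a declined checkpoint at job start would not trigger a
failover regardless of the configured tolerable failure number.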

