[ https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571889#comment-16571889 ]
vinoyang commented on FLINK-10074: ---------------------------------- Hi [~thw] and [~till.rohrmann]: I think it depends on our purpose. Personally, I think what we mainly want to prevent is a job going a long time without completing a checkpoint, because in that case, even if we let the job keep running, once it fails for some other reason, recovery will roll back to an old checkpoint and reprocess a long stretch of data. Here "a long time" should reflect a kind of consecutiveness, so I think we could let the user set a threshold for how many consecutive checkpoint failures are tolerated before we fail the job so it can recover. As long as the threshold has not been reached, we reset the counter whenever a checkpoint completes successfully. But there is one more thing worth thinking about: if we fail and restart the job and it still cannot complete checkpoints, what should we do then? The scenario assumed here is that HDFS was chosen as the state backend, but it is down and cannot be used for a short period. Alternatively, we could bound the longest time between two successful checkpoints. Because the checkpoint interval is usually fixed, and if checkpoints keep failing for one reason the intervals between failures are almost equal, this can be approximated as another expression of "how many consecutive checkpoints failed", and it likewise reflects consecutiveness. From an implementation perspective, however, the two options do lead to different implementations.

> Allowable number of checkpoint failures
> ----------------------------------------
>
>                 Key: FLINK-10074
>                 URL: https://issues.apache.org/jira/browse/FLINK-10074
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Thomas Weise
>            Assignee: vinoyang
>            Priority: Major
>
> For intermittent checkpoint failures it is desirable to have a mechanism to
> avoid restarts. If, for example, a transient S3 error prevents checkpoint
> completion, the next checkpoint may very well succeed.
> The user may wish to not incur the expense of restart under such scenario
> and this could be expressed with a failure threshold (number of subsequent
> checkpoint failures), possibly combined with a list of exceptions to
> tolerate.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
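The consecutive-failure counter discussed in the comment above could be sketched roughly as follows. This is only an illustration of the proposed mechanism, not Flink's actual implementation; the class and method names are hypothetical. The counter increments on each failed checkpoint, resets to zero when a checkpoint succeeds, and signals that the job should be failed once the user-configured threshold of consecutive failures is reached.

```java
// Hypothetical sketch of the proposed consecutive checkpoint-failure
// tracking; names are illustrative, not Flink API.
public class CheckpointFailureTracker {

    // User-configured number of consecutive failures to tolerate.
    private final int tolerableConsecutiveFailures;

    // Current streak of failed checkpoints; reset on any success.
    private int consecutiveFailures = 0;

    public CheckpointFailureTracker(int tolerableConsecutiveFailures) {
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    /** Called when a checkpoint completes successfully; clears the streak. */
    public void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Called when a checkpoint fails.
     *
     * @return true if the consecutive-failure threshold has been reached
     *         and the job should be failed so it can recover.
     */
    public boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= tolerableConsecutiveFailures;
    }
}
```

With a threshold of 3, for example, two failures followed by a success reset the streak, so the job is only failed after three failures in a row. The time-based alternative (longest interval between two successful checkpoints) would track a timestamp instead of a counter, which is why the two options lead to different implementations.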