[ 
https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571889#comment-16571889
 ] 

vinoyang commented on FLINK-10074:
----------------------------------

Hi [~thw] and [~till.rohrmann]:

I think it depends on what we want to achieve. In my view, the main goal is to 
prevent a job from going a long time without completing a checkpoint, because 
in that case, even if we let the job keep running, once it fails for some other 
reason the recovery will have to roll back a long way and reprocess a lot of 
data. "A long time" here implies consecutiveness, so I would let the user set a 
threshold for how many consecutive checkpoint failures are tolerated; once the 
threshold is reached, we fail the job so that it recovers. As long as the 
threshold has not been reached, we clear the counter whenever a checkpoint 
completes successfully. One more thing worth thinking about: what if, even 
after we fail and restart the job, it still cannot complete a checkpoint? The 
scenario I have in mind is that HDFS is used as the state backend, but it is 
down and cannot be recovered within a short time.
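To make the counting idea concrete, here is a minimal standalone sketch. It is 
not Flink's actual CheckpointCoordinator code; the class and method names are 
made up purely for illustration of the counter-with-reset logic I describe 
above:

// Standalone sketch (hypothetical names, not Flink internals): count
// consecutive checkpoint failures and signal that the job should be failed
// once a user-configured threshold is reached.
public class ConsecutiveFailureTracker {

    private final int tolerableConsecutiveFailures; // user-configured threshold
    private int consecutiveFailures = 0;

    public ConsecutiveFailureTracker(int tolerableConsecutiveFailures) {
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    /** Called when a checkpoint completes successfully: clear the counter. */
    public synchronized void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Called when a checkpoint fails. Returns true if the number of
     * consecutive failures has reached the threshold, i.e. the job should
     * be failed (and then recovered by the restart strategy).
     */
    public synchronized boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= tolerableConsecutiveFailures;
    }
}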

Of course, we could also use the longest time between two successful 
checkpoints instead. Since the checkpoint interval is usually fixed, if 
checkpoints keep failing for the same reason, the failures happen at roughly 
equal intervals, so this measure can be seen as another way of expressing "how 
many consecutive checkpoints have failed", and it also captures the notion of 
consecutiveness.
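A similarly minimal sketch of the time-based variant (again with hypothetical 
names, not Flink's real API) could look like this:

// Standalone sketch of the time-based variant: track the time since the last
// successful checkpoint and signal job failure once it exceeds a configured
// maximum interval.
public class CheckpointStalenessTracker {

    private final long maxMillisBetweenSuccessfulCheckpoints; // user-configured
    private long lastSuccessTimestamp;

    public CheckpointStalenessTracker(long maxMillisBetweenSuccessfulCheckpoints) {
        this.maxMillisBetweenSuccessfulCheckpoints = maxMillisBetweenSuccessfulCheckpoints;
        this.lastSuccessTimestamp = System.currentTimeMillis();
    }

    /** Called when a checkpoint completes successfully. */
    public synchronized void onCheckpointSuccess() {
        lastSuccessTimestamp = System.currentTimeMillis();
    }

    /**
     * Called when a checkpoint fails. Returns true if no checkpoint has
     * succeeded within the configured interval, i.e. the job should fail.
     */
    public synchronized boolean onCheckpointFailure() {
        return System.currentTimeMillis() - lastSuccessTimestamp
                >= maxMillisBetweenSuccessfulCheckpoints;
    }
}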

But from an implementation perspective, the two approaches do lead to 
different implementations.

 

 

> Allowable number of checkpoint failures 
> ----------------------------------------
>
>                 Key: FLINK-10074
>                 URL: https://issues.apache.org/jira/browse/FLINK-10074
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Thomas Weise
>            Assignee: vinoyang
>            Priority: Major
>
> For intermittent checkpoint failures it is desirable to have a mechanism to 
> avoid restarts. If, for example, a transient S3 error prevents checkpoint 
> completion, the next checkpoint may very well succeed. The user may wish to 
> not incur the expense of restart under such scenario and this could be 
> expressed with a failure threshold (number of subsequent checkpoint 
> failures), possibly combined with a list of exceptions to tolerate.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
