[
https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571890#comment-16571890
]
vinoyang commented on FLINK-10074:
----------------------------------
Hi [~thw] and [~till.rohrmann]:
I think it depends on our purpose. In my opinion, what we mainly want to prevent
is a job going for a long time without completing any checkpoint, because in
that case, even if we let the job keep running, once it fails for some other
reason, recovery will have to roll back and reprocess a long stretch of data.
Here "a long time" implies a run of consecutive failures, so I would let the
user set a threshold for how many consecutive checkpoint failures are tolerated
before we fail the job and let it recover. As long as that threshold has not
been reached, we clear the counter as soon as a checkpoint completes
successfully.
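Just to make the counter idea concrete, something roughly like the following (a
sketch only; the class and method names are hypothetical, not the actual
CheckpointCoordinator API):
{code:java}
/**
 * Sketch of a consecutive-failure counter (hypothetical names, not real Flink API).
 * The coordinator would call onCheckpointSuccess() / onCheckpointFailure() and
 * fail the job when onCheckpointFailure() returns true.
 */
public class CheckpointFailureTracker {

    private final int tolerableConsecutiveFailures;
    private int consecutiveFailures = 0;

    public CheckpointFailureTracker(int tolerableConsecutiveFailures) {
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    /** A checkpoint completed successfully: clear the counter. */
    public synchronized void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /** A checkpoint failed: returns true if the job should now be failed. */
    public synchronized boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableConsecutiveFailures;
    }
}
{code}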
But there is one more thing worth thinking about: what should we do if, even
after we fail and restart the job, it still cannot complete a checkpoint? The
scenario I have in mind is that HDFS is used as the state backend, it has
failed, and it will not become usable again within a short time.
Of course, we could instead bound the longest time between two successful
checkpoints. Since the checkpoint interval is usually fixed, and a checkpoint
that keeps failing for the same reason tends to fail at roughly regular
intervals, this is approximately another way of expressing how many checkpoints
have failed consecutively, and it also reflects that "coherence".
From an implementation perspective, however, the two options do lead to two
different implementations.
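For comparison, the time-based variant would look roughly like this (again only
a sketch; maxIntervalMillis and the method names are made up for illustration):
{code:java}
/**
 * Sketch of the time-based variant (hypothetical names, not real Flink API).
 * The job is failed once no checkpoint has succeeded for longer than the
 * configured maximum interval.
 */
public class CheckpointIntervalTracker {

    private final long maxIntervalMillis;
    private long lastSuccessTimestamp;

    public CheckpointIntervalTracker(long maxIntervalMillis) {
        this.maxIntervalMillis = maxIntervalMillis;
        this.lastSuccessTimestamp = System.currentTimeMillis();
    }

    /** A checkpoint completed successfully: remember when. */
    public synchronized void onCheckpointSuccess() {
        lastSuccessTimestamp = System.currentTimeMillis();
    }

    /** Checked on each checkpoint failure (or periodically): fail the job if true. */
    public synchronized boolean shouldFailJob() {
        return System.currentTimeMillis() - lastSuccessTimestamp > maxIntervalMillis;
    }
}
{code}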
> Allowable number of checkpoint failures
> ----------------------------------------
>
> Key: FLINK-10074
> URL: https://issues.apache.org/jira/browse/FLINK-10074
> Project: Flink
> Issue Type: Improvement
> Components: State Backends, Checkpointing
> Reporter: Thomas Weise
> Assignee: vinoyang
> Priority: Major
>
> For intermittent checkpoint failures it is desirable to have a mechanism to
> avoid restarts. If, for example, a transient S3 error prevents checkpoint
> completion, the next checkpoint may very well succeed. The user may wish to
> not incur the expense of restart under such scenario and this could be
> expressed with a failure threshold (number of subsequent checkpoint
> failures), possibly combined with a list of exceptions to tolerate.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)