Hi Lakshmi,

Your understanding of *CheckpointConfig#setFailOnCheckpointingErrors(false)* is correct. If this is set to false, the task will only decline the checkpoint and continue running.
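Since the job keeps running with that flag set to false, any limit of the form "tolerate at most X failures" would have to be tracked outside of Flink for now. Below is a rough sketch of that bookkeeping in plain Java. Please treat it as an illustration only: the class `CheckpointFailureTracker` and its methods are hypothetical names, not a Flink API, and how the success/failure events reach it (for example from whatever checkpoint monitoring you already have) is an assumption left open here. It counts consecutive failures, on the assumption that a completed checkpoint means the earlier failures were transient.

```java
// Hypothetical helper, NOT part of Flink: tracks consecutive checkpoint
// failures so a caller can decide when failures stop looking transient.
final class CheckpointFailureTracker {
    private final int maxConsecutiveFailures; // the "X" in "<= X failures"
    private int consecutiveFailures = 0;

    CheckpointFailureTracker(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 0) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 0");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** A completed checkpoint resets the count: earlier failures were transient. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records one failed checkpoint.
     * Returns true once more than X consecutive failures have been seen,
     * i.e. when the caller should stop tolerating them.
     */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > maxConsecutiveFailures;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With `new CheckpointFailureTracker(2)`, the first two failures in a row are tolerated and the third one is reported as exceeding the limit. Counting total failures instead of consecutive ones would also be possible, but then a long-running job would eventually hit the limit even if every failure was transient.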
I think allowing a configurable number of tolerated checkpoint failures is a good idea. Flink currently only lets you choose whether the task fails when a checkpoint fails; configuring a threshold is not supported. You can open an issue in JIRA to request this feature.

Thanks, vino.

2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <l...@lyft.com>:

> Hi,
>
> We are running into intermittent checkpoint failures while checkpointing to
> S3.
>
> As described in this thread -
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html,
> we see that the job restarts when it encounters such a failure.
>
> As mentioned in the thread, I see that there is an option to not fail tasks
> on checkpoint errors -
> *CheckpointConfig#setFailOnCheckpointingErrors(false)*. However, this
> would mean that the job would continue running even in the case of
> persistent checkpoint failures. Is my understanding here correct?
>
> If above is true, then is there a way to configure an allowable number of
> checkpoint failures? i.e. something along the lines of "Don't fail the job
> if there are <=X number of checkpoint failures", so that *only* transient
> failures can be ignored.
>
> Thanks,
> Lakshmi