Setting an allowable number of checkpoint failures

Lakshmi Gururaja Rao Fri, 03 Aug 2018 13:29:22 -0700

Hi,

We are running into intermittent checkpoint failures while checkpointing to
S3.


As described in this thread:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html
we see that the job restarts when it encounters such a failure.

As mentioned in the thread, I see that there is an option to not fail tasks
on checkpoint errors:
CheckpointConfig#setFailOnCheckpointingErrors(false). However, this
would mean that the job would continue running even in the case of
persistent checkpoint failures. Is my understanding here correct?

If the above is true, is there a way to configure an allowable number of
checkpoint failures? That is, something along the lines of "Don't fail the
job if there are <=X checkpoint failures", so that only transient
failures are ignored.
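
To make the question concrete, here is a rough sketch of the behaviour I
have in mind. CheckpointFailureBudget and its methods are hypothetical names
made up for illustration; this is not a Flink API. I am also assuming that
"X" counts consecutive failures, so that a successful checkpoint resets the
count.

```java
// Hypothetical helper, not part of Flink: tracks consecutive checkpoint
// failures and reports when a configured budget has been exceeded.
final class CheckpointFailureBudget {
    private final int maxConsecutiveFailures; // the "X" from the question
    private int consecutiveFailures;

    CheckpointFailureBudget(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 0) {
            throw new IllegalArgumentException("budget must be >= 0");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Records one failed checkpoint. Returns true once more than X
    // failures have occurred in a row, i.e. when the job should fail.
    boolean onCheckpointFailed() {
        consecutiveFailures++;
        return consecutiveFailures > maxConsecutiveFailures;
    }

    // A completed checkpoint means earlier failures were transient
    // (assumption), so the count starts over.
    void onCheckpointCompleted() {
        consecutiveFailures = 0;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a budget of 2, two failed checkpoints in a row would be tolerated and
a third would fail the job, while any successful checkpoint in between
would reset the count.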

Thanks,
Lakshmi
