Hi Ron,

You should be able to stop a checkpoint failure from failing the task by
setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`. With this
setting, a checkpoint failure should only fail the distributed checkpoint
itself, and the task keeps running.
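
In case it helps, here is a minimal sketch of where that call would go. It
assumes a Flink version that ships this setter on `ExecutionConfig`; the
class name and the 60-second checkpoint interval are placeholders I made up,
not values from your job, and the job definition itself is left out.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name, for illustration only.
public class TolerateCheckpointFailures {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed interval: trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // Do not fail the task when a checkpoint fails.
        env.getConfig().setFailTaskOnCheckpointError(false);

        // Define sources, operators and sinks here, then call env.execute(...).
    }
}
```

Note that this only changes what happens after a checkpoint has failed; it
does not add retries for the failed checkpoint.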


On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker wrote:

> What would it take to be a little more flexible in handling checkpoint
> failures?
> Right now I have a team that’s checkpointing into S3, via the
> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
> They’re transient, though, and a retry would likely work.
> However, when they fail, their job exits and restarts from the last
> checkpoint. That’s fine, but I’d rather it tried again before failing, and
> even after failing just keep running and do another checkpoint. Maybe this
> is something that should be configurable - # of retries, failure strategy, …
> Ron

