Re: Why are checkpoint failures so serious?

Till Rohrmann Wed, 14 Feb 2018 01:21:07 -0800

Hi Ron,

you should be able to turn off the task failure on a checkpoint failure by
calling `setFailTaskOnCheckpointError(false)` on the job's `ExecutionConfig`.
With this setting, a checkpoint failure should only fail the distributed
checkpoint instead of failing the task, so the job keeps running.
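
A minimal sketch of where that call would go, assuming a Flink version whose
`ExecutionConfig` exposes this setter (it is an instance method, reached via
`env.getConfig()`). The class name, the 60 second checkpoint interval, the
placeholder pipeline and the job name are made up for illustration:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job class; only the setFailTaskOnCheckpointError call comes from this thread.
public class CheckpointTolerantJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative interval: trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // Do not fail the task when a checkpoint fails; the failed
        // checkpoint is declined and the job keeps running.
        env.getConfig().setFailTaskOnCheckpointError(false);

        // Placeholder pipeline so the sketch is a complete job.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-tolerant-job");
    }
}
```

Note that this only stops a failed checkpoint from failing the task. It does
not retry the failed checkpoint; the next one is attempted at the next
interval.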


Cheers,
Till

On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <[email protected]> wrote:

> What would it take to be a little more flexible in handling checkpoint
> failures?
>
> Right now I have a team that’s checkpointing into S3, via the
> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
> The failures are transient, though, and a retry would likely work.
>
> However, when they fail, their job exits and restarts from the last
> checkpoint. That’s fine, but I’d rather it tried again before failing, and
> even after failing just keep running and do another checkpoint. Maybe this
> is something that should be configurable - # of retries, failure strategy, …
>
> Ron
