Hi Ron,

Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. Just making sure you don't go looking for this and are surprised if you don't find it.
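
For reference, here is a minimal sketch of how the setting Till mentions below might be applied in a job, assuming Flink 1.5 or later is on the classpath. The class name, job name, 60-second checkpoint interval, and placeholder pipeline are illustrative assumptions and not part of the original suggestion.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job class, used only to show where the setting goes.
public class CheckpointFailureTolerantJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable periodic checkpoints (60 s is an arbitrary illustrative interval).
        env.enableCheckpointing(60_000L);

        // Flink 1.5+: do not fail the task when a checkpoint fails;
        // the failed checkpoint is declined and the job keeps running.
        env.getConfig().setFailTaskOnCheckpointError(false);

        // Placeholder pipeline so the job has something to execute.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-failure-tolerant-job");
    }
}
```

Note that this only stops a failed checkpoint from failing the task; it does not retry the failed checkpoint. The next attempt happens at the next regular checkpoint interval.
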
Best,
Aljoscha

> On 14. Feb 2018, at 10:20, Till Rohrmann <trohrm...@apache.org> wrote:
>
> Hi Ron,
>
> you should be able to turn off the Task failure in case of a checkpoint
> failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`.
> This setting should change the behavior such that checkpoint failures will
> simply fail the distributed checkpoint.
>
> Cheers,
> Till
>
> On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <rcroc...@newrelic.com> wrote:
>
>> What would it take to be a little more flexible in handling checkpoint
>> failures?
>>
>> Right now I have a team that’s checkpointing into S3, via the
>> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
>> They’re transient, though, and a retry would likely work.
>>
>> However, when they fail, their job exits and restarts from the last
>> checkpoint. That’s fine, but I’d rather it tried again before failing, and
>> even after failing just keep running and do another checkpoint. Maybe this
>> is something that should be configurable - # of retries, failure strategy, …
>>
>> Ron