What would it take to be a little more flexible in handling checkpoint failures?
Right now I have a team that’s checkpointing into S3, via the FsStateBackend
and an appropriate URL. Sometimes these checkpoints fail. The failures are
transient, though, and a retry would likely work.
However, when a checkpoint fails, the job exits and restarts from the last
checkpoint. That’s fine, but I’d rather it retried the checkpoint before
failing, and even after the retries are exhausted just kept running and
attempted the next checkpoint. Maybe this is something that should be
configurable: number of retries, failure strategy, …
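
To make the idea concrete, here is a rough sketch of the behaviour I have in mind, written in plain Java with no Flink dependency. Everything in it is hypothetical and only for illustration: the class CheckpointRetrySketch, the FailureStrategy enum, and the runWithRetries helper are names I made up, not existing Flink API, and a real implementation would live inside the checkpoint coordinator rather than in user code.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch only: none of these names exist in Flink.
public class CheckpointRetrySketch {

    /** What to do once every attempt at a single checkpoint has failed. */
    public enum FailureStrategy {
        FAIL_JOB,           // current behaviour: propagate the error, job restarts
        SKIP_AND_CONTINUE   // desired option: drop this checkpoint, keep running
    }

    /**
     * Tries one checkpoint up to (1 + maxRetries) times.
     *
     * @return true if an attempt succeeded, false if the checkpoint was skipped
     * @throws Exception the last failure, when the strategy is FAIL_JOB
     */
    public static boolean runWithRetries(Callable<Void> checkpoint,
                                         int maxRetries,
                                         FailureStrategy strategy) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                checkpoint.call();
                return true;            // checkpoint completed
            } catch (Exception e) {
                last = e;               // assume transient, try again
            }
        }
        if (strategy == FailureStrategy.FAIL_JOB) {
            throw last;                 // give up and let the job fail
        }
        return false;                   // skip; the next checkpoint runs on schedule
    }
}
```

With maxRetries set to 0 and FAIL_JOB this matches what happens today; the two knobs together cover the "number of retries" and "failure strategy" settings mentioned above. A real version would probably also want a delay between attempts, which I left out to keep the sketch short.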