Why are checkpoint failures so serious?

Ron Crocker Tue, 13 Feb 2018 14:42:07 -0800

What would it take to be a little more flexible in handling checkpoint 
failures?


Right now I have a team that’s checkpointing into S3, via the FsStateBackend 
and an appropriate URL. Sometimes these checkpoints fail. The failures are 
transient, though, and a retry would likely work. 

However, when a checkpoint fails, their job exits and restarts from the last 
successful checkpoint. That’s fine, but I’d rather it tried again before 
failing, and even after failing just kept running and took another checkpoint. 
Maybe this is something that should be configurable - # of retries, failure 
strategy, …
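
Roughly the behavior I have in mind, as a small sketch. This is not Flink 
code: the class names and the two settings (max_retries, tolerable_failures) 
are made up for illustration, and I'm assuming a failed write shows up as an 
I/O error.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckpointPolicy:
    """Hypothetical knobs; these are not existing Flink settings."""
    max_retries: int = 2         # extra attempts for a single checkpoint
    tolerable_failures: int = 3  # consecutive failed checkpoints the job tolerates


class CheckpointCoordinatorSketch:
    """Illustrative only: retry a checkpoint, then skip it instead of failing the job."""

    def __init__(self, policy: CheckpointPolicy,
                 write_checkpoint: Callable[[int], None]):
        self.policy = policy
        # Assumed to raise OSError when the write (e.g. to S3) fails.
        self.write_checkpoint = write_checkpoint
        self.consecutive_failures = 0

    def trigger(self, checkpoint_id: int) -> str:
        """Return 'completed', 'skipped', or 'job_failed'."""
        for _ in range(1 + self.policy.max_retries):
            try:
                self.write_checkpoint(checkpoint_id)
            except OSError:
                continue  # treat as transient and try again
            self.consecutive_failures = 0
            return "completed"

        # Every attempt failed: keep the job running unless this keeps happening.
        self.consecutive_failures += 1
        if self.consecutive_failures > self.policy.tolerable_failures:
            return "job_failed"
        return "skipped"
```

With max_retries=2, a write that fails twice and then succeeds still yields a 
completed checkpoint. The job only fails once more than tolerable_failures 
checkpoints in a row have exhausted their retries.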

Ron
