Hi Maxim,

I am not keen on the potential risk of tasks getting stuck in STARTING. We auto-scale our jobs, so there might be nobody around to notice and correct the problem in time.

How about keeping initial_interval_secs and just changing its meaning to a grace period, so that health checks are still triggered but errors are ignored during this interval? initial_interval_secs would then be a user-configurable upper bound on the time by which a job is meant to be working. It can even be set rather high, because it won't affect update performance.

What do you think?

Best Regards,
Stephan

________________________________________
From: Maxim Khutornenko <[email protected]>
Sent: Tuesday, May 5, 2015 10:24 PM
To: [email protected]
Subject: Health Checks for Updates design review

Hi,

I have put together a design proposal for improving health-enabled job update performance. Please, review and leave your comments:
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit

Thanks,
Maxim
