Re: Health Checks for Updates design review
On 2 July 2015 at 19:16, Maxim Khutornenko ma...@apache.org wrote:
> Hi Brian,
>
> This feature is on a back burner for now. It’s unlikely that we'll make any progress within the next few months. The design is mostly hashed out at this point, so if you feel you could contribute or, even better, take it over completely, it would be absolutely awesome!

Thanks for the update. Unfortunately, that whole feature is a bit too big for me to bite off anytime soon.

Yours,
Brian

> Thanks,
> Maxim
>
> On Thu, Jul 2, 2015 at 9:27 AM, Brian Brazil brian.bra...@boxever.com wrote:
>> On 5 May 2015 at 21:24, Maxim Khutornenko ma...@apache.org wrote:
>>> Hi,
>>>
>>> I have put together a design proposal for improving health-enabled job update performance. Please review and leave your comments:
>>> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit
>>
>> Hi,
>>
>> I'm looking to move some of our production jobs to Aurora, and this design is of interest as I want to allow slack in startup and for a few failures without greatly increasing update times.
>>
>> Is there a rough ETA for this, or maybe some smaller tasks I could help with? 1224 sounds easy for example, and I've already worked with that code.
>>
>> Thanks,
>> Brian
>>
>>> Thanks,
>>> Maxim
Re: Health Checks for Updates design review
Hi Brian,

This feature is on a back burner for now. It’s unlikely that we'll make any progress within the next few months. The design is mostly hashed out at this point, so if you feel you could contribute or, even better, take it over completely, it would be absolutely awesome!

Thanks,
Maxim

On Thu, Jul 2, 2015 at 9:27 AM, Brian Brazil brian.bra...@boxever.com wrote:
> On 5 May 2015 at 21:24, Maxim Khutornenko ma...@apache.org wrote:
>> Hi,
>>
>> I have put together a design proposal for improving health-enabled job update performance. Please review and leave your comments:
>> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit
>
> Hi,
>
> I'm looking to move some of our production jobs to Aurora, and this design is of interest as I want to allow slack in startup and for a few failures without greatly increasing update times.
>
> Is there a rough ETA for this, or maybe some smaller tasks I could help with? 1224 sounds easy for example, and I've already worked with that code.
>
> Thanks,
> Brian
>
>> Thanks,
>> Maxim
Re: Health Checks for Updates design review
On 5 May 2015 at 21:24, Maxim Khutornenko ma...@apache.org wrote:
> Hi,
>
> I have put together a design proposal for improving health-enabled job update performance. Please review and leave your comments:
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit

Hi,

I'm looking to move some of our production jobs to Aurora, and this design is of interest as I want to allow slack in startup and for a few failures without greatly increasing update times.

Is there a rough ETA for this, or maybe some smaller tasks I could help with? 1224 sounds easy for example, and I've already worked with that code.

Thanks,
Brian

> Thanks,
> Maxim
Re: Health Checks for Updates design review
Hi Maxim,

I am not keen on the potential risk of tasks getting stuck in STARTING. We perform auto-scaling of jobs, so there might be nobody around to notice and correct the problem in time.

How about keeping initial_interval_secs and just changing its meaning to be a grace period, so that health checks are triggered but errors are ignored during this interval? initial_interval_secs is then a user-configurable upper bound on when a job is meant to be working. It can even be set rather high, because it won't affect the update performance.

What do you think?

Best Regards,
Stephan

From: Maxim Khutornenko ma...@apache.org
Sent: Tuesday, May 5, 2015 10:24 PM
To: dev@aurora.apache.org
Subject: Health Checks for Updates design review

Hi,

I have put together a design proposal for improving health-enabled job update performance. Please review and leave your comments:
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit

Thanks,
Maxim
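For illustration only, the grace-period semantics Stephan suggests could be sketched roughly as follows. This is not Aurora code: `GracePeriodPolicy`, `evaluate`, the state strings, and the `max_consecutive_failures` parameter are hypothetical names chosen for the sketch; only `initial_interval_secs` comes from the thread. The sketch assumes that a failed check inside the grace period is always ignored, that the first successful check marks the task healthy (so a long grace period does not slow an update), and that failures after the grace period count against a consecutive-failure limit.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GracePeriodPolicy:
    """Hypothetical policy object; not part of the Aurora API."""
    initial_interval_secs: float        # grace period: failures inside it are ignored
    max_consecutive_failures: int = 0   # assumed limit applied after the grace period


def evaluate(policy: GracePeriodPolicy,
             checks: Iterable[Tuple[float, bool]]) -> str:
    """Classify a task from (seconds_since_start, check_passed) results.

    Returns "HEALTHY" once any check has passed and the failure limit was
    never exceeded, "FAILED" when too many consecutive checks fail after
    the grace period, and "STARTING" when no check has passed yet.
    """
    consecutive_failures = 0
    seen_success = False
    for elapsed_secs, passed in checks:
        if passed:
            seen_success = True
            consecutive_failures = 0
            continue
        if elapsed_secs < policy.initial_interval_secs:
            # Inside the grace period: the check ran, but its error is ignored.
            continue
        consecutive_failures += 1
        if consecutive_failures > policy.max_consecutive_failures:
            return "FAILED"
    return "HEALTHY" if seen_success else "STARTING"


if __name__ == "__main__":
    policy = GracePeriodPolicy(initial_interval_secs=30)
    # Two early failures are ignored; the pass at 15s marks the task healthy.
    print(evaluate(policy, [(5, False), (10, False), (15, True)]))
    # A failure after the 30s grace period is counted and fails the task.
    print(evaluate(policy, [(5, False), (35, False)]))
```

Under these assumptions a task that starts quickly is reported healthy at its first passing check, while a task that never passes is failed shortly after `initial_interval_secs` elapses rather than staying in STARTING indefinitely.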
Re: Health Checks for Updates design review
Thanks for your comment, Stephan. I have moved it into the doc to keep discussion history in one place.

On Wed, May 6, 2015 at 1:33 AM, Erb, Stephan stephan@blue-yonder.com wrote:
> Hi Maxim,
>
> I am not keen on the potential risk of tasks getting stuck in STARTING. We perform auto-scaling of jobs, so there might be nobody around to notice and correct the problem in time.
>
> How about keeping initial_interval_secs and just changing its meaning to be a grace period, so that health checks are triggered but errors are ignored during this interval? initial_interval_secs is then a user-configurable upper bound on when a job is meant to be working. It can even be set rather high, because it won't affect the update performance.
>
> What do you think?
>
> Best Regards,
> Stephan
>
> From: Maxim Khutornenko ma...@apache.org
> Sent: Tuesday, May 5, 2015 10:24 PM
> To: dev@aurora.apache.org
> Subject: Health Checks for Updates design review
>
> Hi,
>
> I have put together a design proposal for improving health-enabled job update performance. Please review and leave your comments:
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit
>
> Thanks,
> Maxim
Health Checks for Updates design review
Hi,

I have put together a design proposal for improving health-enabled job update performance. Please review and leave your comments:
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit

Thanks,
Maxim