Re: Review Request 54299: Extend warm-up time by `max_consecutive_failures` attempts.

Santhosh Kumar Shanmugham Sun, 04 Dec 2016 16:17:42 -0800


> On Dec. 2, 2016, 1:44 p.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 115-117
> > <https://reviews.apache.org/r/54299/diff/1/?file=1574585#file1574585line115>
> >
> >     There still exists the chance for a backwards incompatibility here. 
> > Under the previous watch-driven updates, a task could flip between failing 
> > and successful health checks, and as long as it's still running at the end 
> > of `watch_secs` the updater would consider it healthy and move on. With 
> > this new behavior, someone could configure a task in such a way that the 
> > max attempts are consumed without reaching `max_consecutive_failures` or 
> > `min_consecutive_successes` before `watch_secs` is elapsed, meaning that 
> > the task would fail.
> >     
> >     As we discussed earlier, if we make `watch_secs` and 
> > `min_consecutive_successes` mutually exclusive in the client, then the 
> > executor could only trigger the new behavior if the user opted in by 
> > setting `watch_secs` to 0 and `min_consecutive_successes` to non-zero.

I believe that the situation you are describing would occur only when 
`min_consecutive_successes > 1`, which means that the user has already opted 
in to the new behavior.

#Old behavior:#
#*With `watch_secs`*#
Task starts in `RUNNING` state. Task has to report at least 1 success within 
the first `initial_interval_secs + max_consecutive_failures * interval_secs` 
(no health checks are done during `initial_interval_secs`, hence the 
multiplier is `max_consecutive_failures` and not 
`max_consecutive_failures + 1`). Following this, the task must report at least 
1 success every `max_consecutive_failures` checks to remain in `RUNNING`, 
until `watch_secs` expires.

#*Without `watch_secs`*#
Task starts in `RUNNING` state. Task has to report at least 1 success within 
the first `initial_interval_secs + max_consecutive_failures * interval_secs` 
(no health checks are done during `initial_interval_secs`, hence the 
multiplier is `max_consecutive_failures` and not 
`max_consecutive_failures + 1`).

#New behavior:#
#*With `watch_secs`*#
Task has to report at least `min_consecutive_successes` (default=1) 
consecutive successes within the first `initial_interval_secs + 
(max_consecutive_failures + min_consecutive_successes) * interval_secs` to 
move to `RUNNING` state. Following this, the task must report at least 1 
success every `max_consecutive_failures` checks to remain in `RUNNING`, until 
`watch_secs` expires.

#*Without `watch_secs`*#
Task has to report at least `min_consecutive_successes` (default=1) 
consecutive successes within the first `initial_interval_secs + 
(max_consecutive_failures + min_consecutive_successes) * interval_secs` to 
move to `RUNNING` state.
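
To make the two windows concrete, here is a minimal sketch that only evaluates 
the formulas above. The function names and the sample values are made up for 
illustration and are not taken from `health_checker.py`.

```python
def old_warmup_secs(initial_interval_secs, interval_secs,
                    max_consecutive_failures):
    # Old behavior: window in which at least 1 success must be reported.
    return initial_interval_secs + max_consecutive_failures * interval_secs


def new_warmup_secs(initial_interval_secs, interval_secs,
                    max_consecutive_failures, min_consecutive_successes=1):
    # New behavior: window in which `min_consecutive_successes` must be
    # reported before the task moves to RUNNING.
    attempts = max_consecutive_failures + min_consecutive_successes
    return initial_interval_secs + attempts * interval_secs


# Hypothetical sample values, chosen only for the arithmetic.
print(old_warmup_secs(15, 10, 2))
print(new_warmup_secs(15, 10, 2))
print(new_warmup_secs(15, 10, 2, min_consecutive_successes=3))
```

For the same settings, the new window is longer than the old one by 
`min_consecutive_successes * interval_secs`.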

Once in `RUNNING`, `min_consecutive_successes` is irrelevant, since the only 
transition possible is from `RUNNING` to a terminal state. Hence it is enough 
for a task to report just 1 success every `max_consecutive_failures` checks to 
remain healthy. One might argue that `min_consecutive_successes` is not 
necessary in the first place. On the other hand, one can argue that it serves 
as a replacement for `watch_secs`, enforcing tighter healthiness conditions 
before a task is treated as successfully updated and thereby preventing bad 
updates from succeeding.

All in all, setting `min_consecutive_successes` to 1 as the default should 
provide us with the necessary backward compatibility.
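
As a rough illustration of the warm-up rule described above (a simplified 
sketch under my reading of it, not the executor code), the function below 
replays a list of health check results and reports whether the task would 
reach `RUNNING`. The name `reaches_running` and the sample sequences are 
hypothetical.

```python
def reaches_running(results, max_consecutive_failures,
                    min_consecutive_successes=1):
    """Return True if `min_consecutive_successes` consecutive successes
    occur within the first
    `max_consecutive_failures + min_consecutive_successes` attempts."""
    max_attempts = max_consecutive_failures + min_consecutive_successes
    consecutive_successes = 0
    for healthy in results[:max_attempts]:
        # A failure resets the streak; a success extends it.
        consecutive_successes = consecutive_successes + 1 if healthy else 0
        if consecutive_successes >= min_consecutive_successes:
            return True
    return False


# Default of 1: a single success inside the window is enough.
print(reaches_running([False, False, True], 2))
# Flapping task with min_consecutive_successes=2: attempts run out.
print(reaches_running([True, False, True, False], 2, 2))
```

With the default of 1, a flapping task still passes as long as one check 
succeeds inside the window. The flapping case only fails once the user has 
raised `min_consecutive_successes` above 1.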

Please refer to the diagrams in the design document. 
https://docs.google.com/document/d/1KOO0LC046k75TqQqJ4c0FQcVGbxvrn71E10wAjMorVY/edit?usp=sharing

> On Dec. 2, 2016, 1:44 p.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 113
> > <https://reviews.apache.org/r/54299/diff/1/?file=1574585#file1574585line113>
> >
> >     s/suppose/supposed

Done.

- Santhosh Kumar

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54299/#review157764
-----------------------------------------------------------

On Dec. 2, 2016, 12:43 a.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54299/
> -----------------------------------------------------------
> 
> (Updated Dec. 2, 2016, 12:43 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, Stephan Erb, and 
> Zameer Manji.
> 
> 
> Bugs: AURORA-1841
>     https://issues.apache.org/jira/browse/AURORA-1841
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> It is possible to set the health checks such that a task can
> continually fail health checks with intermittent successes and still
> succeed an update. Essentially a task fails health checks during the
> `initial_interval_secs` and an additional `max_consecutive_failures`,
> and then performs a successful health check to become healthy.
> 
> To be backward compatible with the above configuration, include the
> `max_consecutive_failures` when computing `max_attempts_to_running`.
> 
> 
> Diffs
> -----
> 
>   docs/features/services.md 50189eeff26ce9614d092f6abd9246788647fe2b 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 12af9d8635a553eabe918a86508aa6ce2fd78a49 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> e2a7f164a24f49dd1f4cdba136e838b9d42d73a2 
> 
> Diff: https://reviews.apache.org/r/54299/diff/
> 
> 
> Testing
> -------
> 
> build-support/jenkins/build.sh
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>
