Re: Review Request 53590: Change job updates to rely on `health-checks` rather than on `watch_secs`.

Santhosh Kumar Shanmugham Sun, 13 Nov 2016 21:09:58 -0800


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > RELEASE-NOTES.md, lines 19-21
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559232#file1559232line19>
> >
> >     We should expand this to include the fact that `watch_secs` must be set 
> > to 0 for this behavior to kick in?
> 
> Stephan Erb wrote:
>     Please also include that a service will remain in the `STARTING` state 
> until its first health check has passed.


Fixed.

@Stephan: Added - 'A service will remain in `STARTING` state util 
`min_consecutive_successes` consecutive health checks have passed.'


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/development/design-documents.md, line 11
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559233#file1559233line11>
> >
> >     There's already a link further down for the original design doc. We 
> > should probably replace that one with the new one (and make sure the new 
> > one references the original one?)

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/features/job-updates.md, lines 51-53
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559234#file1559234line51>
> >
> >     Oof, this is woefully out of date (the client doesn't do any of this 
> > anymore, the scheduler does).

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/features/job-updates.md, line 56
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559234#file1559234line56>
> >
> >     should detail where this value is set (i.e. "by introducing a 
> > `min_consecutive_successess` parameter on the `HealthCheckConfig` object")

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/features/job-updates.md, line 57
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559234#file1559234line57>
> >
> >     "the `RUNNING` state"

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/features/job-updates.md, line 60
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559234#file1559234line60>
> >
> >     s/Inorder/In order

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/features/job-updates.md, line 58
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559234#file1559234line58>
> >
> >     same here for `RUNNING` and `FAILED`

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/features/job-updates.md, line 65
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559234#file1559234line65>
> >
> >     copy editor nit: I think if you replace "purely based" with "based 
> > only" this reads a little bit better.

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > docs/reference/configuration.md, line 382
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559235#file1559235line382>
> >
> >     s/health-check failures are ignore/during which health-check failures 
> > are ignored

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/client/config.py, lines 115-128
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559237#file1559237line115>
> >
> >     for k, v in [('watch_secs', watch_secs), (...)]:
> >           die(INVALID_VALUE_ERROR_FORMAT % (k, v))
> >     
> >     is a little less repetitive?

Agree. Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, line 167
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559239#file1559239line167>
> >
> >     let's give the reason arg a name rather than accessing it out of 
> > `*args`?

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, lines 175-180
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559239#file1559239line175>
> >
> >     Why not just change the signature of `from_assigned_task` on the 
> > `StatusCheckerProvider` interface so we don't need to perform this 
> > `instanceof` check? Non-health-check providers can just ignore the callback 
> > param.

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, lines 176-177
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559239#file1559239line176>
> >
> >     style nit: one arg per line for continuation indents

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, line 180
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559239#file1559239line180>
> >
> >     ditto here, re: continuation

Removed this line.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 97
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559240#file1559240line97>
> >
> >     rename this to `start_time`?

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 98
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559240#file1559240line98>
> >
> >     inline this below?

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 146
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559240#file1559240line146>
> >
> >     Set this after the callback is dispatched? I doubt an error invoking 
> > the callback is recoverable, but it seems safer/cleaner to only set this 
> > after we know for certain that the callback has invoked successfully.

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 165-169
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559240#file1559240line165>
> >
> >     Can you add a comment here explaining why this is safe vis-a-vis not 
> > having a potentially dangling health check outside of the interval that 
> > could satisfy the min consecutive? I.e. the deadline is calculated based on 
> > the initial interval + one interval for each required health check?

Done.


> On Nov. 9, 2016, 8:31 a.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 98-99
> > <https://reviews.apache.org/r/53590/diff/1/?file=1559240#file1559240line98>
> >
> >     If we have a single failed health check after the grace period has 
> > expired, it will never be possible for the task to be considered healthy 
> > and thus will always fail, right?
> >     
> >     Imagine `initial_interval_secs` set to 60, `interval_secs` set to 10, 
> > `min_consecutive_successes` set to 3, and `max_consecutive_failures` set to 
> > 2.
> >     
> >     So, the deadline is 90 seconds.
> >     
> >     First 60 seconds we ignore health check failures. At t70 we perform a 
> > health check and it fails. We increment `current_consecutive_failures` to 1 
> > but since we haven't exceeded `max_consecutive_failures` we keep waiting. 
> > Now at t80 we perform another health check and it succeeds. We increment 
> > `current_consecutive_successes` to 1 and reset 
> > `current_consecutive_failures` to 0. `current_consecutive_successes` is < 
> > 3, so we keep going. Repeat at t90, so `current_consecutive_successes` is 
> > now 2. At t100 we check and see that we've exceeded the deadline 
> > `current_consecutive_successes` is 2 which is less than 
> > `min_consecutive_successes` so we fail the task. After that first failure 
> > we never could have succeeded.
> >     
> >     Is this the desired behavior? If so, should we short circuit on the 
> > first failure outside of the grace period regardless of the value of  
> > `max_consecutive_failures`?
> >     
> >     Also, even if we pass all health checks, if we don't start passing 
> > health checks until after the grace period and there's even a millisecond 
> > of delay at any time in the health check loop, since the final deadline 
> > expiration is checked before the health check is run, even if we're firing 
> > the third health check at ts90001 (for a 90 second deadline), we'll 
> > consider the deadline expired and thus the task failed before get a chance 
> > to run the third health check. Again, is that the desired outcome (a case 
> > can be made that your grace period should account for this...).

Added logic to fail-fast when there are not enough attempts left to reach 
success.

Changed it so that decisions are based on number of attempts instead of time 
spent, which should solve the issue of sleeping slightly over the 
`interval_secs` by `epsilon` does not cause a task to fail by stopping it from 
attempting a health check to succeed. What this means is that -

`deadline` changes to `max_attempts_until_callback = 
math.ceil(initial_interval_secs/interval_secs) + min_consecutive_successes`

We track progress in time via `attempts` -> number of attempts completed so far.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53590/#review155440
-----------------------------------------------------------


On Nov. 9, 2016, 12:29 a.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53590/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2016, 12:29 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
>     https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Make RUNNING a first class state to indicate that the task is running
> and is healthy. It is achieved by introducing a new configuration
> parameter `min_consecutive_successes`, which will dictate when to move
> a task into RUNNING state.
> 
> With this change, it is possible to set the `watch_secs` to 0, so that
> updates are purely based on the task's health, rather than relying on
> watching the task to in RUNNING state for a pre-determined timeout.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 392496284b7f046a5dd29a5a3f175305c28a5afe 
>   docs/development/design-documents.md 
> 6bfc679c841ef1b0861758ccc12a5150f1bdb5e3 
>   docs/features/job-updates.md 792f2ae5fd14b1ea9af8be000629ce5a7fc2fe8f 
>   docs/reference/configuration.md f2a0b1873f31e91f3bf0cac6f8448e8130fae688 
>   src/main/python/apache/aurora/client/api/updater_util.py 
> c649316edb876565c92cc90c9f030e153c008924 
>   src/main/python/apache/aurora/client/config.py 
> 0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1 
>   src/main/python/apache/aurora/config/schema/base.py 
> 845163043b0b7b2f9e7aca14677ca9f094658551 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> aee5e56ae66ec7a6acbe97d94d6187fd8646ec9a 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 3c7c09d173c4adbd89eddaee22cffcdd8d268378 
>   src/test/python/apache/aurora/client/test_config.py 
> 5cf68a5145ddf9478baa30453c0bcb73136fa7eb 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> da0c56ca084b65427a6122f22f251af8637772d1 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 3f82165333665d127e1ace765a7dada7348a4b91 
> 
> Diff: https://reviews.apache.org/r/53590/diff/
> 
> 
> Testing
> -------
> 
> buils-support/jenkins/build.sh
> sh ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>

Re: Review Request 53590: Change job updates to rely on `health-checks` rather than on `watch_secs`.

Reply via email to