Re: Review Request 52766: Fix a bug in insufficient successes during initial_interval_secs

Aurora ReviewBot Tue, 11 Oct 2016 22:26:47 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52766/#review152269
-----------------------------------------------------------



Ship it!




Master (e9abb22) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Oct. 12, 2016, 5:01 a.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52766/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2016, 5:01 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Bugs: AURORA-1791
>     https://issues.apache.org/jira/browse/AURORA-1791
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Fix a bug in commit ca683cb. The commit is related to this review 
> https://reviews.apache.org/r/51876/. Please see it for more details and 
> background.
> 
> Currently, health checks are performed during a grace period called 
> initial_interval_secs. It is likely that HealthChecker fails to see a 
> sufficient number of successes before initial_interval_secs expires. For 
> example, consider a task with HealthCheckConfig(initial_interval_secs=15, 
> interval_secs=10, min_consecutive_successes=1). If the task sleeps during the 
> first 12 seconds and becomes healthy afterwards, the health checker will 
> report the task status as "TASK_FAILED" and miss the "healthy" status between 
> seconds 12 and 15. This is because only one health check is performed, at 
> second 10, before initial_interval_secs expires. This is an implementation 
> flaw that breaks backward compatibility.
> 
> To address this problem, I rewrote the function that is responsible for 
> updating the failure counts and the healthy status. The expected behavior is 
> that, for the task described above, the health checker will perform a health 
> check after initial_interval_secs expires and set the health check 
> status to healthy. Please see this review for more details.
> 
> Will add some more tests since the current e2e tests do not include the 
> above test case.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 1e0be108b49480d57c5ab94b1d2903bb57bae20a 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> 28769dca68a6353fc1283a8bb279fae05173aaac 
> 
> Diff: https://reviews.apache.org/r/52766/diff/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Modified the test in http_example.py so that the HTTP server sleeps for the 
> first 10 seconds.
> 
> Launched a job that contains the task with the default 
> HealthCheckConfig(initial_interval_secs=15, interval_secs=10, 
> min_consecutive_successes=1) in the Vagrant Aurora cluster. The task 
> transitioned to the TASK_RUNNING state after ~20 seconds.
> 
> 
> File Attachments
> ----------------
> 
> Task with default Health Check Config
>   
> https://reviews.apache.org/media/uploaded/files/2016/10/12/64cf6610-9294-46cb-b159-6e5721da5fff__Screen_Shot_2016-10-11_at_6.17.00_PM.png
> 
> 
> Thanks,
> 
> Kai Huang
> 
>
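
The timeline in the quoted description can be illustrated with a small toy
model. This is only a sketch under stated assumptions, not the code in
health_checker.py: the function simulate() and its parameters
healthy_from_secs and allow_check_after_grace are hypothetical names, checks
are assumed to fire every interval_secs starting at second interval_secs, and
max_consecutive_failures is not modeled.

```python
def simulate(initial_interval_secs, interval_secs, min_consecutive_successes,
             healthy_from_secs, allow_check_after_grace):
    """Return (status, decided_at_secs) for a toy health-check timeline.

    Hypothetical model for illustration only; it is not Aurora's HealthChecker.
    The task is assumed to answer checks successfully from healthy_from_secs on.
    """
    consecutive_successes = 0
    t = interval_secs  # assumption: first check fires after one interval
    while True:
        in_grace = t <= initial_interval_secs
        if not in_grace and not allow_check_after_grace:
            # Strict variant: the verdict is reached when the grace period
            # ends, using only the checks that ran inside it.
            return ("TASK_FAILED", initial_interval_secs)
        if t >= healthy_from_secs:
            consecutive_successes += 1
        else:
            consecutive_successes = 0
            if not in_grace:
                # Failures are only tolerated during the grace period.
                return ("TASK_FAILED", t)
        if consecutive_successes >= min_consecutive_successes:
            return ("TASK_RUNNING", t)
        t += interval_secs


# Configuration from the description; the task becomes healthy at second 12.
config = dict(initial_interval_secs=15, interval_secs=10,
              min_consecutive_successes=1)
print(simulate(healthy_from_secs=12, allow_check_after_grace=False, **config))
# ('TASK_FAILED', 15)
print(simulate(healthy_from_secs=12, allow_check_after_grace=True, **config))
# ('TASK_RUNNING', 20)
```

In this model the only check inside the grace period runs at second 10, while
the task is still asleep, so the strict variant reports a failure at second
15. Allowing one more check after the grace period lets the check at second
20 succeed, which matches the ~20 seconds reported under Testing.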
