-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52766/#review152248
-----------------------------------------------------------

Ship it!


Master (e9abb22) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Oct. 12, 2016, 1:18 a.m., Kai Huang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52766/
> -----------------------------------------------------------
>
> (Updated Oct. 12, 2016, 1:18 a.m.)
>
>
> Review request for Aurora, Joshua Cohen and Zameer Manji.
>
>
> Bugs: AURORA-1791
>     https://issues.apache.org/jira/browse/AURORA-1791
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Fix a bug in commit ca683cb. The commit is related to this review:
> https://reviews.apache.org/r/51876/. Please see it for more details and
> background.
>
> Currently, health checks are performed during a grace period called
> initial_interval_secs. It is likely that HealthChecker fails to see a
> sufficient number of successes before initial_interval_secs expires. For
> example, consider a task with HealthCheckConfig(initial_interval_secs=15,
> interval_secs=10, min_consecutive_successes=1). If the task sleeps during the
> first 12 seconds and becomes healthy afterwards, the health checker will
> report the task status as "TASK_FAILED" and miss the "healthy" status between
> seconds 12 and 15. This is because only one health check is performed, at
> second 10, before initial_interval_secs expires. This is an implementation
> flaw that breaks backward compatibility.
>
> To address this problem, I rewrote the function that is responsible for
> updating the failure counts and the healthy status. The expected behavior is
> that for the task described above, the health checker will perform a health
> check after initial_interval_secs expires and set the health check
> status to healthy. Please see this review for more details.
>
> Will add some more tests, since the current e2e tests do not include the
> above test case.
>
>
> Diffs
> -----
>
>   src/main/python/apache/aurora/executor/common/health_checker.py 1e0be108b49480d57c5ab94b1d2903bb57bae20a
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 28769dca68a6353fc1283a8bb279fae05173aaac
>
> Diff: https://reviews.apache.org/r/52766/diff/
>
>
> Testing
> -------
>
> ./build-support/jenkins/build.sh
>
> ./pants test.pytest src/test/python/apache/aurora/executor::
>
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
>
> Modified the test in http_example.py so that the HTTP server sleeps for the
> first 10 seconds.
>
> Launched a job that contains a task with the default
> HealthCheckConfig(initial_interval_secs=15, interval_secs=10,
> min_consecutive_successes=1) in the Vagrant Aurora cluster. The task
> transitions to the TASK_RUNNING state after ~20 seconds.
>
>
> File Attachments
> ----------------
>
> Task with default Health Check Config
>   https://reviews.apache.org/media/uploaded/files/2016/10/12/64cf6610-9294-46cb-b159-6e5721da5fff__Screen_Shot_2016-10-11_at_6.17.00_PM.png
>
>
> Thanks,
>
> Kai Huang
>
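
To make the timing in the quoted description concrete, here is a minimal sketch of the two behaviors. It is a simplified model and not the code in health_checker.py: the function names, the healthy_from and horizon_secs parameters, and the assumption that the first check fires one interval_secs after start are all hypothetical and chosen for illustration only.

```python
def check_times(interval_secs, horizon_secs):
    """Times of periodic checks, ASSUMING the first one fires after one interval."""
    return list(range(interval_secs, horizon_secs + 1, interval_secs))


def verdict_at_grace_expiry(healthy_from, initial_interval_secs,
                            interval_secs, min_consecutive_successes):
    """Model of the flawed behavior: only checks inside the grace period count."""
    streak = 0
    for t in check_times(interval_secs, initial_interval_secs):
        # The task is modelled as unhealthy before healthy_from, healthy after.
        streak = streak + 1 if t >= healthy_from else 0
        if streak >= min_consecutive_successes:
            return "TASK_RUNNING"
    return "TASK_FAILED"


def verdict_with_post_grace_check(healthy_from, initial_interval_secs,
                                  interval_secs, min_consecutive_successes,
                                  horizon_secs=60):
    """Model of the expected behavior: failures inside the grace period are
    ignored and checking continues after it expires.

    Returns (status, time_of_decision). For simplicity, the first failure
    after the grace period is treated as fatal (an assumption of this model).
    """
    streak = 0
    for t in check_times(interval_secs, horizon_secs):
        if t >= healthy_from:
            streak += 1
            if streak >= min_consecutive_successes:
                return ("TASK_RUNNING", t)
        else:
            streak = 0
            if t > initial_interval_secs:
                return ("TASK_FAILED", t)
    return ("TASK_FAILED", horizon_secs)


if __name__ == "__main__":
    # Task from the description: healthy only from second 12 onwards.
    print(verdict_at_grace_expiry(12, 15, 10, 1))
    print(verdict_with_post_grace_check(12, 15, 10, 1))
```

In this model, the only check inside the 15-second grace period happens at second 10, while the task is still unhealthy, so the first function reports TASK_FAILED. The second function ignores that failure and decides at the next check, at second 20.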