> On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py
> > Lines 163-166 (patched)
> > <https://reviews.apache.org/r/58462/diff/1/?file=1692816#file1692816line163>
> >
> >     This will cause a task to get stuck in `STARTING` since `self.running`
> >     will never be set to `True`.
> >
> >     Can you explain the particular use case here? Also add a test case to
> >     exercise this branch.
>
> Vladimir Khalatyan wrote:
>     The idea is to make the HealthCheck process start only after some of the
>     setup processes have finished. With the current approach it's possible to
>     adjust the "starting" point of the HealthCheck process by changing
>     initial_interval_secs, but that means relying on timing, which doesn't
>     guarantee anything.
>     The idea of HealthCheck "snoozing" is to ignore any status of the
>     health check until some process tells HealthCheck to start checking the
>     health of the service.
>
>     Example (a simplified one):
>     Let's assume we start two processes on the machine: the LB registration
>     and the uwsgi process. Let's say the uwsgi process requires some time to
>     warm up. The LB registration depends on the load on the LB, how soon uwsgi
>     warms up, etc. So the actual moment when the application becomes available
>     can vary from couple of seconds to minutes and we can not rely on
>     initial_interval_secs. So we create a .healthchecksnooze file and ignore
>     all results of the health check as long as this file is there. In the
>     meantime the LB registration process will try to register the service some
>     number of times ( < max_failures) and delete the .healthchecksnooze file
>     after it succeeds. From that moment on the health check will start
>     incrementing the consecutive successes or failures, and we can determine
>     whether the deployment is successful or not.
>     So with this approach we can specify the "starting" point of health
>     checking more accurately and make it dependent on other processes.
>
>     Here by "starting" point of the health check I mean the point at which
>     the application health is checked and the consecutive successes or
>     failures change, not the start of the actual system process.
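
To make the snoozing idea quoted above concrete, here is a minimal sketch of the counting behaviour it describes. It is not the code under review: `SnoozableCounter`, `record` and `demo` are hypothetical names made up for illustration, the sketch is written for Python 3, and the only details taken from the thread are the `.healthchecksnooze` file name and the notion of consecutive successes and failures.

```python
import os
import tempfile


class SnoozableCounter:
    """Hypothetical model of the proposed snoozing; not Aurora's HealthChecker."""

    def __init__(self, snooze_file):
        self.snooze_file = snooze_file
        self.consecutive_successes = 0
        self.consecutive_failures = 0

    def snoozed(self):
        # Snoozing is active for as long as the marker file exists.
        return os.path.isfile(self.snooze_file)

    def record(self, check_passed):
        """Record one health check result; return True if it was counted."""
        if self.snoozed():
            # Neither counter moves while the snooze file is present.
            return False
        if check_passed:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
        return True


def demo():
    """Replay the scenario from the thread in a temporary sandbox directory."""
    with tempfile.TemporaryDirectory() as sandbox:
        snooze_file = os.path.join(sandbox, ".healthchecksnooze")
        open(snooze_file, "w").close()  # setup process creates the marker

        counter = SnoozableCounter(snooze_file)
        counter.record(False)  # ignored: service still warming up
        counter.record(True)   # ignored as well: still snoozed
        while_snoozed = (counter.consecutive_successes, counter.consecutive_failures)

        os.remove(snooze_file)  # e.g. LB registration has succeeded
        counter.record(True)
        counter.record(True)
        after_snooze = (counter.consecutive_successes, counter.consecutive_failures)
        return while_snoozed, after_snooze
```

In this sketch a snoozed check changes nothing at all, which is the behaviour the description asks for. Whether a snoozed check should still consume an attempt is a separate question that the sketch deliberately leaves out.
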
> "So the actual moment when the application becomes available can vary from
> couple of seconds to minutes and we can not rely on initial_interval_secs."

The current implementation already addresses the problem of
`initial_interval_secs` not adapting to varying startup times. It does so by
performing health checks during the startup period (`initial_interval_secs`)
and ignoring all failures during that period, while successful health checks
now count towards transitioning the task to a healthy (RUNNING) state. It can
therefore accommodate both slow and fast startups without making
faster-starting instances wait until the entire `initial_interval_secs` has
expired.

However, for your change in particular, you might also need to account for
`_should_enforce_deadline`, which will treat a task as unhealthy if it runs
out of attempts.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58462/#review172032
-----------------------------------------------------------


On April 14, 2017, 1:35 p.m., Vladimir Khalatyan wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58462/
> -----------------------------------------------------------
>
> (Updated April 14, 2017, 1:35 p.m.)
>
>
> Review request for Aurora, Joshua Cohen and Zameer Manji.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze
> is present.
>
>
> Diffs
> -----
>
>   src/main/python/apache/aurora/executor/common/health_checker.py
>   e9e4129af2db5202a82e9f6d54109a00bbae97ce
>
>
> Diff: https://reviews.apache.org/r/58462/diff/1/
>
>
> Testing
> -------
>
> The Health Check is succeeding when the .healthchecksnooze is present. But it
> should just snooze, which means there shouldn't be any increase in
> consecutive successes or consecutive failures.
>
>
> Thanks,
>
> Vladimir Khalatyan
>
>
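
For reference, the interaction between the startup period, snoozed checks and the attempt deadline discussed in this thread can be modelled roughly as below. This is a simplified sketch under stated assumptions, not the logic in `health_checker.py`: the function `simulate_startup`, the attempt budget formula and the assumption that a snoozed check (represented here as `None`) still uses up an attempt are all illustrative guesses.

```python
import math


def simulate_startup(results, initial_interval_secs, interval_secs,
                     min_consecutive_successes):
    """Replay check results and return 'RUNNING', 'FAILED' or 'STARTING'.

    results: iterable of True (passed), False (failed) or None (snoozed).
    All rules below are assumptions made for illustration only.
    """
    # Assumed: failures are forgiven for this many attempts after startup.
    forgiving_attempts = math.ceil(initial_interval_secs / interval_secs)
    # Assumed: total attempt budget for reaching the RUNNING state.
    max_attempts = forgiving_attempts + min_consecutive_successes

    successes = 0
    for attempt, passed in enumerate(results, start=1):
        if passed is None:
            pass  # snoozed: counters untouched, but the attempt is spent
        elif passed:
            successes += 1
            if successes >= min_consecutive_successes:
                return "RUNNING"  # fast starters do not wait out the window
        else:
            successes = 0
            if attempt > forgiving_attempts:
                return "FAILED"  # failure after the forgiving window
        if attempt >= max_attempts:
            return "FAILED"  # deadline: out of attempts before RUNNING
    return "STARTING"


# Fast startup: RUNNING after two checks, well inside the 15 second window.
fast = simulate_startup([True, True], 15, 5, 2)
# Slow startup: three forgiven failures, then two successes.
slow = simulate_startup([False, False, False, True, True], 15, 5, 2)
# Snoozed for the whole budget: the deadline marks the task as failed.
snoozed = simulate_startup([None] * 5, 15, 5, 2)
```

Under these assumptions a task that stays snoozed for its whole attempt budget is failed by the deadline rather than left in STARTING, so a snoozing change would have to decide whether snoozed checks count as attempts.
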
