Re: Review Request 58462: Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present

Vladimir Khalatyan Tue, 18 Apr 2017 11:18:35 -0700


> On April 14, 2017, 9:26 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py
> > Lines 163-166 (patched)
> > <https://reviews.apache.org/r/58462/diff/1/?file=1692816#file1692816line163>
> >
> >     This will cause a task to get stuck in `STARTING` since `self.running` 
> > will never be set to `True`.
> >     
> >     Can you explain the particular usecase here? Also add a test case to 
> > exercise this branch.
> 
> Vladimir Khalatyan wrote:
>     The idea is to make HealthCheck process to start after some of the setup 
> processes are finished. With the current approach it's possible to addjust 
> the "starting" point of the HealthCheck process by changing 
> initial_interval_secs. But it means that we rely on the timing which doesn't 
> guarantee anything.  
>     The idea of HealthCheck "snoozing" is ignore any status of the 
> healthcheck unless some process tells HealthCheck to start checking the 
> health of the service.
>     
>     Example (simplified one):
>      Let's assume we start two processes on the machine: the LB registration 
> and the UWSGI process. Let's say the uwsgi process requires some time to warm 
> up. The LB registration depends on the load on LB, how soon uwsgi warms up, 
> etc. So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs. So we create a .healthchecksnooze file and ignore all 
> results of the healthcheck unless this file is there. In a meanwhile the LB 
> registration process will try to register service some number of times ( < 
> max_failures) and delete the .healthchecksnooze after it succeeds. Since this 
> particular moment the healthcheck will start incrementing the concecutive 
> successes or failures and we can determine whether the deployment is 
> successfull or not. 
>      So with this approach we can specify the "starting" point of health 
> checking more accurately and dependent on other processes. 
>      
>      Here by "starting" point of the health check I mean the checking of the 
> application health and changing the consecutive successes or failures, not 
> the actual system process.
> 
> Santhosh Kumar Shanmugham wrote:
>     > "So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs."
>     
>     The current implementation addresses this problem of 
> `initial_interval_secs` not responding faster with varying startup times. It 
> achieves this by performing `health checks` during the startup time 
> (`initial_interval_secs`) but ignores all failures during this period, 
> however successful health checks now count towards transitioning the task to 
> a healthy (RUNNING) state. Thereby it can accomodate both slow startup as 
> well as fast startup without making the faster startup instances from waiting 
> until the entire `initial_interval_secs` has expired.
>     
>     However for your change in particular, you might also need to account for 
> `_should_enforce_deadline` - which will treat a task as unhealthy if it runs 
> out of attempts.
> 
> Santhosh Kumar Shanmugham wrote:
>     I was just looking at the docs and your usecase of health-check snoozing 
> is vastly different from the usecase the documentation - 
>     
>     > You can pause health checking by touching a file inside of your 
> sandbox, named .healthchecksnooze. As long as that file is present, health 
> checks will be disabled, enabling users to gather core dumps or other 
> performance measurements without worrying about Aurora’s health check killing 
> their process.
>     
>     Using health check snoozing to achieve synchronization of health checks, 
> is indicative of how inflexible the health-checking mechanism is in reality. 
> :(
> 
> Stephan Erb wrote:
>     Personally speaking, I would not want to run without a sane 
> `initial_interval_secs` (e.g. 5-10 minutes). In distributed systems, things 
> tend to stall from time to time. Not having a timeout to unstuck a deployment 
> seems very problematic to me.
>     
>     There is a recent discussion on the mailing list discovering a related 
> issue: 
> https://lists.apache.org/thread.html/a0b01ba8928dc637cff223d2ff74a8d00436850954d61150fc709ca4@%3Cuser.aurora.apache.org%3E
> 
> Vladimir Khalatyan wrote:
>     Santhosh, so what can you suggest in my case? Let's assume the LB 
> registration process depends on the latency, readiness of the service, etc. 
> Similar dependencied for uwsgi process. What is the initial_interval_secs 
> should I set? Or how can I determine what is the interval supposed to be? 1 
> minute? 10 minutes? 30 minutes? I can only guess. But if there are any delays 
> on the LB register or uwsgi warm up, everything could screw up. So it's not 
> flexible anymore. 
>     With the approach of disabling the health check process, it's flexible. 
> Ideally I don't want to health check my application unless it's ready to run. 
> The readiness could be checked on my side, inside my processes. And somehow I 
> should be able to tell health check that my application is ready for 
> checking. The ".healthchecksnooze" file could be a great candidate for that.

I am not saying that there is no control of the timeout. 
Let's simplify everything. We have process A (some setup process), process B 
(another setup process) and process C (the main daemon process of the 
application).
We would like to execute these processes in an order of A->B->C.

Unfortunately currently there is no control of when to start the HealthCheck 
(HC) unless trying to predict the initial_interval_secs (which basically means 
don't accept failures unless all setups are finished and my main process is 
started) and few other parameters. For example in case I add one more process D 
between A and B, I would need to recalculate the initial_interval_secs in order 
to have the system work properly. 

Ideally you would like to start A, then B, then C and start the HC after it. So 
the .healthchecksnooze file would be an identifier of when to start the HC 
(unless it's possible to have a dependency for the HC, which would be awesome 
and simplify everything). What it means is that "Don't run the HC if this file 
exists". So after all processes A,B and C are executed properly, one of those 
processes (for example C) can delete the .healthchecksnooze file and that 
basically means that we can start a HC. We still can have initial_interval_secs 
if we want but probably min_consecutive_successes and max_consecutive_failures 
are enough. With this file we can shift the starting point of healthchecking to 
whenever we want depending on our configurations instead of having it running 
right at the beginning of the Task.

And still we have the controlling configurations for each of these processes. 
You can specify what is timeout for A, how many max failures acceptable for B, 
etc. So if something "stuck" at some point and reach the failure requirements 
the deployment would fail.

- Vladimir

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58462/#review172032
-----------------------------------------------------------

On April 14, 2017, 8:35 p.m., Vladimir Khalatyan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58462/
> -----------------------------------------------------------
> 
> (Updated April 14, 2017, 8:35 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze 
> present
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> e9e4129af2db5202a82e9f6d54109a00bbae97ce 
> 
> 
> Diff: https://reviews.apache.org/r/58462/diff/1/
> 
> 
> Testing
> -------
> 
> The Health Check is succeeding when the .healthchecksnooze is present. But it 
> should just snooze which means there shouldn't be any increase in consecutive 
> successes or consecutive failures.
> 
> 
> Thanks,
> 
> Vladimir Khalatyan
> 
>

Re: Review Request 58462: Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present

Reply via email to