Kai Huang commented on AURORA-1791:

 Thanks for the examples, [~davmclau] It seems the above example you provided 
is similar to approach(a) that I implemented in the patch.
One thing that is worth noting is that this approach relies on the latest 
health check, even if it's way beyond initial_interval_secs expires.
Consider the case where initial_interval_secs = 150, interval_secs = 100, the 
task becomes healthy at second 101, we won't be able to see the healthy state 
until second 200. This is some grey area in the design doc which we need to 
discuss more. 

I agree with you that we should take a step back, since it seems we need more 
discussion on whether min_consecutive_successes should trigger TASK state 
transition after initial_interval_secs expires. 

> Commit ca683 is not backwards compatible.
> -----------------------------------------
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>       initial_interval_secs: 10
>       interval_secs: 5
>       max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}

This message was sent by Atlassian JIRA

Reply via email to