[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569600#comment-15569600 ]
Kai Huang commented on AURORA-1791:
-----------------------------------
We've decided to revert the commit.
The change that directly causes the problem is:
Modify executor state transition logic to rely on health checks (if enabled).
commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
There are two downstream commits that depend on the above commit:
Add min_consecutive_health_checks in HealthCheckConfig
commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
Add support for receiving min_consecutive_successes in health checker
commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
We will drop all three of these commits and revert to the commit immediately
before the problematic one:
Running task ssh without an instance should pick a random instance
commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
I will create a separate ticket for people to track the revert.
> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
> Issue Type: Bug
> Reporter: Zameer Manji
> Assignee: Kai Huang
> Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 |
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
> is not backwards compatible. The last section of the commit
> {quote}
> 4. Modified the Health Checker and redefined the meaning
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
> initial_interval_secs: 10
> interval_secs: 5
> max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking occurs for the first 10 seconds,
> so the earliest a task can fail a health check is at the 10th second.
> On master, health checking starts right away, which means the task can fail at
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the change to the meaning of
> {{initial_interval_secs}} and have the task transition into RUNNING when
> {{min_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> {noformat}
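
The timing difference described in the issue can be illustrated with a small
simulation. This is only a sketch under stated assumptions, not the executor's
actual code: the function earliest_failure_second, the failures_to_fail
parameter, and the checks_start_immediately flag are hypothetical names
introduced here. It assumes checks run exactly every interval_secs and that
every check fails.

```python
def earliest_failure_second(initial_interval_secs, interval_secs,
                            failures_to_fail, checks_start_immediately):
    """Earliest second at which an always-failing task can be marked failed.

    Simplified, hypothetical model (not Aurora's implementation):
    - failures_to_fail is the number of consecutive failed checks that
      marks the task as failed (an assumption of this sketch).
    - checks_start_immediately=False models the 0.16.0 behavior described
      in the issue: no checks run during initial_interval_secs.
    - checks_start_immediately=True models the behavior on master described
      in the issue: checks start right away.
    """
    if failures_to_fail < 1:
        raise ValueError("failures_to_fail must be at least 1")
    # Second at which the first health check is allowed to run.
    first_check = 0 if checks_start_immediately else initial_interval_secs
    # Each additional required failure costs one more check interval.
    return first_check + (failures_to_fail - 1) * interval_secs


# Values from the health check config quoted in the issue.
config = {"initial_interval_secs": 10, "interval_secs": 5}

# Following the issue's reading, a single failed check is enough to fail.
legacy = earliest_failure_second(failures_to_fail=1,
                                 checks_start_immediately=False, **config)
master = earliest_failure_second(failures_to_fail=1,
                                 checks_start_immediately=True, **config)
print(legacy, master)  # prints "10 0"
```

In this model the same config that tolerated a 10 second warm-up on 0.16.0
can fail a slow-starting task immediately on master, which is the
incompatibility the issue reports.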
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)