[
https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491591#comment-14491591
]
Stephan Erb commented on AURORA-894:
------------------------------------
I believe there is a smaller story embedded in this one that is not blocked
by AURORA-279 and is therefore easier to implement.
We could start by introducing the {{STARTING}} state and transitioning a job to
{{RUNNING}} once the first {{min_consecutive_health_checks}} have passed. This
requires introducing the new state on both the server and the executor side,
but keeps the updater out of the loop.
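To make the proposed transition concrete, here is a minimal sketch in Python. It is not Aurora code: {{TaskState}}, {{HealthGate}} and {{record}} are hypothetical names, and resetting the counter on a failed check is my assumption about how "consecutive" would be interpreted.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    # Only the two states discussed in this comment; hypothetical enum.
    STARTING = "STARTING"
    RUNNING = "RUNNING"


@dataclass
class HealthGate:
    """Hypothetical helper that gates STARTING -> RUNNING on health checks."""

    min_consecutive_health_checks: int
    consecutive_successes: int = 0
    state: TaskState = TaskState.STARTING

    def record(self, healthy: bool) -> TaskState:
        """Record one health check result and return the resulting state."""
        if self.state is TaskState.RUNNING:
            # What happens on failures after RUNNING is out of scope here.
            return self.state
        # Assumption: a failed check resets the streak.
        self.consecutive_successes = self.consecutive_successes + 1 if healthy else 0
        if self.consecutive_successes >= self.min_consecutive_health_checks:
            self.state = TaskState.RUNNING
        return self.state


gate = HealthGate(min_consecutive_health_checks=2)
states = [gate.record(result).value for result in (True, False, True, True)]
print(states)  # ['STARTING', 'STARTING', 'STARTING', 'RUNNING']
```

With a threshold of 2, the failed second check resets the streak, so the instance only reports {{RUNNING}} after the fourth check.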
This smaller story also has an immediate benefit. Right now, when implementing
a dashboard or monitoring for services on Aurora, one always has to re-implement
health checks. Just looking at the {{RUNNING}} state is not enough, because the
service might still be starting rather than serving requests. With the proposed
change, however, Aurora guarantees that a {{RUNNING}} service is always healthy
(modulo the acceptable inconsistency window of the health check interval).
> Server updater should watch healthy instances
> ---------------------------------------------
>
> Key: AURORA-894
> URL: https://issues.apache.org/jira/browse/AURORA-894
> Project: Aurora
> Issue Type: Epic
> Components: Scheduler
> Reporter: Maxim Khutornenko
> Assignee: Maxim Khutornenko
> Labels: 2015-Q2
>
> Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}})
> countdown when an instance reaches RUNNING state, the updater should start it
> on the first successful health check. This will potentially speed up
> updates as the {{minWaitInInstanceRunningMs}} will no longer have to be
> chosen based on the worst observed instance startup/warmup delay but rather
> as a desired health check duration according to the following formula:
> {noformat}
> minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000
> {noformat}
> where:
> {{interval_secs}} -
> https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
> {{num_desired_healthchecks}} - the desired number of OK health checks to
> observe before declaring an instance updated successfully
>
> The above would allow every instance to start its watch interval depending on
> the individual instance performance and potentially exit the updater earlier.
> This feature requires AURORA-279.
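As a worked example of the formula in the issue description, the following Python sketch computes the watch duration. The function name and the sample values (a 10 second interval, 3 desired checks) are illustrative assumptions, not values taken from the issue.

```python
def min_wait_in_instance_running_ms(interval_secs: float,
                                    num_desired_healthchecks: int) -> int:
    """minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000"""
    # round() guards against floating-point noise for fractional intervals.
    return round(interval_secs * num_desired_healthchecks * 1000)


# Hypothetical values: health checks every 10 seconds, 3 OK checks desired.
print(min_wait_in_instance_running_ms(10, 3))  # 30000
```

With these sample values the updater would watch each instance for 30 seconds after its first successful health check, regardless of how long the instance took to warm up.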