[
https://issues.apache.org/jira/browse/AURORA-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974658#comment-14974658
]
Brian Weber commented on AURORA-279:
------------------------------------
It shouldn't be too much to ask for a guard rail that prevents a health check
reaction from taking down an entire job. It also doesn't look like a huge
addition: an integer such as max_concurrent_restarts (perhaps defaulting to the
batch size?) would let customers who don't have a central remediation
framework rely on aurora to manage the failure rate.
e.g.: a job with 1000 instances can serve well enough with 10% (100 instances)
down. Suppose a bug runs wild and instances arbitrarily start responding
unhealthy. If a restart temporarily fixes the bug until the next deploy, cool.
But if the bug hits enough instances that, between the bug and the restarts
already in progress, over 100 instances are down, then the configured health
check would take down enough instances that the service could stop serving
well at all.
Suppose instead that thermos queried aurora for permission to remediate, and
aurora could then rate-limit remediations and send a notification to someone so
they can respond sooner. Aurora would then know that 10% of the fleet is
down, and hold off while a human is notified. It would then be up to the
notified party to decide whether to fix the bug right there.
- It may be 3am when nobody is awake, so the action may be to just restart the
entire job.
- It may be a low traffic point, in which case one may decide to adjust the
threshold.
- It may be a critical time because the entire site is on fire, and this one
service is less important.
- It may be important enough that the decision is made to push a bugfix right
then and there, which is not always an easy task.
The only action in thermos would be to query aurora for permission, which would
be a boolean response. The only action in aurora would be to compare the number
of not-healthy instances to a rate limit (e.g., if not_serving_instances >
rate_limit: return False). This doesn't seem too complicated to build in and
would give aurora a good deal of repair power.
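To make the proposal concrete, here is a minimal Python sketch of the check
aurora would run when thermos asks for permission. Everything in it is an
assumption for illustration: may_remediate, its parameters, and the notify
callback are hypothetical names, not existing aurora or thermos APIs, and the
only logic taken from the comment above is the not_serving_instances >
rate_limit comparison.

```python
from typing import Callable


def may_remediate(
    not_serving_instances: int,
    rate_limit: int,
    notify: Callable[[str], None] = print,
) -> bool:
    """Hypothetical scheduler-side check; not an actual aurora API.

    Returns True if the executor may restart an unhealthy instance.
    Returns False (and notifies a human) once more than `rate_limit`
    instances of the job are already not serving.
    """
    if not_serving_instances > rate_limit:
        # Hold off on further restarts and let a person decide what to do.
        notify(
            "remediation paused: %d instances not serving (limit %d)"
            % (not_serving_instances, rate_limit)
        )
        return False
    return True


# The job from the example: 1000 instances, 10% allowed to be down.
TOTAL_INSTANCES = 1000
RATE_LIMIT = TOTAL_INSTANCES * 10 // 100  # 100 instances

allowed = may_remediate(not_serving_instances=40, rate_limit=RATE_LIMIT)
```

With these numbers a request made while 40 instances are down is allowed, and a
request made while 101 are down is refused and triggers the notification. How
aurora counts not-serving instances, and how the notification is delivered, is
left open here.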
> Allow scheduler to decide how to respond to task health check failures
> ----------------------------------------------------------------------
>
> Key: AURORA-279
> URL: https://issues.apache.org/jira/browse/AURORA-279
> Project: Aurora
> Issue Type: Story
> Components: Executor, Scheduler
> Reporter: Bill Farner
> Priority: Minor
>
> The executor is currently autonomous in deciding to kill tasks that have
> failed health checks. If health check failures synchronize across a service,
> the service could suffer an outage. SLA considerations may also need to be
> made before deciding to kill a task for health check failures.