[ 
https://issues.apache.org/jira/browse/AURORA-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974658#comment-14974658
 ] 

Brian Weber commented on AURORA-279:
------------------------------------

It shouldn't be too much to ask for a guard rail that prevents a health check 
reaction from taking down an entire job. Adding an integer such as 
max_concurrent_restarts (perhaps defaulting to the batch size?) also doesn't 
look like a huge addition, and it would let customers who don't have a central 
remediation framework allow aurora to manage the failure rate.
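
For illustration only, such a knob might sit alongside the existing health 
check settings in a job config. The max_concurrent_restarts field below is the 
proposed (hypothetical) integer and does not exist today; the surrounding 
fields are only assumed to mirror the current HealthCheckConfig shape:

  health_check_config = HealthCheckConfig(
    initial_interval_secs = 15,
    interval_secs = 10,
    max_consecutive_failures = 3,
    # Proposed, hypothetical: cap on the number of instances restarted due to
    # failed health checks at any one time (perhaps defaulting to the update
    # batch size).
    max_concurrent_restarts = 10,
  )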

e.g.: a job with 1000 instances can serve well enough with 10% (100 instances) 
down. Now suppose a bug runs wild and instances arbitrarily start responding 
unhealthy. If a restart temporarily fixes the problem until the next deploy, 
cool. But if the bug hits enough instances that, between the bug and the 
restarts already in flight, over 100 instances are down, then the configured 
health check reaction would take down so many instances that the service could 
stop serving well at all. 

Suppose instead that thermos queried aurora for permission to remediate; aurora 
could then ratelimit remediations and send a notification to someone who can 
respond more immediately. Aurora would then know that 10% of the fleet is 
down and hold off while a human is notified. It would then be up to the 
notified party to decide whether to fix the bug right there.

- It may be 3am when nobody is awake, so the action may be to just restart the 
entire job.
- It may be a low traffic point, in which case one may decide to adjust the 
threshold.
- It may be a critical time because the entire site is on fire, and this one 
service is the lesser concern.
- It may be important enough that the decision is made to push a bugfix right 
then and there, which is not always an easy task.

The only action in thermos would be to query aurora for permission, which would 
be a boolean response. The only action in aurora would be to compare the number 
of not-healthy instances to a ratelimit (e.g., if not_serving_instances > 
rate_limit: return False). This doesn't seem too complicated to build and 
would give aurora a great deal of repair power.
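
To make that concrete, here is a minimal, purely illustrative sketch in plain 
Python; none of these names (may_restart, unhealthy_count, 
max_unhealthy_fraction) are real scheduler or executor APIs:

  # Hypothetical scheduler-side check: grant a health-check-driven restart
  # only while the number of instances already not serving stays under the
  # configured ceiling.
  def may_restart(unhealthy_count, total_instances,
                  max_unhealthy_fraction=0.10):
      rate_limit = int(total_instances * max_unhealthy_fraction)
      return unhealthy_count < rate_limit

  # Hypothetical executor-side use: ask before killing; if permission is
  # denied, hold off and let the scheduler notify a human instead.
  if may_restart(unhealthy_count=104, total_instances=1000):
      pass  # proceed with the normal kill/restart
  else:
      pass  # hold off; a notification goes out instead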

> Allow scheduler to decide how to respond to task health check failures
> ----------------------------------------------------------------------
>
>                 Key: AURORA-279
>                 URL: https://issues.apache.org/jira/browse/AURORA-279
>             Project: Aurora
>          Issue Type: Story
>          Components: Executor, Scheduler
>            Reporter: Bill Farner
>            Priority: Minor
>
> The executor is currently autonomous in deciding to kill tasks that have 
> failed health checks.  If health check failures synchronize across a service, 
> the service could suffer an outage.  SLA considerations may also need to be 
> made before deciding to kill a task for health check failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
