Joe Smith created AURORA-1514:
---------------------------------

             Summary: Allow users to give guidance on SLA for their job
                 Key: AURORA-1514
                 URL: https://issues.apache.org/jira/browse/AURORA-1514
             Project: Aurora
          Issue Type: Story
          Components: Maintenance, SRE
            Reporter: Joe Smith


There needs to be a standard process for customizing the SLA used to validate that a 
task on a host can be killed to drain that host into maintenance. Right now, 
the default is [95% over 30 minutes|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/admin/admin_util.py#L35],
 but there are certain services (such as memcache) which would fare much 
better under a different SLA, for example 99% over 5 minutes.
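
To make the two numbers concrete, here is a minimal Python sketch of how a "percentage over duration" check could gate a kill. It is only an illustration and not the logic in admin_util.py: the function name {{can_kill_task}}, the list-of-uptimes input, and the exact definition of "meeting the SLA" are all assumptions.

```python
def can_kill_task(uptimes_secs, victim_index, percentage, duration_secs):
    """Return True if killing one task still leaves the job within its SLA.

    Assumed model (hypothetical): a job meets its SLA when at least
    `percentage` percent of its instances have been up for at least
    `duration_secs` seconds.
    """
    total = len(uptimes_secs)
    if total == 0:
        # Nothing is running, so there is nothing to protect.
        return True
    # Count instances, other than the one being killed, that have been up
    # for the whole window.
    healthy_after_kill = sum(
        1
        for i, up in enumerate(uptimes_secs)
        if i != victim_index and up >= duration_secs
    )
    return 100.0 * healthy_after_kill / total >= percentage
```

Under this model, a job with 20 long-running instances could lose one task at 95% over 30 minutes, while the same kill would be refused at 99% over 5 minutes.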

We could build this tooling [around the existing {{aurora_admin drain_hosts}}|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/admin/admin_util.py#L75],
 but an SLA supplied that way would apply to all tasks on that host rather 
than per job, which would increase complexity.

Lastly, in case we decide to make this user-settable rather than 
operator-whitelistable, it is important that we still put firm barriers in 
place around acceptable values to prevent a service from setting 100% over 0 
minutes and holding hosts hostage.
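
One possible shape for those barriers, as a hedged Python sketch: reject any user-supplied SLA that falls outside operator-chosen limits. The names {{SlaPolicy}} and {{validate_sla}} and the limit values below are hypothetical and are not defined by Aurora.

```python
from collections import namedtuple

# Hypothetical container for a user-supplied SLA.
SlaPolicy = namedtuple("SlaPolicy", ["percentage", "duration_secs"])

# Assumed operator-chosen limits; these values are placeholders.
MAX_PERCENTAGE = 99.9        # strictly below 100 so a drain can always progress
MIN_DURATION_SECS = 60       # rejects "over 0 minutes"
MAX_DURATION_SECS = 60 * 60  # caps how long a job can delay a drain


def validate_sla(percentage, duration_secs):
    """Return an SlaPolicy, or raise ValueError if it is outside the limits."""
    if not 0 < percentage <= MAX_PERCENTAGE:
        raise ValueError(
            "percentage must be in (0, %s], got %s" % (MAX_PERCENTAGE, percentage)
        )
    if not MIN_DURATION_SECS <= duration_secs <= MAX_DURATION_SECS:
        raise ValueError(
            "duration_secs must be in [%s, %s], got %s"
            % (MIN_DURATION_SECS, MAX_DURATION_SECS, duration_secs)
        )
    return SlaPolicy(percentage, duration_secs)
```

With these placeholder limits, 99% over 5 minutes is accepted, while 100% over 0 minutes is rejected before it can block a drain.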



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
