Joe Smith created AURORA-1514:
---------------------------------
Summary: Allow users to give guidance on SLA for their job
Key: AURORA-1514
URL: https://issues.apache.org/jira/browse/AURORA-1514
Project: Aurora
Issue Type: Story
Components: Maintenance, SRE
Reporter: Joe Smith
There needs to be a standard process for customizing the SLA used to validate
that a task on a host can be killed in order to drain that host into
maintenance. Right now, the default is [95% over
30 minutes|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/admin/admin_util.py#L35],
but certain services (such as memcache) would fare much better under a
different SLA, for example 99% over 5 minutes.
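
To make the two example SLAs concrete, here is a minimal sketch of such a check. It assumes that "X% over Y minutes" means at least X% of a job's tasks must have been up for at least Y minutes once the candidate task is killed. {{SlaTarget}} and {{is_safe_to_kill}} are hypothetical names used for illustration, not existing Aurora APIs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical container for a per-job SLA (not an Aurora type)."""
    percentage: float   # required share of tasks that must stay up, in percent
    duration_secs: int  # minimum uptime for a task to count as "up"


# The current default described above: 95% over 30 minutes.
DEFAULT_SLA = SlaTarget(percentage=95.0, duration_secs=30 * 60)
# The memcache-style example from above: 99% over 5 minutes.
MEMCACHE_SLA = SlaTarget(percentage=99.0, duration_secs=5 * 60)


def is_safe_to_kill(task_uptimes_secs: Sequence[int],
                    victim_index: int,
                    sla: SlaTarget) -> bool:
    """Return True if the job still meets `sla` after killing one task.

    Assumed semantics: the job meets the SLA when the share of its tasks
    that have been up for at least `sla.duration_secs` is at least
    `sla.percentage`. The killed task counts as down.
    """
    total = len(task_uptimes_secs)
    if total == 0:
        return True  # nothing to protect
    healthy = sum(
        1
        for i, uptime in enumerate(task_uptimes_secs)
        if i != victim_index and uptime >= sla.duration_secs
    )
    return 100.0 * healthy / total >= sla.percentage
```

Under these assumptions, a 100-task job whose tasks have all been up for an hour can lose one task under either SLA, while a 10-task job cannot lose any task under the 95% default.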
We could build this tooling [around the existing {{aurora_admin
drain_hosts}}|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/admin/admin_util.py#L75],
but it would apply to all tasks on that host, which would increase complexity.
Lastly, in case we decide to make this user-settable rather than
operator-whitelistable, it is important that we still put firm barriers in
place around acceptable values, to prevent a service from setting 100% over 0
minutes and holding hosts hostage.
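
One way such barriers could look is sketched below. The limits {{MAX_PERCENTAGE}} and {{MIN_DURATION_SECS}} and the function {{validate_sla}} are hypothetical; the actual acceptable range would need to be decided by operators.

```python
from typing import List

# Hypothetical operator-defined limits; the real values are an open question.
MAX_PERCENTAGE = 99.9    # never allow a job to demand 100% of its tasks
MIN_DURATION_SECS = 60   # never allow a zero-length SLA window


def validate_sla(percentage: float, duration_secs: int) -> List[str]:
    """Return a list of problems with a requested SLA; empty means acceptable."""
    errors = []
    if not 0 < percentage <= MAX_PERCENTAGE:
        errors.append(
            "percentage must be in (0, %s], got %s" % (MAX_PERCENTAGE, percentage)
        )
    if duration_secs < MIN_DURATION_SECS:
        errors.append(
            "duration must be at least %s seconds, got %s"
            % (MIN_DURATION_SECS, duration_secs)
        )
    return errors
```

With these example limits, the current default (95% over 30 minutes) and the memcache example (99% over 5 minutes) both pass, while 100% over 0 minutes is rejected on both counts.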
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)