David Robinson created AURORA-651:
-------------------------------------
Summary: perform_maintenance_hosts should not temporarily remove machines
Key: AURORA-651
URL: https://issues.apache.org/jira/browse/AURORA-651
Project: Aurora
Issue Type: Task
Components: Client
Reporter: David Robinson
The aurora_admin tool provides the following drain/maintenance commands:
- start_maintenance_hosts
The list of hosts is marked for maintenance and will be de-prioritized
for scheduling. Note that the hosts are not removed from consideration,
and tasks may still be scheduled on them if resources are very scarce.
Usually you would mark a larger set of machines for drain and then drain
them in batches within that set, so that drained tasks do not land on
hosts that will themselves be drained in a subsequent batch.
- host_maintenance_status
Print the drain status of each supplied host.
- perform_maintenance_hosts
Asks the scheduler to remove any running tasks from the machine and remove
it from service temporarily, perform some action on them, then return the
machines to service.
- end_maintenance_hosts
The list of hosts is marked as not in a drained state anymore. This will
allow normal scheduling to resume on the given list of hosts.
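
The batching advice under start_maintenance_hosts can be sketched roughly as
follows. This is only an illustration of the idea, not code from the admin
tool: the names `batches` and `drain_in_batches` are hypothetical, and `drain`
stands in for whatever actually drains a batch of hosts (for example a wrapper
around perform_maintenance_hosts). The full host list is assumed to have been
marked for maintenance up front.

```python
def batches(hosts, size):
    """Split hosts into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def drain_in_batches(hosts, size, drain):
    """Drain `hosts` one batch at a time.

    `drain` is a hypothetical callable that drains one batch of hosts.
    The whole `hosts` list is assumed to be marked for maintenance already,
    so tasks evicted from one batch are less likely to land on a host that
    belongs to a later batch.
    """
    drained = []
    for batch in batches(hosts, size):
        drain(batch)
        drained.extend(batch)
    return drained
```

Marking the whole set first and only then draining batch by batch is what
keeps evicted tasks away from hosts that are about to be drained themselves.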
The command that actually drains a machine is perform_maintenance_hosts, but
it only drains a machine *temporarily*. As soon as the machine is drained it is
placed back into service, which allows tasks to be scheduled on it again. This
default behavior is wrong. The expected workflow is that the
--post_drain_script option is used and the script shuts down the slave,
typically by SSHing in and stopping the mesos process. It's not obvious that
perform_maintenance_hosts's --post_drain_script must be used along with such a
script to properly drain a machine, and the admin tool does not provide any
other command that can drain a machine *and leave it drained*.
The ideal solution is described in AURORA-43.
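
For illustration, a post-drain script of the kind described above might look
roughly like the sketch below. It rests on several assumptions that are not
confirmed in this issue: that the script is invoked with the drained hostname
as its only argument, that the slave runs as a service named `mesos-slave`
(a hypothetical name), and that the operator has passwordless SSH and sudo on
the host. The function names are made up for the example.

```python
import subprocess
import sys


def build_stop_command(hostname, service="mesos-slave"):
    """Build the ssh command that stops the slave on `hostname`.

    The service name is an assumption; adjust it to the local deployment.
    """
    return ["ssh", hostname, "sudo", "service", service, "stop"]


def main(argv):
    """Entry point; assumes argv is [script_name, hostname]."""
    if len(argv) != 2:
        print("usage: post_drain.py <hostname>", file=sys.stderr)
        return 2
    # Returns ssh's exit status, so a failed stop is visible to the caller.
    return subprocess.call(build_stop_command(argv[1]))

# To run this as a standalone script, call: sys.exit(main(sys.argv))
```

With a script like this passed via --post_drain_script, the slave is already
stopped by the time the host is returned to service, so the host stays empty.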