David Robinson created AURORA-651:
-------------------------------------

             Summary: perform_maintenance_hosts should not temporarily remove machines
                 Key: AURORA-651
                 URL: https://issues.apache.org/jira/browse/AURORA-651
             Project: Aurora
          Issue Type: Task
          Components: Client
            Reporter: David Robinson


The aurora_admin tool provides the following drain/maintenance commands:

- start_maintenance_hosts

    The list of hosts is marked for maintenance, and will be de-prioritized
    from consideration for scheduling.  Note, they are not removed from
    consideration, and may still schedule tasks if resources are very scarce.
    Usually you would mark a larger set of machines for drain, and then do
    them in batches within the larger set, to help drained tasks not land on
    future hosts that will be drained shortly in subsequent batches.

- host_maintenance_status

    Print the drain status of each supplied host.

- perform_maintenance_hosts

    Asks the scheduler to remove any running tasks from the machine and
    remove it from service temporarily, perform some action on them, then
    return the machines to service.

- end_maintenance_hosts

    The list of hosts is marked as not in a drained state anymore.  This will
    allow normal scheduling to resume on the given list of hosts.

The command that actually drains a machine is perform_maintenance_hosts, but
it only drains the machine *temporarily*. As soon as the machine is drained it
is placed back into service, which allows tasks to be scheduled on it again.
This default behavior is wrong. The expected workflow is to pass the
--post_drain_script option with a script that shuts down the slave, typically
by SSHing in and stopping the mesos process. It's not obvious that
--post_drain_script must be used to properly drain a machine, and the admin
tool does not provide any other command that can drain a machine *and leave
it drained*.
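
For illustration, a script of the kind --post_drain_script expects might look
roughly like the sketch below. This is only a sketch under assumptions that
the issue does not state: that the drained host's name is passed to the script
as its first argument, that the slave runs as a service named "mesos-slave",
and that the admin machine has non-interactive SSH and sudo access. The names
build_stop_command and SERVICE_NAME are hypothetical.

```python
"""Hypothetical post-drain script: stop the Mesos slave on a drained host."""
import subprocess
import sys

# Assumption: the slave runs as a service with this name; adjust per site.
SERVICE_NAME = "mesos-slave"


def build_stop_command(hostname, service=SERVICE_NAME):
    """Return the ssh argument list that stops `service` on `hostname`."""
    # Reject empty names and anything ssh could mistake for an option.
    if not hostname or hostname.startswith("-"):
        raise ValueError("invalid hostname: %r" % (hostname,))
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", hostname,
            "sudo", "service", service, "stop"]


def main(argv):
    """Stop the slave on the host named in argv[1]; return an exit status."""
    # Assumption: the drained host's name arrives as the first argument.
    if len(argv) != 2:
        sys.stderr.write("usage: %s HOSTNAME\n" % argv[0])
        return 2
    return subprocess.call(build_stop_command(argv[1]))

# To run this as a script, finish the file with:
#     sys.exit(main(sys.argv))
```

Because the slave is stopped before the host is returned to service, nothing
can be scheduled on it afterwards, which is the "leave it drained" behavior
that the command does not provide on its own.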

The ideal solution is described in AURORA-43.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
