[ 
https://issues.apache.org/jira/browse/AURORA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Khutornenko reassigned AURORA-651:
----------------------------------------

    Assignee: Maxim Khutornenko

> perform_maintenance_hosts should not temporarily remove machines
> ----------------------------------------------------------------
>
>                 Key: AURORA-651
>                 URL: https://issues.apache.org/jira/browse/AURORA-651
>             Project: Aurora
>          Issue Type: Task
>          Components: Client
>            Reporter: David Robinson
>            Assignee: Maxim Khutornenko
>
> The aurora_admin tool provides the following drain/maintenance commands:
> - start_maintenance_hosts
>     The list of hosts is marked for maintenance, and will be de-prioritized
>     from consideration for scheduling.  Note, they are not removed from
>     consideration, and may still schedule tasks if resources are very scarce.
>     Usually you would mark a larger set of machines for drain, and then do
>     them in batches within the larger set, to help drained tasks not land on
>     future hosts that will be drained shortly in subsequent batches.
> - host_maintenance_status
>     Print the drain status of each supplied host.
> - perform_maintenance_hosts
>     Asks the scheduler to remove any running tasks from the machine and
>     remove it from service temporarily, perform some action on them, then
>     return the machines to service.
> - end_maintenance_hosts
>     The list of hosts is marked as not in a drained state anymore.  This will
>     allow normal scheduling to resume on the given list of hosts.
> The command that actually drains a machine is the perform_maintenance_hosts
> command; however, it only drains a machine *temporarily*. As soon as the
> machine is drained it is placed back into service, thereby allowing tasks to
> be scheduled on it. This default behavior is wrong. The expected workflow is
> that the --post_drain_script option is used and the script is expected to
> shut down the slave, typically by SSHing in and stopping the mesos process.
> It's not obvious that perform_maintenance_hosts's --post_drain_script must be 
> used along with a script to properly drain a machine, and the admin tool does 
> not provide any other commands that could be used to drain a machine *and 
> leave it drained*.
> The ideal solution is described in AURORA-43.
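
For illustration, a minimal sketch of the kind of script the description says is expected with --post_drain_script. Everything specific here is an assumption rather than something stated in the issue: that the drained hostname is passed to the script as its first argument, that the slave runs as a service named "mesos-slave", and that passwordless SSH and sudo are available on the host. The file name stop_slave.py is hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical post-drain script: stop the Mesos slave on a drained host."""
import subprocess
import sys

# Assumption: the slave runs as a service with this name; adjust for your hosts.
SLAVE_SERVICE = "mesos-slave"


def build_stop_command(host, service=SLAVE_SERVICE):
    """Return the argv that SSHes into `host` and stops the slave service."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, "sudo", "service", service, "stop"]


def main(argv):
    # Assumption: the drained hostname arrives as the first argument.
    if len(argv) != 2:
        print("usage: stop_slave.py HOSTNAME", file=sys.stderr)
        return 2
    # Propagate the ssh exit status so a failed stop is visible to the caller.
    return subprocess.call(build_stop_command(argv[1]))


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

With a script like this supplied through --post_drain_script, the slave is already stopped by the time the host is returned to service, which is what keeps new tasks from landing on it.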



--
This message was sent by Atlassian JIRA
(v6.2#6252)
