[
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Artem Harutyunyan updated MESOS-1474:
-------------------------------------
Labels: mesosphere twitter (was: twitter)
> Provide cluster maintenance primitives for operators.
> -----------------------------------------------------
>
> Key: MESOS-1474
> URL: https://issues.apache.org/jira/browse/MESOS-1474
> Project: Mesos
> Issue Type: Epic
> Components: framework, master, slave
> Reporter: Benjamin Mahler
> Labels: mesosphere, twitter
>
> Sometimes operators need to perform maintenance on a Mesos cluster; we define
> maintenance here as anything that requires tasks to be drained from the
> slave(s). Most Mesos upgrades can be done without affecting running tasks,
> but there are situations where maintenance is task-affecting:
> * Host maintenance (e.g. hardware repair, kernel upgrades).
> * Non-recoverable slave upgrades (e.g. adjusting slave attributes).
> * etc
> In order to ensure operators don’t violate frameworks’ SLAs, schedulers need
> to be aware of planned unavailability events.
> Maintenance awareness allows schedulers to avoid churn for long-running tasks
> by placing them on machines not undergoing maintenance. If all machines are
> scheduled for maintenance, then the scheduler will prefer the machines whose
> maintenance is least imminent.
> Maintenance awareness is also crucial when a scheduler uses [persistent
> disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure
> that the scheduler is aware of the expected duration of unavailability for a
> persistent disk resource (e.g. with three 1TB replicas, there is no need to
> replicate 1TB over the network when only one of the three replicas is going
> to be unavailable for a reboot lasting under 1 hour).
> There are a few primitives of interest here:
> * Provide a way for operators to [fully shut down a
> slave|https://issues.apache.org/jira/browse/MESOS-1475] (killing all tasks
> underneath it). Colloquially known as a "hard drain".
> * Provide a way for operators to mark specific slaves as scheduled for
> maintenance. This will inform the scheduler about the scheduled
> unavailability of the resources.
> * Provide a way for frameworks to be notified when resources are requested to
> be relinquished. This gives the framework a chance to proactively move a task
> before it may be forcibly killed by an operator. It also allows the automation
> of operations like: "please drain these slaves within 1 hour."
> See the [design
> doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#]
> for the latest details.
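As a rough illustration of the scheduler-side behaviour described above, here is a minimal Python sketch. It is not the Mesos API: the Offer class, its maintenance_starts_in field, and both function names are hypothetical, invented only to show (a) preferring slaves with no planned maintenance, then those whose maintenance is least imminent, and (b) skipping re-replication of a persistent disk when the expected outage is short.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Offer:
    """Hypothetical stand-in for a resource offer; not a real Mesos type."""
    slave_id: str
    # Seconds until planned maintenance starts; None means none is scheduled.
    maintenance_starts_in: Optional[float] = None


def rank_offers(offers: List[Offer]) -> List[Offer]:
    """Return offers ordered from most to least preferable for a long-running task."""
    def key(offer: Offer) -> Tuple[int, float]:
        if offer.maintenance_starts_in is None:
            # Slaves with no planned maintenance sort first.
            return (0, 0.0)
        # Among slaves with planned maintenance, later start times sort first.
        return (1, -offer.maintenance_starts_in)
    return sorted(offers, key=key)


def should_rereplicate(expected_outage_secs: float,
                       tolerated_outage_secs: float) -> bool:
    """Re-replicate only if the replica is expected to be gone longer than tolerated."""
    return expected_outage_secs > tolerated_outage_secs


if __name__ == "__main__":
    offers = [Offer("a", 600), Offer("b"), Offer("c", 7200)]
    print([o.slave_id for o in rank_offers(offers)])
    # A 30-minute reboot against an assumed 1-hour tolerance.
    print(should_rereplicate(30 * 60, 60 * 60))
```

The 1-hour tolerance is only an assumed threshold echoing the reboot example in the description; a real framework would choose its own policy.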
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)