Benjamin Mahler created MESOS-1474:
--------------------------------------
Summary: Provide cluster maintenance primitives for operators.
Key: MESOS-1474
URL: https://issues.apache.org/jira/browse/MESOS-1474
Project: Mesos
Issue Type: Epic
Components: framework, master, slave
Reporter: Benjamin Mahler
Normally, cluster upgrades can be done seamlessly using the built-in slave
recovery feature. However, there are situations where operators want to be able
to perform destructive maintenance operations on machines:
* Non-recoverable slave upgrades.
* Machine reboots.
* Kernel upgrades.
* etc.
In these situations, best practice is to perform rolling maintenance in large
batches of machines. This can be problematic for frameworks when many related
tasks are located within a batch of machines going down for maintenance.
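To make the concern concrete, here is a minimal sketch in Python (not Mesos
code; the function name, framework names and host names are all hypothetical)
that computes what fraction of each framework's tasks a given maintenance batch
would take down at once:

```python
from collections import Counter


def batch_impact(placements, batch):
    """Return {framework: fraction of its tasks on machines in `batch`}.

    `placements` is a list of (framework, host) pairs, one per running task.
    """
    batch = set(batch)
    total = Counter(framework for framework, _ in placements)
    hit = Counter(framework for framework, host in placements if host in batch)
    return {framework: hit[framework] / total[framework] for framework in total}


# Hypothetical cluster: framework "db" has 3 of its 4 tasks on hosts h1 and h2.
placements = [
    ("db", "h1"), ("db", "h1"), ("db", "h2"), ("db", "h3"),
    ("web", "h2"), ("web", "h3"), ("web", "h4"), ("web", "h4"),
]
impact = batch_impact(placements, ["h1", "h2"])
```

In this made-up layout, taking h1 and h2 down together removes most of the "db"
tasks in one step, which is the situation a framework would want advance
warning about.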
There are a few primitives of interest here:
* Provide a way for operators to fully shut down a slave (killing all tasks
underneath it).
* Provide a way for operators to mark specific slaves as undergoing
maintenance. This means that no more offers are sent for these slaves,
and no new tasks will launch on them.
* Provide a way for frameworks to be notified when resources are requested to
be relinquished. This gives the framework the opportunity to proactively move
a task before it is forcibly killed. It also allows the automation of
operations like: "please drain and shut down these slaves within 1 hour."
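
One way to picture how the three primitives could fit together is the toy model
below. It is only a sketch under stated assumptions: `ToyMaster`, `Slave`, the
`notify` callback and every method name are hypothetical and are not part of
any Mesos API. Time is a plain number of seconds supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Slave:
    tasks: List[str] = field(default_factory=list)
    in_maintenance: bool = False
    drain_deadline: Optional[float] = None


class ToyMaster:
    """Hypothetical stand-in for the master; not the Mesos API."""

    def __init__(self, notify: Callable[[str, List[str], float], None]):
        self.slaves: Dict[str, Slave] = {}
        self.notify = notify  # stands in for a notification sent to frameworks

    def shutdown(self, slave_id: str) -> List[str]:
        # Primitive 1: remove the slave; every task still on it is killed.
        return self.slaves.pop(slave_id).tasks

    def start_maintenance(self, slave_id: str) -> None:
        # Primitive 2: stop offering this slave's resources.
        self.slaves[slave_id].in_maintenance = True

    def offerable(self) -> List[str]:
        # Slaves whose resources may still be offered to frameworks.
        return sorted(s for s, slave in self.slaves.items()
                      if not slave.in_maintenance)

    def request_drain(self, slave_id: str, deadline: float) -> None:
        # Primitive 3: tell frameworks which tasks must move, and by when.
        slave = self.slaves[slave_id]
        self.start_maintenance(slave_id)
        slave.drain_deadline = deadline
        self.notify(slave_id, list(slave.tasks), deadline)

    def enforce_deadlines(self, now: float) -> Dict[str, List[str]]:
        # Shut down every draining slave whose deadline has passed.
        expired = [s for s, slave in self.slaves.items()
                   if slave.drain_deadline is not None
                   and now >= slave.drain_deadline]
        return {s: self.shutdown(s) for s in expired}


notices = []
master = ToyMaster(notify=lambda slave_id, tasks, deadline:
                   notices.append((slave_id, tasks, deadline)))
master.slaves = {"s1": Slave(tasks=["t1", "t2"]), "s2": Slave(tasks=["t3"])}

master.request_drain("s1", deadline=3600.0)  # "drain within 1 hour"
offers = master.offerable()                  # s1 is no longer offered
master.slaves["s1"].tasks.remove("t1")       # framework moved t1 in time
killed = master.enforce_deadlines(now=3600.0)
```

In this run the framework is told about both tasks on s1, moves t1 before the
deadline, and only t2 is forcibly killed when the deadline expires.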
--
This message was sent by Atlassian JIRA
(v6.2#6252)