Jeff Currier created MESOS-695:
----------------------------------
Summary: Introduce automated self-healing and coordinated repair
to Mesos
Key: MESOS-695
URL: https://issues.apache.org/jira/browse/MESOS-695
Project: Mesos
Issue Type: Task
Components: master
Reporter: Jeff Currier
One capability that is presently missing from Mesos is the ability for the
system to self-heal. Specifically, the ability for a master to detect that
something is amiss with a particular host and then to attempt to heal that
host through a set of automated corrective actions such as:
1) restarting the process on the suspect node
2) rebooting the node
3) reimaging the node
4) blacklisting the node from future scheduled work
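To make the idea concrete, the corrective actions above could be treated as an
escalation ladder, where the master moves to the next, more drastic action only
when the previous one failed to heal the host. This is a minimal sketch, not
part of Mesos: the names RepairAction and next_action are hypothetical, and
treating the list order as a strict escalation order is an assumption.

```python
from enum import IntEnum


class RepairAction(IntEnum):
    """Corrective actions from the issue, ordered least to most drastic (assumed)."""
    RESTART_PROCESS = 1
    REBOOT_NODE = 2
    REIMAGE_NODE = 3
    BLACKLIST_NODE = 4


def next_action(previous):
    """Return the action to try after `previous` failed to heal the host.

    `previous` is None when no repair has been attempted yet. Blacklisting is
    the last rung, so it is returned again once the ladder is exhausted.
    """
    if previous is None:
        return RepairAction.RESTART_PROCESS
    if previous is RepairAction.BLACKLIST_NODE:
        return RepairAction.BLACKLIST_NODE
    return RepairAction(previous + 1)
```

A caller would record the last action attempted for each host and pass it back
in the next time that host is still reported as unhealthy.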
By adding this capability and informing schedulers of the behavior of the
hosts within the system, it's believed that we can get Mesos to function in
more of a 'lights out' mode, thereby reducing the OpEx of running the system
today.
It should be noted that a certain amount of coordination will be required in
order to ensure that we don't 'repair' too many nodes at the same time. This
logic will need to be centralized, such that a single elected authority makes
these decisions.
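One way to picture the coordination requirement is a central coordinator that
grants repair permission only while the number of hosts under repair is below a
fixed budget. This is a rough sketch under stated assumptions: the
RepairCoordinator class, its method names, and the max_concurrent parameter are
hypothetical, and the election of the central authority itself is not modeled.

```python
class RepairCoordinator:
    """Hypothetical central authority that caps simultaneous repairs."""

    def __init__(self, max_concurrent):
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self.max_concurrent = max_concurrent
        self.in_repair = set()  # hosts currently being repaired

    def try_begin(self, host):
        """Grant permission to repair `host`; return True only if granted."""
        if host in self.in_repair:
            return False  # a repair is already in flight for this host
        if len(self.in_repair) >= self.max_concurrent:
            return False  # budget exhausted; the caller should retry later
        self.in_repair.add(host)
        return True

    def finish(self, host):
        """Release the slot held by `host` once its repair has completed."""
        self.in_repair.discard(host)
```

With a budget of 2, for example, a third repair request is refused until one of
the first two hosts calls finish. A real deployment would likely size the
budget relative to cluster capacity rather than use a fixed number.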