Jeff Currier created MESOS-695:
----------------------------------

             Summary: Introduce automated self-healing and coordinated repair 
to Mesos
                 Key: MESOS-695
                 URL: https://issues.apache.org/jira/browse/MESOS-695
             Project: Mesos
          Issue Type: Task
          Components: master
            Reporter: Jeff Currier


One capability presently missing from Mesos is the ability for the system to 
self-heal: specifically, the ability for a master to detect that something is 
amiss with a particular host and then attempt to heal that host through a set 
of automated corrective actions such as:

1) restarting a process on the suspect node
2) rebooting the node
3) reimaging the node
4) blacklisting the node from future scheduled work
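
One way to read the list above is as an escalation ladder, where each action 
is tried only after the previous one has failed to heal the host. That 
ordering is an assumption; the issue lists the actions without saying they must 
be applied in sequence. A minimal sketch under that assumption, with 
hypothetical names (RepairAction, next_action) that are not part of Mesos:

```python
from enum import IntEnum
from typing import Optional


class RepairAction(IntEnum):
    """Corrective actions from the list above, least to most disruptive.

    Hypothetical type for illustration; not a Mesos API.
    """
    RESTART_PROCESS = 1
    REBOOT = 2
    REIMAGE = 3
    BLACKLIST = 4


def next_action(last: Optional[RepairAction]) -> RepairAction:
    """Return the action to try after `last` failed to heal the host.

    `last` is None when no repair has been attempted yet. Assumes strict
    escalation: each failure moves one step up the ladder, and a host
    that reaches BLACKLIST stays blacklisted.
    """
    if last is None:
        return RepairAction.RESTART_PROCESS
    if last is RepairAction.BLACKLIST:
        return RepairAction.BLACKLIST
    return RepairAction(last + 1)
```

For example, next_action(None) yields RESTART_PROCESS and 
next_action(RepairAction.REBOOT) yields REIMAGE. A real policy would probably 
also need retry counts and timeouts per step, which this sketch leaves out.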

By adding this capability and informing schedulers of the behavior of the 
hosts within the system, it's believed that we can get Mesos to function in 
more of a 'lights out' mode, thereby reducing the operational expenditure 
(OpEx) of running the system.

Note that a certain amount of coordination will be required to ensure that we 
don't 'repair' too many nodes at the same time.  This logic will need to be 
centralized, with a single elected authority making these decisions.
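
The coordination constraint could be sketched as a central budget on 
concurrent repairs. Everything here is an assumption for illustration: the 
RepairCoordinator class, its method names, and the max_concurrent limit are 
hypothetical, and the election of the coordinator itself is not modeled.

```python
class RepairCoordinator:
    """Central authority that caps how many hosts are in repair at once.

    Hypothetical sketch; assumes a single instance runs on whichever
    node has been elected to make repair decisions.
    """

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self._max_concurrent = max_concurrent
        self._in_repair = set()

    def request_repair(self, host: str) -> bool:
        """Return True if `host` may be repaired now, False if it must wait."""
        if host in self._in_repair:
            return True  # already approved; the request is idempotent
        if len(self._in_repair) >= self._max_concurrent:
            return False  # budget exhausted; caller should retry later
        self._in_repair.add(host)
        return True

    def repair_finished(self, host: str) -> None:
        """Release the slot held by `host`, whether or not the repair worked."""
        self._in_repair.discard(host)
```

With max_concurrent=2, a third host is refused until one of the first two 
calls repair_finished. A fixed count is the simplest possible policy; a 
percentage of the cluster or a per-rack limit may suit a real deployment 
better.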

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
