[ https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768482#comment-13768482 ]
Jeff Currier commented on MESOS-695:
------------------------------------

That works. I'm already working with the Mesos team here at Twitter on this feature, so that works for me. Thanks Ben!

> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
> Key: MESOS-695
> URL: https://issues.apache.org/jira/browse/MESOS-695
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Jeff Currier
>
> One capability that is presently missing within the Mesos framework is the
> ability for the system to self-heal. Specifically, it lacks the ability for a
> master to detect that something is amiss with a particular host and then to
> attempt to heal that host through a set of automated corrective actions such as:
> 1) restarting a process on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting the node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the
> hosts within the system, it's believed that we can get Mesos to function in
> more of a "lights out" mode, thereby reducing the OpEx costs of running the
> system today.
> It should be noted that a certain amount of coordination will be required
> to ensure that we don't "repair" too many nodes at the same time.
> This logic will need to be centralized, such that there is a central
> authority that is elected to make these decisions.
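
For illustration only, the four corrective actions and the coordination requirement in the description could be sketched roughly as below. This is a minimal sketch, not a Mesos API or the design the ticket settled on: `RepairAction`, `RepairCoordinator`, `request_repair`, `repair_finished`, and `max_concurrent_repairs` are all hypothetical names, the one-step-at-a-time escalation policy is an assumption, and the election of the central authority is not modeled at all.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class RepairAction(IntEnum):
    """Escalation ladder, mildest first, in the order listed in the issue."""
    RESTART_PROCESS = 1
    REBOOT = 2
    REIMAGE = 3
    BLACKLIST = 4


@dataclass
class RepairCoordinator:
    """Hypothetical central authority; assumed to run on the elected leader."""
    max_concurrent_repairs: int  # assumed cap; the issue gives no number
    in_repair: set = field(default_factory=set)
    last_action: dict = field(default_factory=dict)

    def request_repair(self, host):
        """Return the next action for a suspect host, or None if deferred."""
        if host in self.in_repair:
            return None  # a repair is already running on this host
        if len(self.in_repair) >= self.max_concurrent_repairs:
            return None  # too many hosts under repair; caller retries later
        previous = self.last_action.get(host)
        if previous is None:
            action = RepairAction.RESTART_PROCESS
        else:
            # Escalate one step, stopping at the last rung (blacklisting).
            action = RepairAction(min(previous + 1, RepairAction.BLACKLIST))
        self.last_action[host] = action
        self.in_repair.add(host)
        return action

    def repair_finished(self, host, healthy):
        """Release the repair slot; forget history once the host is healthy."""
        self.in_repair.discard(host)
        if healthy:
            self.last_action.pop(host, None)
```

With a cap of one, a second suspect host is deferred until the first repair is reported finished, and a host that stays unhealthy moves one step up the ladder on each new request until it is blacklisted.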