[ https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000234#comment-14000234 ]
Charlie Carson commented on MESOS-695:
--------------------------------------
Moving this back to Open since I'm not doing any work on it this quarter.
> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
> Key: MESOS-695
> URL: https://issues.apache.org/jira/browse/MESOS-695
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Jeff Currier
> Assignee: Charlie Carson
>
> One capability that is presently missing within the Mesos framework is the
> ability for the system to self-heal. Specifically, a master should be able
> to detect that something is amiss with a particular host and then attempt to
> heal that host through a set of automated corrective actions such as:
> 1) restarting processes on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting the node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the
> hosts within the system, it is believed that we can get Mesos to function in
> more of a 'lights out' mode, thereby reducing the OpEx of running the
> system.
> Note that a certain amount of coordination will be required to ensure that
> we don't 'repair' too many nodes at the same time. This logic will need to
> be centralized, with a central authority that is elected to make these
> decisions.
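
The escalation and coordination described in the quoted issue could be
sketched roughly as follows. This is only an illustration and not part of
Mesos: the RepairCoordinator class, its method names, and the max_concurrent
cap are all hypothetical, and the election of the central authority is
assumed to happen elsewhere. The four actions are the ones listed in the
issue, tried in order from mildest to most drastic.

```python
from dataclasses import dataclass, field

# Escalation ladder taken from the issue description, mildest action first.
ACTIONS = ["restart_process", "reboot", "reimage", "blacklist"]


@dataclass
class RepairCoordinator:
    """Hypothetical central authority that rations repairs across the cluster."""

    max_concurrent: int  # assumed cap on hosts under repair at the same time
    in_repair: set = field(default_factory=set)
    attempts: dict = field(default_factory=dict)  # host -> index of next action

    def request_repair(self, host):
        """Return the next action for host, or None if the repair must wait."""
        if host in self.in_repair or len(self.in_repair) >= self.max_concurrent:
            return None
        # Stay on the last rung (blacklist) once the ladder is exhausted.
        step = min(self.attempts.get(host, 0), len(ACTIONS) - 1)
        self.attempts[host] = step + 1
        self.in_repair.add(host)
        return ACTIONS[step]

    def finish_repair(self, host, healthy):
        """Release the repair slot; forget the host's history once it is healthy."""
        self.in_repair.discard(host)
        if healthy:
            self.attempts.pop(host, None)


coordinator = RepairCoordinator(max_concurrent=1)
first = coordinator.request_repair("host-a")    # mildest action for host-a
blocked = coordinator.request_repair("host-b")  # cap reached, so host-b waits
coordinator.finish_repair("host-a", healthy=False)
second = coordinator.request_repair("host-a")   # escalates to the next action
```

Keeping the cap and the per-host history in one object mirrors the issue's
point that these decisions need a single authority; a real design would also
have to persist this state across master failover.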
--
This message was sent by Atlassian JIRA
(v6.2#6252)