[ 
https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000234#comment-14000234
 ] 

Charlie Carson commented on MESOS-695:
--------------------------------------

Moving this back to Open since I'm not doing any work on it this quarter.

> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
>                 Key: MESOS-695
>                 URL: https://issues.apache.org/jira/browse/MESOS-695
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Jeff Currier
>            Assignee: Charlie Carson
>
> One capability that is presently missing within the Mesos framework is the
> ability for the system to self-heal. Specifically, a master should be able
> to detect that something is amiss with a particular host and then attempt to
> heal that host through a set of automated corrective actions such as:
> 1) restarting the process on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting the node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the
> hosts within the system, it's believed that we can get Mesos to function in
> more of a "lights out" mode, thereby reducing the OpEx costs of running the
> system today.
> It should be noted that a certain amount of coordination will be required in
> order to ensure that we don't "repair" too many nodes at the same time.
> This logic will need to be centralized, with a single elected authority
> making these decisions.
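
The escalation and coordination described above can be sketched roughly as
follows. This is only an illustration of the idea, not anything that exists
in Mesos: the RepairCoordinator class, its method names, and the
max_concurrent cap are all hypothetical, and leader election of the central
authority is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field

# Escalation ladder taken from the issue description, mildest action first.
ACTIONS = ["restart_process", "reboot", "reimage", "blacklist"]


@dataclass
class RepairCoordinator:
    """Hypothetical central authority that rations concurrent repairs."""

    max_concurrent: int                            # assumed cap on hosts under repair
    next_step: dict = field(default_factory=dict)  # host -> index of next action
    in_repair: set = field(default_factory=set)    # hosts currently being repaired

    def request_repair(self, host):
        """Return the next action for host, or None if deferred or exhausted."""
        if host in self.in_repair:
            return None  # a repair is already running on this host
        if len(self.in_repair) >= self.max_concurrent:
            return None  # too many nodes under repair; try again later
        step = self.next_step.get(host, 0)
        if step >= len(ACTIONS):
            return None  # ladder exhausted; host is already blacklisted
        self.next_step[host] = step + 1
        self.in_repair.add(host)
        return ACTIONS[step]

    def repair_finished(self, host, healthy):
        """Release the repair slot; reset the ladder if the host recovered."""
        self.in_repair.discard(host)
        if healthy:
            self.next_step.pop(host, None)
```

With max_concurrent=1, a second suspect host is deferred until the first
repair finishes, and a host that stays unhealthy moves from restart to reboot
on its next request.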



--
This message was sent by Atlassian JIRA
(v6.2#6252)
