[ 
https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768482#comment-13768482
 ] 

Jeff Currier commented on MESOS-695:
------------------------------------

That works for me.  I'm already working with the Mesos team here at
Twitter on this feature.

Thanks Ben!




                
> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
>                 Key: MESOS-695
>                 URL: https://issues.apache.org/jira/browse/MESOS-695
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Jeff Currier
>
> One capability that is presently missing within the Mesos framework is the 
> ability for the system to self-heal: specifically, the ability for a master 
> to detect that something is amiss with a particular host and then to attempt 
> to heal that host through a set of automated corrective actions such as:
> 1) restarting a process on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting the node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the 
> hosts within the system, it's believed that we can get Mesos to function in 
> more of a 'lights out' mode, thereby reducing the OpEx of running the 
> system today.
> It should be noted that a certain amount of coordination will be required to 
> ensure that we don't 'repair' too many nodes at the same time.  
> This logic will need to be centralized, with a central authority elected to 
> make these decisions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
