[ 
https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814507#comment-13814507
 ] 

Jeff Currier commented on MESOS-695:
------------------------------------

Hey all,

I wanted to circle back on this.  I've spoken with many of the Twitter SREs 
now and have gotten agreement that this proposal makes sense.  Given this, I 
will start to break this work up into smaller, manageable chunks that are 
easier for the community to digest.

--Jeff--

> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
>                 Key: MESOS-695
>                 URL: https://issues.apache.org/jira/browse/MESOS-695
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Jeff Currier
>
> One capability that is presently missing from the Mesos framework is the 
> ability for the system to self-heal: specifically, the ability for a master 
> to detect that something is amiss with a particular host and then attempt to 
> heal that host through a set of automated corrective actions such as:
> 1) restarting processes on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting the node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the 
> hosts within the system, it's believed that we can get Mesos to function in 
> more of a 'lights out' mode, thereby reducing the OpEx costs of running the 
> system today.
> It should be noted that a certain amount of coordination will be required 
> to ensure that we don't 'repair' too many nodes at the same time.  
> This logic will need to be centralized, with a central authority elected 
> to make these decisions.
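
The escalation order and the cap on simultaneous repairs described above 
could be sketched roughly as follows.  This is only an illustration, not 
Mesos code: RepairCoordinator, RepairAction, max_concurrent, and the method 
names are all hypothetical, and the election of the central authority is not 
modeled (a single in-process coordinator is assumed).

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class RepairAction(IntEnum):
    """Corrective actions from the issue, least to most disruptive."""
    RESTART_PROCESS = 1
    REBOOT = 2
    REIMAGE = 3
    BLACKLIST = 4


@dataclass
class RepairCoordinator:
    """Hypothetical central authority that caps concurrent repairs."""
    max_concurrent: int
    in_progress: set = field(default_factory=set)
    last_action: dict = field(default_factory=dict)

    def request_repair(self, host: str) -> Optional[RepairAction]:
        """Return the next action for host, or None if it must wait."""
        if host in self.in_progress:
            return None  # a repair is already running on this host
        if len(self.in_progress) >= self.max_concurrent:
            return None  # too many nodes are being repaired right now
        previous = self.last_action.get(host)
        if previous is None:
            action = RepairAction.RESTART_PROCESS
        else:
            # Escalate one step, stopping at the final rung (blacklist).
            action = RepairAction(min(previous + 1, RepairAction.BLACKLIST))
        self.last_action[host] = action
        self.in_progress.add(host)
        return action

    def complete(self, host: str, healthy: bool) -> None:
        """Free the repair slot; reset escalation if the host recovered."""
        self.in_progress.discard(host)
        if healthy:
            self.last_action.pop(host, None)
```

With max_concurrent=1, a second host that asks for repair while the first is 
still being worked on gets None and has to retry later, which is the "don't 
repair too many nodes at once" constraint in its simplest form.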



--
This message was sent by Atlassian JIRA
(v6.1#6144)
