[ https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768539#comment-13768539 ]
Jeff Currier commented on MESOS-695:
------------------------------------
David,
For #1 & #2, not all of the observability components are available at the
Mesos level. We will need to gather this signal in order to solve #1 & #2
and sort out exactly what is taking place. As for what we collect, we are
looking for violations of thresholds that we define over things like HDD
SMART diagnostics and other log data, in order to determine whether a node
is healthy or suspicious.
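
A minimal sketch of the kind of threshold check described above, in
Python. The metric names, limits, and the healthy/suspicious labels as
return values are illustrative assumptions, not an existing Mesos
interface:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str       # hypothetical metric name, e.g. derived from SMART data
    max_value: float  # readings strictly above this count as a violation


def violations(readings, thresholds):
    """Return the names of metrics whose reading exceeds its threshold.

    Metrics missing from `readings` are treated as 0.0 (an assumption).
    """
    return [t.metric for t in thresholds
            if readings.get(t.metric, 0.0) > t.max_value]


def classify(readings, thresholds):
    """Label a node 'suspicious' if any threshold is violated, else 'healthy'."""
    return "suspicious" if violations(readings, thresholds) else "healthy"


# Illustrative thresholds only; real limits would be defined by operators.
THRESHOLDS = [
    Threshold("reallocated_sector_count", 10),
    Threshold("disk_error_log_lines", 100),
]
```

Under these assumptions, a node reporting 25 reallocated sectors would be
classified as suspicious, while a node with no readings above the limits
would be classified as healthy.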
For #3, #4 and #5: servermaint requires operator intervention to execute
it. I know from talking with Joseph Smith and others that there are plans
for it to perhaps run automatically, but these capabilities do not yet
exist. What's more, servermaint is a tool which runs alongside something
like Mesos. It's not something that's baked into the Mesos platform. The
intention with this item is to incorporate this as a first-class concept
within Mesos, along with a formal health model.
In terms of providing guarantees around not repairing too many nodes at
one time, we intend to add the concept of a repair coordinator, which
facilitates precisely this kind of control. It will also decide the
escalating set of repairs to use against a particular node which is in an
unhealthy state.
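
A rough sketch of how such a repair coordinator could behave, in Python.
The class name, method names, and the concurrency limit are hypothetical;
the escalation order follows the corrective actions listed in the issue
description:

```python
# Escalating repairs, mildest first, taken from the issue description.
ESCALATION = ["restart_process", "reboot", "reimage", "blacklist"]


class RepairCoordinator:
    """Hypothetical central authority that limits concurrent repairs."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_progress = {}  # node -> action currently running
        self.attempts = {}     # node -> number of failed repairs so far

    def request_repair(self, node):
        """Return the action to run on `node`, or None if it must wait."""
        if node in self.in_progress:
            return self.in_progress[node]
        if len(self.in_progress) >= self.max_concurrent:
            return None  # too many nodes are already being repaired
        # Pick the next step on the ladder, staying on the last one.
        step = min(self.attempts.get(node, 0), len(ESCALATION) - 1)
        self.in_progress[node] = ESCALATION[step]
        return ESCALATION[step]

    def complete(self, node, healthy):
        """Record the outcome; a failed repair escalates the next attempt."""
        self.in_progress.pop(node, None)
        if healthy:
            self.attempts.pop(node, None)
        else:
            self.attempts[node] = self.attempts.get(node, 0) + 1
```

With `max_concurrent=1`, a second node asking for a repair gets `None`
until the first repair completes, and a node whose restart did not help is
offered a reboot on its next request.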
Finally, most of the repairs short of re-imaging are fairly
straightforward to implement. Re-imaging is a useful repair, and
internally at Twitter we will use Wilson for this work. However, in order
to play well with the larger community we will need to provide a set of
abstractions which allow us to plug in Wilson as one possible provider
among others that we may use.
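
One possible shape for that abstraction, sketched in Python. All names here
are hypothetical, and nothing below reflects Wilson's actual interface; a
Wilson-backed provider would simply be one more subclass registered under
its own name:

```python
from abc import ABC, abstractmethod


class ReimageProvider(ABC):
    """Hypothetical interface that any re-imaging backend would implement."""

    @abstractmethod
    def reimage(self, hostname):
        """Request a re-image of `hostname`; return True if accepted."""


class DryRunProvider(ReimageProvider):
    """Stand-in backend that only records requests (for illustration)."""

    def __init__(self):
        self.requested = []

    def reimage(self, hostname):
        self.requested.append(hostname)
        return True


_PROVIDERS = {}  # provider name -> ReimageProvider instance


def register_provider(name, provider):
    """Make a backend available under `name`."""
    if not isinstance(provider, ReimageProvider):
        raise TypeError("provider must implement ReimageProvider")
    _PROVIDERS[name] = provider


def get_provider(name):
    """Look up a registered backend; raises KeyError if unknown."""
    return _PROVIDERS[name]
```

The repair logic would then depend only on `ReimageProvider`, so swapping
backends becomes a configuration choice rather than a code change.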
Hope this helps,
--Jeff--
> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
> Key: MESOS-695
> URL: https://issues.apache.org/jira/browse/MESOS-695
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Jeff Currier
>
> One capability that is presently missing within the Mesos framework is the
> ability for the system to self-heal. Specifically, the ability for a master
> to detect something is amiss with a particular host and then to attempt to
> heal that host through a set of automated corrective actions such as:
> 1) restarting process on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting node from future scheduled work
> By adding in this capability and informing schedulers of the behavior of the
> hosts within the system it's believed that we can get Mesos to function in
> more of a 'lights out' mode, thereby reducing the OpEx costs for running the
> system today.
> It should be noted that a certain amount of coordination will be required in
> order to ensure that we don't 'repair' too many nodes at the same time.
> This logic will need to be centralized, such that there is a central
> authority who is elected to make these decisions.