[ https://issues.apache.org/jira/browse/MESOS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768508#comment-13768508 ]
David Robinson commented on MESOS-695:
--------------------------------------
TBH I'm not sure this really belongs in Mesos. The questions I'd ask are:
1) How do you define "something is amiss"?
2) How do you detect that "something is amiss"?
3) How do you know what the correct action to take is (restart a process vs.
reimage a host)?
4) How do you know what number of hosts to repair is too many?
5) How do you repair hosts?
6) How do you reimage hosts?
Twitter have tools for all of these tasks already (1 and 2 are covered by our
observability team, and 3, 4 and 5 would be covered by an internal tool called
servermaint).
I suspect that if you try to solve these problems from within Mesos, you'll
reinvent a lot of wheels and alienate a lot of people. Most people using Mesos
would already have an observability stack (so could answer questions 1 and 2).
Questions 3, 4 and 5 are business logic, and most people would already have a
provisioning system (question 6).
What you need to solve the problem can be implemented without any changes to
Mesos core. Rather than adding this to Mesos core, you'd be better off building
something on top, e.g. a separate tool that detects "something is amiss"
(an observability stack) and takes corrective action. Essentially, what is
wanted here is something like servermaint.
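
To make the "build it on top" idea concrete, here is a minimal sketch of the
decision logic such an external tool might hold. It is not servermaint (whose
internals aren't described here) and it uses no Mesos API; RepairCoordinator,
Action, max_concurrent and the host names are all hypothetical. It assumes the
observability stack has already flagged a host (questions 1 and 2) and only
covers questions 3 and 4: pick the next corrective action by escalating through
the list from the issue description, and refuse to repair too many hosts at
once.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Action(IntEnum):
    """Corrective actions from the issue description, mildest first."""
    RESTART_PROCESS = 1
    REBOOT = 2
    REIMAGE = 3
    BLACKLIST = 4


@dataclass
class RepairCoordinator:
    """Hypothetical external coordinator; an assumption, not part of Mesos."""
    max_concurrent: int                              # question 4: how many is too many
    in_repair: dict = field(default_factory=dict)    # host -> action in progress
    last_action: dict = field(default_factory=dict)  # host -> last action tried

    def next_action(self, host):
        """Question 3: escalate one step beyond the last action tried."""
        previous = self.last_action.get(host)
        if previous is None:
            return Action.RESTART_PROCESS
        return Action(min(previous + 1, Action.BLACKLIST))

    def request_repair(self, host):
        """Return the action to run, or None if the repair has to wait."""
        if host in self.in_repair:
            return None  # this host is already being repaired
        if len(self.in_repair) >= self.max_concurrent:
            return None  # cap reached; defer until a slot frees up
        action = self.next_action(host)
        self.in_repair[host] = action
        self.last_action[host] = action
        return action

    def finish_repair(self, host, healthy):
        """Free the slot; forget the history once the host is healthy again."""
        self.in_repair.pop(host, None)
        if healthy:
            self.last_action.pop(host, None)


coordinator = RepairCoordinator(max_concurrent=1)
first = coordinator.request_repair("host-a")    # mildest action first
blocked = coordinator.request_repair("host-b")  # deferred: cap of 1 reached
coordinator.finish_repair("host-a", healthy=False)
second = coordinator.request_repair("host-a")   # escalates one step
```

How the chosen action is actually carried out (questions 5 and 6) would be
delegated to whatever repair and provisioning systems a site already runs.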
> Introduce automated self-healing and coordinated repair to Mesos
> ----------------------------------------------------------------
>
> Key: MESOS-695
> URL: https://issues.apache.org/jira/browse/MESOS-695
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Jeff Currier
>
> One capability that is presently missing within the Mesos framework is the
> ability for the system to self-heal. Specifically, the ability for a master
> to detect something is amiss with a particular host and then to attempt to
> heal that host through a set of automated corrective actions such as:
> 1) restarting a process on the suspect node
> 2) rebooting the node
> 3) reimaging the node
> 4) blacklisting the node from future scheduled work
> By adding this capability and informing schedulers of the behavior of the
> hosts within the system, it's believed that we can get Mesos to function in
> more of a 'lights out' mode, thereby reducing the OpEx costs for running the
> system today.
> It should be noted that a certain amount of coordination will be required in
> order to ensure that we don't 'repair' too many nodes at the same time.
> This logic will need to be centralized, with a central authority elected to
> make these decisions.