[ https://issues.apache.org/jira/browse/MESOS-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041458#comment-14041458 ]
Benjamin Mahler commented on MESOS-1503:
----------------------------------------
Glad to see you thinking this through carefully. Let's hold off on this change
until MESOS-1529 is resolved, as the ping/pong semantics may change as a result.
> Improve slave health checking to prevent rapid widespread slave removals.
> -------------------------------------------------------------------------
>
> Key: MESOS-1503
> URL: https://issues.apache.org/jira/browse/MESOS-1503
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Benjamin Mahler
> Assignee: Timothy Chen
> Labels: reliability
>
> Per some discussions with [~tweingartner] and [~vinodkone].
> Currently the master uses a SlaveObserver for each registered slave. Each
> SlaveObserver operates independently and makes decisions about whether the
> slave is healthy.
> The independence of these observers means that in some very rare events (e.g.
> masters are partitioned from 75% of slaves), the master can very rapidly
> remove a large portion of the slaves in the cluster. Ideally such an event
> could be deemed dangerous and throttled accordingly through a more
> intelligent notion of overall cluster health.
> It may be nice to have a single observer that is responsible for health
> checking all the slaves. This will allow us to make safer decisions as to
> when to determine that slaves are unhealthy.
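
To make the single-observer idea above concrete, here is a minimal sketch in Python. It is not Mesos code: the class name `ClusterHealthObserver`, the thresholds `max_missed_pings` and `max_removal_percent`, and the method names are all hypothetical and chosen only for illustration. It assumes one observer counts missed pings for every slave and caps how many slaves may be removed per check, so a partition affecting most of the cluster cannot trigger mass removals at once. The ping/pong handling is also an assumption, since MESOS-1529 may change those semantics.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterHealthObserver:
    """Hypothetical single observer for all slaves (illustrative, not Mesos code)."""

    max_missed_pings: int = 5      # assumed per-slave threshold for "unhealthy"
    max_removal_percent: int = 10  # assumed cap on removals per check, in percent
    missed: dict = field(default_factory=dict)  # slave id -> consecutive missed pings

    def register(self, slave_id):
        self.missed[slave_id] = 0

    def pong(self, slave_id):
        # A pong resets the slave's missed-ping counter.
        if slave_id in self.missed:
            self.missed[slave_id] = 0

    def ping_timeout(self, slave_id):
        # A ping that was not answered in time counts against the slave.
        if slave_id in self.missed:
            self.missed[slave_id] += 1

    def check(self):
        """Return the slaves to remove now, throttled by cluster size."""
        unhealthy = sorted(
            s for s, n in self.missed.items() if n >= self.max_missed_pings
        )
        # Always allow one removal so isolated failures are still handled.
        budget = max(1, len(self.missed) * self.max_removal_percent // 100)
        to_remove = unhealthy[:budget]
        for s in to_remove:
            del self.missed[s]
        return to_remove
```

With 20 registered slaves of which 15 stop responding, a single `check()` in this sketch removes at most 2 of them rather than all 15. The remaining unhealthy slaves stay registered until later checks, which gives a partition time to heal.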