Benjamin Mahler created MESOS-1503:
--------------------------------------

             Summary: Improve slave health checking to prevent rapid widespread slave removals.
                 Key: MESOS-1503
                 URL: https://issues.apache.org/jira/browse/MESOS-1503
             Project: Mesos
          Issue Type: Improvement
          Components: master
            Reporter: Benjamin Mahler


Per some discussions with [~tweingartner] and [~vinodkone].

Currently the master uses a SlaveObserver for each registered slave. Each 
SlaveObserver operates independently and decides on its own whether its 
slave is healthy.

Because these observers are independent, in some very rare events (e.g. the 
masters are partitioned from 75% of the slaves) the master can very rapidly 
remove a large portion of the slaves in the cluster. Ideally such an event 
would be deemed dangerous and the removals throttled accordingly, based on a 
more intelligent notion of overall cluster health.

It may be nice to have a single observer that is responsible for health 
checking all the slaves. This would allow us to make safer decisions about 
when to declare slaves unhealthy.
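
To make the idea concrete, here is a rough sketch of what a single 
cluster-wide observer might look like. This is not Mesos code: the class 
name, method names, and both thresholds are hypothetical and chosen only for 
illustration. As a simplification it holds back all removals while too large 
a fraction of the cluster looks unhealthy, rather than rate-limiting them.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterObserver:
    """Hypothetical single observer for all slaves (illustrative only)."""

    max_missed_pings: int = 5            # assumed per-slave threshold
    max_unhealthy_fraction: float = 0.1  # assumed cluster-wide safety limit
    missed: dict = field(default_factory=dict)  # slave id -> consecutive misses

    def add_slave(self, slave_id):
        self.missed[slave_id] = 0

    def record_ping(self, slave_id, ok):
        # A successful ping resets the counter; a missed ping increments it.
        self.missed[slave_id] = 0 if ok else self.missed.get(slave_id, 0) + 1

    def slaves_to_remove(self):
        if not self.missed:
            return []
        unhealthy = sorted(
            s for s, n in self.missed.items() if n >= self.max_missed_pings
        )
        # If too much of the cluster looks unhealthy at once, a partition is
        # more likely than many simultaneous failures, so remove nothing yet.
        if len(unhealthy) / len(self.missed) > self.max_unhealthy_fraction:
            return []
        return unhealthy
```

With per-slave observers, each unhealthy slave would be removed regardless of 
how many others are failing at the same time. Here the decision is made with 
the whole cluster in view, so a widespread outage can be treated differently 
from an isolated one.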



--
This message was sent by Atlassian JIRA
(v6.2#6252)
