Benjamin Mahler created MESOS-2423:
--------------------------------------

             Summary: Re-use slave observer for removing slaves during master failover.
                 Key: MESOS-2423
                 URL: https://issues.apache.org/jira/browse/MESOS-2423
             Project: Mesos
          Issue Type: Improvement
          Components: master
            Reporter: Benjamin Mahler


Currently, if a strict registry is used (strict mode is disabled by default for 
now), the master uses a timeout to remove slaves that don't re-register after a 
failover. This differs from the steady state case of an unhealthy slave, where 
the slave is removed only if it fails its health checks.

As a result, the failover case is more prone to removing slaves. This was the 
reason for the high timeout (10 minutes), as well as the safety net flag 
{{\-\-recovery_slave_removal_limit}} to bail if we see too high a percentage of 
slaves not re-registering in time.
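
To make the safety net concrete, here is a minimal sketch of the kind of check 
the flag implies. It is not the master's actual code: the function name 
{{should_bail}} and its parameters are hypothetical, and the limit is assumed 
to be expressed as a percentage of the recovered slaves.

```python
def should_bail(recovered: int, reregistered: int, limit_percent: float) -> bool:
    """Return True if removing the missing slaves would exceed the limit.

    recovered      -- slaves known from the registry after failover (assumed)
    reregistered   -- slaves that re-registered before the timeout (assumed)
    limit_percent  -- hypothetical stand-in for the removal limit, in percent
    """
    if recovered == 0:
        # Nothing was recovered, so there is nothing to remove.
        return False
    missing = recovered - reregistered
    return 100.0 * missing / recovered > limit_percent
```

With 100 recovered slaves and a 50% limit, 60 missing slaves would trip the 
check, while 40 missing slaves would not.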

For example, if ZK is down during the master failover, the slaves will not 
re-register and the master would try to remove them (unless the 
{{\-\-recovery_slave_removal_limit}} is exceeded), whereas in the steady state 
case the slaves will continue to pass health checks and will not be removed.
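
The difference between the two policies can be sketched as follows. This is 
only an illustration of the reasoning above, not Mesos code: the {{Slave}} 
type, its fields, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    reregistered: bool   # has re-registered with the new master (assumed field)
    answers_pings: bool  # still responds to health checks (assumed field)


def removed_after_failover(slave: Slave, timeout_elapsed: bool) -> bool:
    # Failover policy: removal depends only on re-registration within the timeout.
    return timeout_elapsed and not slave.reregistered


def removed_in_steady_state(slave: Slave) -> bool:
    # Steady state policy: removal depends only on failing health checks.
    return not slave.answers_pings


# ZK is down: the slave is alive and reachable but cannot find the new master.
zk_down_slave = Slave(reregistered=False, answers_pings=True)
```

For {{zk_down_slave}}, the failover policy removes the slave once the timeout 
elapses, while the steady state policy keeps it.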

This is a bit tricky to change as {{SlaveInfo}} does not currently provide the 
{{PID}}, and so a recovered master does not have enough information to health 
check the slave.

It's not clear yet how this will evolve as we move towards "HTTP" APIs 
internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
