Benjamin Mahler created MESOS-2423:
--------------------------------------
Summary: Re-use slave observer for removing slaves during master
failover.
Key: MESOS-2423
URL: https://issues.apache.org/jira/browse/MESOS-2423
Project: Mesos
Issue Type: Improvement
Components: master
Reporter: Benjamin Mahler
Currently, if a strict registry is used (this is disabled by default for now),
the master uses a timeout to remove slaves that don't re-register after a
failover. This differs from the steady state case of an unhealthy slave, where
the slave is removed only if it fails its health checks.
As a result, the failover case is more prone to removing slaves. This was the
reason for the high timeout (10 minutes), as well as the safety net flag
{{--recovery_slave_removal_limit}}, which makes the master bail if too high a
percentage of slaves fail to re-register in time.
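The safety-net check described above can be sketched roughly as follows. This
is a minimal illustration in Python, not the master's actual code: the function
name slaves_to_remove and its parameters are hypothetical, and the limit is
modeled as a fraction between 0.0 and 1.0 rather than in the flag's real format.

```python
def slaves_to_remove(recovered, reregistered, removal_limit):
    """Return the slaves to remove once the re-registration timeout elapses.

    recovered: set of slave IDs recovered from the registry (hypothetical model).
    reregistered: set of slave IDs that re-registered before the timeout.
    removal_limit: largest fraction (0.0 to 1.0) of recovered slaves that may
        be removed; stands in for --recovery_slave_removal_limit.

    Raises RuntimeError when the limit would be exceeded, modeling the
    master bailing out instead of removing the slaves.
    """
    if not recovered:
        return set()

    missing = recovered - reregistered
    fraction = len(missing) / len(recovered)

    if fraction > removal_limit:
        raise RuntimeError(
            "refusing to remove %d of %d slaves: removal limit exceeded"
            % (len(missing), len(recovered))
        )
    return missing
```

With four recovered slaves, one missing, and a limit of 0.5, only the missing
slave is returned for removal. If every slave is missing, the call raises
instead of removing anything.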
For example, if ZK is down during the master failover, the slaves will not
re-register and the master will try to remove them (unless
{{--recovery_slave_removal_limit}} is exceeded). In the steady state case, by
contrast, the slaves will continue to pass health checks and will not be
removed.
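To make the contrast concrete, here is a toy model of the two removal policies
during a ZK outage. All names are hypothetical. It assumes, as the example
above does, that slaves cannot re-register while ZK is down but would still
answer health checks from a master that already knows them.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    slave_id: str
    reregistered: bool      # did it re-register before the timeout?
    responds_to_ping: bool  # does it answer health checks?


def removed_after_failover(slaves):
    """Failover policy: remove every slave that missed the re-registration timeout."""
    return [s.slave_id for s in slaves if not s.reregistered]


def removed_in_steady_state(slaves):
    """Steady state policy: remove only slaves that fail health checks."""
    return [s.slave_id for s in slaves if not s.responds_to_ping]


# ZK is down: no slave can re-register, but all of them are alive.
slaves = [Slave("s1", False, True), Slave("s2", False, True)]

print(removed_after_failover(slaves))   # ['s1', 's2']
print(removed_in_steady_state(slaves))  # []
```

The same healthy slaves are removed under the timeout policy and kept under the
health-check policy, which is the asymmetry this ticket proposes to eliminate.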
This is a bit tricky to change as {{SlaveInfo}} does not currently provide the
{{PID}}, and so a recovered master does not have enough information to health
check the slave.
It's not clear yet how this will evolve as we move towards "HTTP" APIs
internally.