-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53239/
-----------------------------------------------------------

(Updated Dec. 19, 2016, 9:15 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Tweak test.


Bugs: MESOS-5396
    https://issues.apache.org/jira/browse/MESOS-5396


Repository: mesos


Description
-------

A retired agent ID will never attempt to re-register in the future; moreover,
any tasks/executors that were managed by that agent ID are no longer running.
We can take advantage of this knowledge to avoid waiting for
`agent_reregister_timeout` to expire after master failover.

This is particularly important when agent removal rate-limiting is in use.
Suppose a power failure causes the master to fail at the same time that many
agent hosts lose power. When power returns, the master will fail over and all
the agents will register anew and receive new agent IDs. With agent removal
rate-limiting, it may take a long time for the master to mark all the old
agent IDs as unreachable; in the meantime, explicit reconciliation will not
return any results for tasks on those agents, potentially leaving frameworks
in limbo for an extended period.

Note that we currently mark retired agents as unreachable; in the near future,
that will change to marking such agents "gone", once support for that feature
is completed.


Diffs (updated)
-----

  src/master/master.hpp 89b3c394b268a8645885412aeb19980db8d73c69
  src/master/master.cpp b664550d57ef9805bd23ea35ca7f9cd8f4b1ab78
  src/tests/slave_recovery_tests.cpp 5b86c06803c59427c826b1b7039a5156a58e141b

Diff: https://reviews.apache.org/r/53239/diff/


Testing
-------

`make check`

NOTE: The current implementation reuses `Master::markUnreachableAfterFailover`,
which means we emit misleading log messages and increment the wrong metrics.
Will adjust based on initial review comments.


Thanks,

Neil Conway
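
P.S. For reviewers who want the idea in the Description in isolation, here is a rough, standalone sketch. It is not the code in this diff: `AgentId`, `RecoveryPlan`, and `planRecovery` are hypothetical names, the set of retired IDs is simply taken as an input (how the master learns that an ID is retired is not shown), and none of the real master types are used. It only illustrates splitting the recovered agent IDs into those that can be marked unreachable right away and those that still have to wait for `agent_reregister_timeout`.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an agent ID (assumption; not the real Mesos type).
using AgentId = std::string;

// Hypothetical result type: which recovered agent IDs can be handled
// immediately and which must still wait for the re-registration timeout.
struct RecoveryPlan {
  std::vector<AgentId> markNow;         // Retired IDs: will never re-register.
  std::vector<AgentId> waitForTimeout;  // IDs that may still re-register.
};

// Splits the agent IDs recovered after failover into the two groups above.
// `retired` is assumed to be known to the caller.
RecoveryPlan planRecovery(
    const std::set<AgentId>& recovered,
    const std::set<AgentId>& retired) {
  RecoveryPlan plan;

  for (const AgentId& id : recovered) {
    if (retired.count(id) > 0) {
      // A retired ID has no running tasks/executors, so there is
      // nothing to gain by waiting for it.
      plan.markNow.push_back(id);
    } else {
      plan.waitForTimeout.push_back(id);
    }
  }

  return plan;
}
```

In this sketch, only the IDs in `waitForTimeout` would remain subject to the timeout and to removal rate-limiting; whether the retired IDs should bypass the rate limiter entirely is a design question the sketch does not answer.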