-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53239/#review155028
-----------------------------------------------------------


src/master/master.cpp (lines 5144 - 5148)
<https://reviews.apache.org/r/53239/#comment224855>

    If the slave thinks an ID needs to be retired, I think it makes sense for the master to send a shutdown to that slave and remove it? If we just ignore it how is the situation going to resolve?


src/master/master.cpp (line 5165)
<https://reviews.apache.org/r/53239/#comment224859>

    why are we marking it unreachable instead of removing it directly?


src/master/master.cpp (line 5170)
<https://reviews.apache.org/r/53239/#comment224857>

    Retirement


src/master/master.cpp (line 5187)
<https://reviews.apache.org/r/53239/#comment224856>

    what's the plan for including old agent ids when the agent doesn't reboot?

    also, should we rename it to "old_slave_ids" instead of "retired_slave_ids" ?


- Vinod Kone


On Oct. 27, 2016, 10:46 p.m., Neil Conway wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53239/
> -----------------------------------------------------------
>
> (Updated Oct. 27, 2016, 10:46 p.m.)
>
>
> Review request for mesos and Vinod Kone.
>
>
> Bugs: MESOS-5396
>     https://issues.apache.org/jira/browse/MESOS-5396
>
>
> Repository: mesos
>
>
> Description
> -------
>
> A retired agent ID will never attempt to re-register in the future;
> moreover, any tasks/executors being managed by that agent ID are no
> longer running. We can take advantage of this knowledge to avoid waiting
> for `agent_reregister_timeout` to expire after master failover.
>
> This is particularly important when agent removal rate-limiting is in
> use: if a power failure causes the master to fail at the same time that
> many agent hosts lose power, when power returns the master will failover
> and all the agents will register anew and receive new agent IDs.
> With agent removal rate-limiting, it may take a long time for the master to
> mark all the old agent IDs as unreachable; in the meantime, explicit
> reconciliation will not return any results, potentially leaving
> frameworks in limbo for an extended period.
>
> Note that we currently mark retired agents as unreachable; in the near
> future, that will change to marking such agents "gone", once support for
> that feature is completed.
>
>
> Diffs
> -----
>
>   src/master/master.hpp 87186c6e733a686f96528b1722fda1c287e9c881
>   src/master/master.cpp 8692726d21812827f9e1fd9093d80fd260588ecb
>   src/tests/slave_recovery_tests.cpp 65fc18bc2732dc53581d39ee23368e347f0b2ca4
>
> Diff: https://reviews.apache.org/r/53239/diff/
>
>
> Testing
> -------
>
> `make check`
>
> NOTE: Current implementation reuses `Master::markUnreachableAfterFailover`,
> which means we emit misleading log messages and increment the wrong metrics.
> Will adjust based on initial review comments.
>
>
> Thanks,
>
> Neil Conway
>
>
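
For readers following the Description, the idea of short-circuiting the re-registration wait might be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the patch: `MasterSketch`, `registerAgent`, and the three sets are hypothetical names, agent IDs are plain strings, and rate-limiting, the registry, and metrics are all left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for the master's state after a
// failover. None of these names are taken from the Mesos code base.
struct MasterSketch {
  // Agent IDs recovered from the registry that have not re-registered yet.
  std::set<std::string> recovered;
  // Agent IDs that have been marked unreachable.
  std::set<std::string> unreachable;
  // Agent IDs that are currently registered.
  std::set<std::string> registered;

  // Called when an agent registers under a new ID and reports the IDs it has
  // retired. Any retired ID still in `recovered` is marked unreachable right
  // away instead of waiting for the re-registration timeout.
  // Returns the number of IDs that were marked.
  std::size_t registerAgent(const std::string& newId,
                            const std::vector<std::string>& retiredIds) {
    assert(!newId.empty());

    std::size_t marked = 0;
    for (const std::string& id : retiredIds) {
      // Only act on IDs the master is actually still waiting for.
      if (recovered.erase(id) > 0) {
        unreachable.insert(id);
        ++marked;
      }
    }

    registered.insert(newId);
    return marked;
  }
};
```

In this sketch, retired IDs that the master does not know about are silently skipped; whether the real master should ignore them or shut the agent down is exactly the question raised in the first review comment above.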
