----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/54909/#review159945 -----------------------------------------------------------
Fix it, then Ship it! I'm going to add this comment block above each of the four `Clock::cancel`s. ``` // Cancel the pending registration timer to avoid spurious attempts // at reregistration. `Clock::cancel` is idempotent, so this call // is safe even if no timer is active or pending. ``` src/sched/sched.cpp (lines 335 - 339) <https://reviews.apache.org/r/54909/#comment230998> We should move the `Clock::cancel` down into the `if (master.isSome())` block, as that is where the registration loop is started. src/slave/slave.hpp (line 799) <https://reviews.apache.org/r/54909/#comment230999> Nit: We standardized how we write `reregistered` in comments a while ago. No hyphens anymore :) src/slave/slave.cpp (lines 923 - 924) <https://reviews.apache.org/r/54909/#comment230997> We also need to cancel the agent registration timer here (as mentioned in the commit description) - Joseph Wu On Dec. 20, 2016, 7:09 p.m., Alex Clemmer wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/54909/ > ----------------------------------------------------------- > > (Updated Dec. 20, 2016, 7:09 p.m.) > > > Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, > and Joseph Wu. > > > Bugs: MESOS-6803 > https://issues.apache.org/jira/browse/MESOS-6803 > > > Repository: mesos > > > Description > ------- > > This issue appears when `HAS_AUTHENTICATION` is not defined. This commit > will resolve the issue, and fix tests that surfaced it (as well as some > associated errors that cause them to be flaky when this symbol is not > defined). > > Currently when a new master is detected and no credential is provided, > the agent and framework (which have very similar registration code > paths) will attempt to (re)register after some random initial `delay`, > to avoid a "thundering herd" problem. It is hence possible to have > spurious double-(re)registrations, since a new master could be detected > after we add the `delay`d registration, but before we execute it. > > In a degenerate case, suppose a single agent has a registration delay > of one minute. A master is brought up, to which, the agent successfully > registers. Prior to this commit, the agent will still have a scheduled > registration loop (`doReliableRegistration`) with a backoff factor. > If the master goes down and a new master is brought up, the agent > will race against itself (two ongoing loops of `doReliableRegistration`) > to register with the new master. If the first loop reaches the new > master first, authentication will fail and cause the agent to commit > suicide. > > To resolve this problem, we store the value of the `Timer` returned by > `delay` in `doReliableRegistration` and cancel it when we have either > registered, or need to start a new cycle of registration. This solution > is implemented in both the framework code and the agent code. > > > Diffs > ----- > > src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4 > src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 > src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 > src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963 > src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 > > Diff: https://reviews.apache.org/r/54909/diff/ > > > Testing > ------- > > `make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` > to catch intermittent failures, which is how we caught the failing test in > `reservation_tests.cpp`. Note that this bug was discovered when we added a > `delay` to the call to `authenticate` in `slave::detected` (in order to get > it to match the behavior of the non-authenticated call to > `doReliableRegistration`. > > > Thanks, > > Alex Clemmer > >
