-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/
-----------------------------------------------------------

(Updated Dec. 20, 2016, 3:18 p.m.)


Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, and Joseph Wu.


Changes
-------

Discussed this case with Alex and fleshed out the commit message. There is a potential failure case that this timer + cancel logic will prevent -- Joseph Wu.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description (updated)
-------

Currently, when a new master is detected and no credential is provided, the agent attempts to (re)register after a random initial `delay`, to avoid a "thundering herd" problem. Spurious double (re)registrations are therefore possible: a new master can be detected after we schedule the `delay`ed registration, but before we execute it.

In a degenerate case, suppose a single agent has a registration delay of one minute. A master is brought up, and the agent successfully registers with it. Prior to this commit, the agent still has a scheduled registration loop (`doReliableRegistration`) with a backoff factor. If the master goes down and a new master is brought up, the agent will race against itself (two ongoing loops of `doReliableRegistration`) to register with the new master. If the first loop reaches the new master first, authentication will fail and cause the agent to commit suicide.

To resolve this problem, we store the `Timer` returned by `delay` in `doReliableRegistration` and cancel it when we have either registered or need to start a new cycle of registration.


Diffs
-----

  src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79
  src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18
  src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71

Diff: https://reviews.apache.org/r/54909/diff/


Testing
-------

Ran `make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` to catch intermittent failures, which is how we caught the failing test in `reservation_tests.cpp`.
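
For reviewers who want to see the timer + cancel idea from the description in isolation, here is a minimal standalone sketch. It does not use libprocess and is not the code in this patch: `Scheduler`, `Agent`, `registrationLoopsStarted`, and the fake clock are all hypothetical stand-ins, and the "registration loop" is reduced to a counter. It only shows that remembering the timer handle and cancelling it on re-detection leaves one pending registration instead of two.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical timer service driven by a fake clock (NOT the libprocess API).
class Scheduler {
public:
  using TimerId = std::uint64_t;

  // Schedule `fn` to run once the fake clock reaches now + delaySecs.
  TimerId delay(int delaySecs, std::function<void()> fn) {
    TimerId id = nextId_++;
    pending_.emplace(id, Entry{now_ + delaySecs, std::move(fn)});
    return id;
  }

  // Cancel a pending timer. Returns false if it is unknown, already fired,
  // or already cancelled.
  bool cancel(TimerId id) { return pending_.erase(id) > 0; }

  // Advance the fake clock and fire every timer that is now due.
  void advance(int secs) {
    assert(secs >= 0);
    now_ += secs;
    for (;;) {
      auto it = pending_.begin();
      while (it != pending_.end() && it->second.due > now_) ++it;
      if (it == pending_.end()) break;
      std::function<void()> fn = std::move(it->second.fn);
      pending_.erase(it);
      fn();
    }
  }

private:
  struct Entry { int due; std::function<void()> fn; };
  std::map<TimerId, Entry> pending_;
  int now_ = 0;
  TimerId nextId_ = 1;  // 0 is reserved to mean "no timer"
};

// Hypothetical agent: each detection schedules a delayed registration loop.
class Agent {
public:
  Agent(Scheduler& scheduler, bool cancelStale)
    : scheduler_(scheduler), cancelStale_(cancelStale) {}

  // Called whenever a (new) master is detected.
  void detected(int delaySecs) {
    // The fix: drop any registration still waiting from an earlier detection.
    if (cancelStale_) scheduler_.cancel(timer_);
    timer_ = scheduler_.delay(delaySecs, [this] { ++loopsStarted; });
  }

  int loopsStarted = 0;  // how many registration loops actually began

private:
  Scheduler& scheduler_;
  bool cancelStale_;
  Scheduler::TimerId timer_ = 0;
};

// Two masters are detected 10 seconds apart, each with a 60 second delay.
int registrationLoopsStarted(bool cancelStale) {
  Scheduler scheduler;
  Agent agent(scheduler, cancelStale);
  agent.detected(60);      // first master detected
  scheduler.advance(10);   // failover happens before the first timer fires
  agent.detected(60);      // second master detected
  scheduler.advance(120);  // let every remaining timer fire
  return agent.loopsStarted;
}
```

Calling `registrationLoopsStarted(false)` models the behavior before the change (both delayed registrations fire), while `registrationLoopsStarted(true)` models the behavior with the stored timer cancelled on re-detection.
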

Note that this bug was discovered when we added a `delay` to the call to `authenticate` in `slave::detected` (in order to get it to match the behavior of the non-authenticated call to `doReliableRegistration`).


Thanks,

Alex Clemmer