-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/
-----------------------------------------------------------
(Updated Dec. 21, 2016, 3:09 a.m.)
Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich,
and Joseph Wu.
Changes
-------
Let's fix the spurious framework registration bug while we're at it! :)
Summary (updated)
-----------------
Fixed spurious registration bug in framework and agent.
Bugs: MESOS-6803
https://issues.apache.org/jira/browse/MESOS-6803
Repository: mesos
Description (updated)
-------
This issue appears when `HAS_AUTHENTICATION` is not defined. This commit
will resolve the issue, and fix tests that surfaced it (as well as some
associated errors that cause them to be flaky when this symbol is not
defined).
Currently when a new master is detected and no credential is provided,
the agent and framework (which have very similar registration code
paths) will attempt to (re)register after some random initial `delay`,
to avoid a "thundering herd" problem. It is hence possible to have
spurious double-(re)registrations, since a new master could be detected
after we add the `delay`d registration, but before we execute it.
In a degenerate case, suppose a single agent has a registration delay
of one minute. A master is brought up, to which, the agent successfully
registers. Prior to this commit, the agent will still have a scheduled
registration loop (`doReliableRegistration`) with a backoff factor.
If the master goes down and a new master is brought up, the agent
will race against itself (two ongoing loops of `doReliableRegistration`)
to register with the new master. If the first loop reaches the new
master first, authentication will fail and cause the agent to commit
suicide.
To resolve this problem, we store the value of the `Timer` returned by
`delay` in `doReliableRegistration` and cancel it when we have either
registered, or need to start a new cycle of registration. This solution
is implemented in both the framework code and the agent code.
Diffs (updated)
-----
src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4
src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79
src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18
src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963
src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71
Diff: https://reviews.apache.org/r/54909/diff/
Testing
-------
`make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` to
catch intermittent failures, which is how we caught the failing test in
`reservation_tests.cpp`. Note that this bug was discovered when we added a
`delay` to the call to `authenticate` in `slave::detected` (in order to get it
to match the behavior of the non-authenticated call to `doReliableRegistration`.
Thanks,
Alex Clemmer