-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/
-----------------------------------------------------------

(Updated Dec. 20, 2016, 3:18 p.m.)


Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, 
and Joseph Wu.


Changes
-------

Discussed this case with Alex and fleshed out the commit message.  There is a 
potential failure case that this timer + cancel logic will prevent -- Joseph Wu.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description (updated)
-------

Currently, when a new master is detected and no credential is provided,
the agent attempts to (re)register after a random initial `delay`, to
avoid a "thundering herd" problem.  Spurious double (re)registrations
are therefore possible, since a new master could be detected after we
schedule the `delay`ed registration, but before we execute it.

In a degenerate case, suppose a single agent has a registration delay
of one minute.  A master is brought up, and the agent successfully
registers with it.  Prior to this commit, the agent will still have a
scheduled registration loop (`doReliableRegistration`) with a backoff
factor.  If the master goes down and a new master is brought up, the
agent will race against itself (two ongoing loops of
`doReliableRegistration`) to register with the new master.  If the
first (stale) loop reaches the new master before the second one,
authentication will fail and cause the agent to commit suicide.

To resolve this problem, we store the `Timer` returned by the call to
`delay` in `doReliableRegistration` and cancel it when we have either
registered or need to start a new cycle of registration.
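
To illustrate the idea outside of Mesos, here is a minimal,
self-contained sketch of the "store the timer, cancel it later" pattern.
It does not use libprocess: `ToyClock`, `ToyAgent`, and `runScenario`
are hypothetical names invented for this example, time is simulated,
and a "registration attempt" is just a counter increment.  It only
approximates the scenario described above and is not the code in this
patch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Toy timer queue with simulated time. A stand-in for illustration only;
// this is NOT the libprocess `delay` / `Timer` API.
class ToyClock {
public:
  using TimerId = std::uint64_t;  // Ids start at 1; 0 means "no timer".

  // Schedule `fn` to run `delayMs` after the current simulated time.
  TimerId delay(std::uint64_t delayMs, std::function<void()> fn) {
    assert(static_cast<bool>(fn));
    TimerId id = nextId_++;
    pending_.emplace(id, Entry{now_ + delayMs, std::move(fn)});
    return id;
  }

  // Drop a pending timer. Returns false if it already fired, was already
  // cancelled, or never existed (including id 0).
  bool cancel(TimerId id) { return pending_.erase(id) > 0; }

  // Advance simulated time and fire every timer that is now due.
  // Due timers fire in id order, which is good enough for this sketch.
  void advance(std::uint64_t ms) {
    now_ += ms;
    bool fired = true;
    while (fired) {
      fired = false;
      for (auto it = pending_.begin(); it != pending_.end(); ++it) {
        if (it->second.due <= now_) {
          std::function<void()> fn = std::move(it->second.fn);
          pending_.erase(it);
          fn();
          fired = true;
          break;  // Iterator is invalid after erase; rescan from the start.
        }
      }
    }
  }

private:
  struct Entry {
    std::uint64_t due;
    std::function<void()> fn;
  };

  std::uint64_t now_ = 0;
  TimerId nextId_ = 1;
  std::map<TimerId, Entry> pending_;
};

// Toy agent: each fired timer counts as one registration attempt.
class ToyAgent {
public:
  ToyAgent(ToyClock& clock, bool cancelStale)
    : clock_(clock), cancelStale_(cancelStale) {}

  // A new master was detected: start a new registration cycle.
  void detected(std::uint64_t delayMs) {
    if (cancelStale_) clock_.cancel(timer_);  // Drop any stale timer.
    timer_ = clock_.delay(delayMs, [this]() { ++attempts_; });
  }

  // Registration succeeded: the pending timer is no longer needed.
  void registered() {
    if (cancelStale_) clock_.cancel(timer_);
  }

  int attempts() const { return attempts_; }

private:
  ToyClock& clock_;
  bool cancelStale_;
  ToyClock::TimerId timer_ = 0;
  int attempts_ = 0;
};

// Replays an approximation of the degenerate case: the agent registers
// before its delayed attempt fires, then a second master is detected.
// Returns how many delayed attempts eventually run.
int runScenario(bool cancelStale) {
  ToyClock clock;
  ToyAgent agent(clock, cancelStale);
  agent.detected(60000);   // First master detected; attempt due in 1 min.
  clock.advance(10000);
  agent.registered();      // Assumed: registration completes early.
  clock.advance(10000);
  agent.detected(60000);   // New master detected; second timer scheduled.
  clock.advance(120000);   // Let everything that is still pending fire.
  return agent.attempts();
}
```

In this toy model, `runScenario(false)` lets both delayed attempts run
(the stale one and the new one), while `runScenario(true)` runs only the
attempt scheduled for the new master.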


Diffs
-----

  src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 
  src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 
  src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 

Diff: https://reviews.apache.org/r/54909/diff/


Testing
-------

Ran `make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure`
to catch intermittent failures, which is how we caught the failing test in 
`reservation_tests.cpp`. Note that this bug was discovered when we added a 
`delay` to the call to `authenticate` in `slave::detected` (in order to get it 
to match the behavior of the non-authenticated call to `doReliableRegistration`).


Thanks,

Alex Clemmer
