-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/#review159945
-----------------------------------------------------------


Fix it, then Ship it!




I'm going to add this comment block above each of the four `Clock::cancel`s.
```
// Cancel the pending registration timer to avoid spurious attempts
// at reregistration. `Clock::cancel` is idempotent, so this call
// is safe even if no timer is active or pending.
```


src/sched/sched.cpp (lines 335 - 339)
<https://reviews.apache.org/r/54909/#comment230998>

    We should move the `Clock::cancel` down into the `if (master.isSome())` 
block, as that is where the registration loop is started.



src/slave/slave.hpp (line 799)
<https://reviews.apache.org/r/54909/#comment230999>

    Nit: We standardized how we write `reregistered` in comments a while ago.  
No hyphens anymore :)



src/slave/slave.cpp (lines 923 - 924)
<https://reviews.apache.org/r/54909/#comment230997>

    We also need to cancel the agent registration timer here (as mentioned in 
the commit description)


- Joseph Wu


On Dec. 20, 2016, 7:09 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54909/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 7:09 p.m.)
> 
> 
> Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, 
> and Joseph Wu.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This issue appears when `HAS_AUTHENTICATION` is not defined. This commit
> will resolve the issue, and fix tests that surfaced it (as well as some
> associated errors that cause them to be flaky when this symbol is not
> defined).
> 
> Currently when a new master is detected and no credential is provided,
> the agent and framework (which have very similar registration code
> paths) will attempt to (re)register after some random initial `delay`,
> to avoid a "thundering herd" problem.  It is hence possible to have
> spurious double-(re)registrations, since a new master could be detected
> after we add the `delay`d registration, but before we execute it.
> 
> In a degenerate case, suppose a single agent has a registration delay
> of one minute.  A master is brought up, to which, the agent successfully
> registers.  Prior to this commit, the agent will still have a scheduled
> registration loop (`doReliableRegistration`) with a backoff factor.
> If the master goes down and a new master is brought up, the agent
> will race against itself (two ongoing loops of `doReliableRegistration`)
> to register with the new master.  If the first loop reaches the new
> master first, authentication will fail and cause the agent to commit
> suicide.
> 
> To resolve this problem, we store the value of the `Timer` returned by
> `delay` in `doReliableRegistration` and cancel it when we have either
> registered, or need to start a new cycle of registration. This solution
> is implemented in both the framework code and the agent code.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4 
>   src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 
>   src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 
>   src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963 
>   src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 
> 
> Diff: https://reviews.apache.org/r/54909/diff/
> 
> 
> Testing
> -------
> 
> `make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` 
> to catch intermittent failures, which is how we caught the failing test in 
> `reservation_tests.cpp`. Note that this bug was discovered when we added a 
> `delay` to the call to `authenticate` in `slave::detected` (in order to get 
> it to match the behavior of the non-authenticated call to 
> `doReliableRegistration`.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>

Reply via email to