> On Jan. 9, 2017, 3:17 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 1368-1370
> > <https://reviews.apache.org/r/55307/diff/1/?file=1599464#file1599464line1368>
> >
> > just like "health check time out", can this be succinct? maybe
> > "re-registration time out"?
I think being a bit more verbose is warranted here -- e.g., pointing out that we previously observed the agent disconnecting, which is why we expected it to re-register.

> On Jan. 9, 2017, 3:17 a.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, lines 6105-6117
> > <https://reviews.apache.org/r/55307/diff/1/?file=1599466#file1599466line6105>
> >
> > Instead of advancing the clock for the recovery to finish, why not just
> > not advance it. That way you don't have to do the registration drops either?
> >
> > also, you only do `detector.appoint` way below; does agent even send
> > re-registration messages without that here? i guess it does if the master
> > pid didn't change?

Per offline discussion, we can't avoid advancing the clock, because we need to advance it below to cause `agent_reregister_timeout` to expire. We also need to advance the clock to trigger a master -> agent ping, which is useful: we want to verify that the agent continues to receive and respond to pings without having finished recovery.

- Neil

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/#review160859
-----------------------------------------------------------

On Jan. 7, 2017, 9:13 p.m., Neil Conway wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55307/
> -----------------------------------------------------------
>
> (Updated Jan. 7, 2017, 9:13 p.m.)
>
>
> Review request for mesos and Vinod Kone.
>
>
> Bugs: MESOS-6286
>     https://issues.apache.org/jira/browse/MESOS-6286
>
>
> Repository: mesos
>
>
> Description
> -------
>
> The master expected that if an agent responds to pings, it will
> (eventually) register or re-register.
> However, if the agent hangs during
> recovery, that assumption does not hold: the agent will continue to
> respond to pings but won't attempt to re-register until recovery
> finishes.
>
> To handle this case, the master now expects an agent to re-register
> within `agent_reregister_timeout` if the master -> agent socket breaks;
> if no re-registration is seen, the master will mark the agent
> unreachable. This is a "backup" to handle the case where recovery hangs,
> as explained above; more commonly, the agent will re-register (when it
> receives a ping and notices the master believes it is disconnected) or
> be marked unreachable because it fails to respond to pings.
>
>
> Diffs
> -----
>
>   docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d
>   src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9
>   src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0
>   src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f
>   src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19
>   src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01
>   src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff
>
> Diff: https://reviews.apache.org/r/55307/diff/
>
>
> Testing
> -------
>
> `make check`
>
> Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.
>
>
> Thanks,
>
> Neil Conway
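
For readers skimming the thread, here is a rough, standalone sketch of the behavior the description outlines. It is not the code from this patch: `ToyMaster`, `AgentState`, and every method name are hypothetical, time is a plain integer count of seconds rather than the libprocess clock, and the constructor argument only stands in for `agent_reregister_timeout`. It assumes the simplest possible model: a broken master -> agent socket starts a deadline, answering pings does not cancel it, and only a re-registration does.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model only: none of these names exist in Mesos.
enum class AgentState { REGISTERED, DISCONNECTED, UNREACHABLE };

class ToyMaster {
public:
  // `reregisterTimeoutSecs` stands in for `agent_reregister_timeout`.
  explicit ToyMaster(int reregisterTimeoutSecs)
    : timeout(reregisterTimeoutSecs) {
    assert(timeout > 0);
  }

  void addAgent(const std::string& id) {
    agents[id] = Agent{AgentState::REGISTERED, 0};
  }

  // The master -> agent socket broke: start the re-registration deadline.
  void socketBroken(const std::string& id, int now) {
    Agent& agent = agents.at(id);
    if (agent.state == AgentState::REGISTERED) {
      agent.state = AgentState::DISCONNECTED;
      agent.deadline = now + timeout;
    }
  }

  // A ping reply alone does not cancel the deadline: an agent stuck in
  // recovery keeps answering pings without ever re-registering.
  void pongReceived(const std::string& id) {
    (void)agents.at(id); // Intentionally leaves the agent's state unchanged.
  }

  // In this toy model, only a disconnected agent can re-register.
  void reregistered(const std::string& id) {
    Agent& agent = agents.at(id);
    if (agent.state == AgentState::DISCONNECTED) {
      agent.state = AgentState::REGISTERED;
    }
  }

  // Move the manual clock forward and expire any overdue deadlines.
  void advanceTo(int now) {
    for (auto& entry : agents) {
      Agent& agent = entry.second;
      if (agent.state == AgentState::DISCONNECTED && now >= agent.deadline) {
        agent.state = AgentState::UNREACHABLE;
      }
    }
  }

  AgentState state(const std::string& id) const {
    return agents.at(id).state;
  }

private:
  struct Agent {
    AgentState state;
    int deadline; // Only meaningful while DISCONNECTED.
  };

  int timeout;
  std::map<std::string, Agent> agents;
};
```

This also shows why the test has to advance the clock: in the sketch, nothing marks the agent unreachable until `advanceTo` passes the deadline, however many pings the agent answers in the meantime.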
