-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51653/#review148619
-----------------------------------------------------------

Fix it, then Ship it!


src/master/master.cpp (line 5835)
<https://reviews.apache.org/r/51653/#comment216077>

    s/WARNING/INFO/ because this is expected?


src/tests/partition_tests.cpp (lines 1983 - 1984)
<https://reviews.apache.org/r/51653/#comment216078>

    I wonder if it confuses users that there are 2 slave unreachable operations scheduled but only 1 slave got removed.


- Vinod Kone


On Sept. 12, 2016, 4:01 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51653/
> -----------------------------------------------------------
> 
> (Updated Sept. 12, 2016, 4:01 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-5965
>     https://issues.apache.org/jira/browse/MESOS-5965
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Now that we wait for the agent to be removed from the registry before
> stopping the SlaveObserver, it is possible for an agent to fail health
> checks multiple times if the registry operation takes longer than
> `agent_ping_timeout`.
> 
> This commit updates the master logic to handle this by ignoring health
> check failures while the registry operation to mark the agent
> unreachable is still in progress.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 1dcce6cd66804990af238176c61aca03bb5c9471
>   src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00
> 
> Diff: https://reviews.apache.org/r/51653/diff/
> 
> 
> Testing
> -------
> 
> make check on OSX and Linux.
> 
> `./src/mesos-tests
> --gtest_filter="Strict/PartitionTest.FailHealthChecksTwice/0"
> --gtest_repeat=1000 --gtest_break_on_failure`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
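
The behaviour described in the request can be illustrated with a minimal C++ sketch. In that behaviour, repeated health check failures are ignored while the registry operation that marks the agent unreachable is still in flight. Everything below is hypothetical and is not the Mesos master code: `FakeRegistry`, `Master::markUnreachable`, and the counters are illustrative stand-ins. The fake registry simply queues completion callbacks until `flush()` is called, which models a registry write that takes longer than `agent_ping_timeout`.

```cpp
// Hypothetical sketch only: these are not Mesos APIs.
#include <cassert>
#include <cstddef>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stand-in for an asynchronous registry. Operations stay pending until
// flush() runs them, mimicking a slow registry write.
class FakeRegistry {
public:
  void apply(std::function<void()> onComplete) {
    pending_.push_back(std::move(onComplete));
  }

  // Complete every queued operation in FIFO order.
  void flush() {
    std::vector<std::function<void()>> ops;
    ops.swap(pending_);
    for (auto& op : ops) {
      op();
    }
  }

  std::size_t pendingCount() const { return pending_.size(); }

private:
  std::vector<std::function<void()>> pending_;
};

class Master {
public:
  explicit Master(FakeRegistry& registry) : registry_(registry) {}

  // Hypothetical handler invoked each time an agent fails a health check.
  void markUnreachable(const std::string& agentId) {
    if (unreachable_.count(agentId) > 0) {
      return;  // Agent has already been marked unreachable.
    }

    if (markingUnreachable_.count(agentId) > 0) {
      // The registry operation for this agent has not finished yet, so a
      // repeated failure is expected and is ignored instead of starting
      // a second operation.
      std::cout << "Ignoring health check failure of agent " << agentId
                << ": marking it unreachable is already in progress\n";
      ++ignoredFailures;
      return;
    }

    markingUnreachable_.insert(agentId);
    ++registryOps;
    registry_.apply([this, agentId]() {
      assert(markingUnreachable_.count(agentId) == 1);
      markingUnreachable_.erase(agentId);
      unreachable_.insert(agentId);
      ++removedAgents;
    });
  }

  int registryOps = 0;
  int ignoredFailures = 0;
  int removedAgents = 0;

private:
  FakeRegistry& registry_;
  std::set<std::string> markingUnreachable_;  // Registry op in flight.
  std::set<std::string> unreachable_;         // Registry op completed.
};
```

The sketch keys a "pending" set by agent ID. As a result, a second failure that arrives before the registry write completes is only logged and never enqueues another registry operation. The log line is informational rather than a warning, because the repeated failure is expected in this situation.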