----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71742/#review218653 -----------------------------------------------------------
Chatted on https://mesos.slack.com/archives/C8PDVCDE3/p1573688798006800. The approach LGTM. Let's fix the tests. :) src/slave/slave.cpp Lines 6415 (patched) <https://reviews.apache.org/r/71742/#comment306509> s/respond/respond to/ But the comment is basically reiterating the behavior which is pretty straightforward. We can add a bit of rationale for the behavior here: ``` // Don't respond to pings if the agent cannot detect the master (e.g., due to failing to connect to ZK) because it stops responding to control messages such as run/kill tasks. It's better to have the master eventually mark the agent as partitioned after prolonged ZK disconnection than to silently drop messages. ``` src/slave/slave.cpp Lines 6419 (patched) <https://reviews.apache.org/r/71742/#comment306510> "registered master" is not a valid concept here, let's say "because the master cannot be detected"? - Jiang Yan Xu On Nov. 13, 2019, 2:39 p.m., Xudong Ni wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71742/ > ----------------------------------------------------------- > > (Updated Nov. 13, 2019, 2:39 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: MESOS-10032 > https://issues.apache.org/jira/browse/MESOS-10032 > > > Repository: mesos > > > Description > ------- > > In the case agents lost ZooKeeper connections and resetting its > master to none and beginning to dropping control messages from the > master, agent should not respond pings from master. > > > Diffs > ----- > > src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 > > > Diff: https://reviews.apache.org/r/71742/diff/2/ > > > Testing > ------- > > ==========] 2322 tests from 222 test cases ran. (1038166 ms total) > [ PASSED ] 2321 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = > mesos::internal::slave::MesosContainerizer > > This failed test verifies that the agent responds to pings from the master > while the agent is performing recovery, this PR will break this scenario. > > > Thanks, > > Xudong Ni > >