-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218653
-----------------------------------------------------------



Chatted on https://mesos.slack.com/archives/C8PDVCDE3/p1573688798006800.

The approach LGTM. Let's fix the tests. :)


src/slave/slave.cpp
Lines 6415 (patched)
<https://reviews.apache.org/r/71742/#comment306509>

    s/respond/respond to/
    
    But the comment is basically reiterating the behavior which is pretty 
straightforward. We can add a bit of rationale for the behavior here: 
    
    ```
    // Don't respond to pings if the agent cannot detect the master (e.g., due 
to failing to connect to ZK) because it stops responding to control messages 
such as run/kill tasks. It's better to have the master eventually mark the 
agent as partitioned after prolonged ZK disconnection than to silently drop 
messages.
    ```



src/slave/slave.cpp
Lines 6419 (patched)
<https://reviews.apache.org/r/71742/#comment306510>

    "registered master" is not a valid concept here, let's say "because the 
master cannot be detected"?


- Jiang Yan Xu


On Nov. 13, 2019, 2:39 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71742/
> -----------------------------------------------------------
> 
> (Updated Nov. 13, 2019, 2:39 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-10032
>     https://issues.apache.org/jira/browse/MESOS-10032
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> In the case agents lost ZooKeeper connections and resetting its
> master to none and beginning to dropping control messages from the
> master, agent should not respond pings from master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
> 
> 
> Diff: https://reviews.apache.org/r/71742/diff/2/
> 
> 
> Testing
> -------
> 
> ==========] 2322 tests from 222 test cases ran. (1038166 ms total)
> [  PASSED  ] 2321 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = 
> mesos::internal::slave::MesosContainerizer
> 
> This failed test verifies that the agent responds to pings from the master 
> while the agent is performing recovery, this PR will break this scenario.
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Reply via email to