Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

Jiang Yan Xu Fri, 05 Jan 2018 11:12:12 -0800


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10061 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10062>
> >
> >     Perhaps move all the logic around determining the unreachable task down 
> > here.
> >     
> >     ```
> >     // We need to inform `removeTask` whether the task became unreachable 
> > due to the agent being removed. Note that terminal tasks are not considered 
> > unreachable. We can not rely on the task's state because it may have been 
> > set to `TASK_LOST` for backwards-compatibility.
> >     bool unreachable =
> >       unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
> >     ```
> >     
> >     Inlining `unreachable` is probably fine too but separating it helps 
> > clarify it with comments?
> 
> James Peach wrote:
>     You have to sample the task state before the update because it might 
> transition to a terminal state at that time (and LOST is terminal).


Right, so put the line above `updateTask`? The main thing I am commenting on is 
that we can consolidate the logic without first defining `bool unreachable = 
true;` much far above?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7637 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7637>
> >
> >     You could just wait for the the status update (being droppped) instead 
> > of capaturing `Slave::executorTerminated` and risk the race condition of 
> > the master not receiving the status update before the agent is deemed 
> > partitioned?
> 
> James Peach wrote:
>     Dealing with all the status updates was less reliable.

Do you see the race condition I was trying to point out?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7686 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7686>
> >
> >     No need to settle?
> 
> James Peach wrote:
>     This is settling to ensure that events following from marking the agent 
> down are fully processed.

Oh ok that's right.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
-----------------------------------------------------------


On Jan. 5, 2018, 11:06 a.m., James Peach wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------------------------------------------------------
> 
> (Updated Jan. 5, 2018, 11:06 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
>     https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/4/
> 
> 
> Testing
> -------
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

Reply via email to