> On Aug. 28, 2018, 1:16 a.m., Alexander Rukletsov wrote:
> > src/checks/checker_process.cpp
> > Lines 878-889 (original), 890-901 (patched)
> > <https://reviews.apache.org/r/68495/diff/1/?file=2077041#file2077041line890>
> >
> >     It looks like we should always call `waitNestedContainer()` after we 
> > said `previousCheckContainerId = checkContainerId;`. For example here.
> >     
> >     Maybe it makes sense to call `waitNestedContainer()` right in the 
> > beginning? We can end up calling it twice, but I think it's fine?
> 
> Qian Zhang wrote:
>     > Maybe it makes sense to call waitNestedContainer() right in the 
> beginning? We can end up calling it twice, but I think it's fine?
>     
>     That means we will call agent API `WAIT_NESTED_CONTAINER` twice for each 
> successful launch of check container, I think that might be a burden for 
> agent in a large scale env. So I'd still prefer to call it only in the places 
> where we have to do it.
> 
> Qian Zhang wrote:
>     And for the case (L879:L888, i.e., the connection to agent failed) that 
> you pointed out above, I think when the connection to agent is back (e.g., 
> agent starts up again), the check container will be treated as orphan 
> container and destroyed by agent, and then we will remove it here: 
> https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L616:L638.
>  However I am going to post another patch to change these codes 
> (https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L660:L664)
>  to something like:
>     ```
>               promise->discard();
>             } else {
>               previousCheckContainerId = None();
>               _nestedCommandCheck(promise, cmd, nested);
>             }
>     ```
>     In this way, if we fail to remove the check container (e.g., due to agent 
> has not finished recovery, or the check container is still in `DESTROYING` 
> state), we will try to remove it again.
> 
> Qian Zhang wrote:
>     I posted another patch https://reviews.apache.org/r/68555/ as I mentioned 
> above.
> 
> Alexander Rukletsov wrote:
>     So the assumption you're making is that if `Connection::send()` fails, 
> the container will not start and hence we don't need to call `wait()` before 
> we destroy? It is not obvious to a reader and requires some non-local 
> knowledge about how the UCR works. Can you please leave a comment documenting 
> this assumption so that when it changes and someone will be trying to figure 
> out why checks sometimes started fail, they have a hint : )

> So the assumption you're making is that if Connection::send() fails, the 
> container will not start and hence we don't need to call wait() before we 
> destroy?

Actually the check container could be started even if `Connection::send()` 
fails, e.g., agent is killed right after the check container is launched but 
before the agent sends the launch result to the default executor. My point is, 
in this case the checker container will eventually destroyed by the agent after 
agent is started up again and we will eventually remove the check container 
because of https://reviews.apache.org/r/68555/ (i.e., keep retrying the remove 
until the check container is removed).


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68495/#review207984
-----------------------------------------------------------


On Aug. 24, 2018, 5:54 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68495/
> -----------------------------------------------------------
> 
> (Updated Aug. 24, 2018, 5:54 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik, Alexander Rukletsov, Gastón Kleiman, 
> and Gilbert Song.
> 
> 
> Bugs: MESOS-8568
>     https://issues.apache.org/jira/browse/MESOS-8568
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Made command check always waits before removing the nested container.
> 
> 
> Diffs
> -----
> 
>   src/checks/checker_process.cpp 77a76f465fe57eab89f027b5acb74c2339551678 
> 
> 
> Diff: https://reviews.apache.org/r/68495/diff/1/
> 
> 
> Testing
> -------
> 
> sudo make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to