> On Aug. 27, 2018, 5:16 p.m., Alexander Rukletsov wrote:
> > src/checks/checker_process.cpp
> > Lines 878-889 (original), 890-901 (patched)
> > <https://reviews.apache.org/r/68495/diff/1/?file=2077041#file2077041line890>
> >
> >     It looks like we should always call `waitNestedContainer()` after we 
> > said `previousCheckContainerId = checkContainerId;`. For example here.
> >     
> >     Maybe it makes sense to call `waitNestedContainer()` right in the 
> > beginning? We can end up calling it twice, but I think it's fine?
> 
> Qian Zhang wrote:
>     > Maybe it makes sense to call waitNestedContainer() right in the 
> beginning? We can end up calling it twice, but I think it's fine?
>     
>     That means we will call agent API `WAIT_NESTED_CONTAINER` twice for each 
> successful launch of check container, I think that might be a burden for 
> agent in a large scale env. So I'd still prefer to call it only in the places 
> where we have to do it.
> 
> Qian Zhang wrote:
>     And for the case (L879:L888, i.e., the connection to agent failed) that 
> you pointed out above, I think when the connection to agent is back (e.g., 
> agent starts up again), the check container will be treated as orphan 
> container and destroyed by agent, and then we will remove it here: 
> https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L616:L638.
>  However I am going to post another patch to change these codes 
> (https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L660:L664)
>  to something like:
>     ```
>               promise->discard();
>             } else {
>               previousCheckContainerId = None();
>               _nestedCommandCheck(promise, cmd, nested);
>             }
>     ```
>     In this way, if we fail to remove the check container (e.g., due to agent 
> has not finished recovery, or the check container is still in `DESTROYING` 
> state), we will try to remove it again.
> 
> Qian Zhang wrote:
>     I posted another patch https://reviews.apache.org/r/68555/ as I mentioned 
> above.

So the assumption you're making is that if `Connection::send()` fails, the 
container will not start and hence we don't need to call `wait()` before we 
destroy? It is not obvious to a reader and requires some non-local knowledge 
about how the UCR works. Can you please leave a comment documenting this 
assumption so that when it changes and someone will be trying to figure out why 
checks sometimes started fail, they have a hint : )


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68495/#review207984
-----------------------------------------------------------


On Aug. 24, 2018, 9:54 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68495/
> -----------------------------------------------------------
> 
> (Updated Aug. 24, 2018, 9:54 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik, Alexander Rukletsov, Gastón Kleiman, 
> and Gilbert Song.
> 
> 
> Bugs: MESOS-8568
>     https://issues.apache.org/jira/browse/MESOS-8568
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Made command check always waits before removing the nested container.
> 
> 
> Diffs
> -----
> 
>   src/checks/checker_process.cpp 77a76f465fe57eab89f027b5acb74c2339551678 
> 
> 
> Diff: https://reviews.apache.org/r/68495/diff/1/
> 
> 
> Testing
> -------
> 
> sudo make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to