[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591103#comment-16591103
 ] 

Qian Zhang edited comment on MESOS-8568 at 8/24/18 3:20 AM:
------------------------------------------------------------

[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server 
process is launched, it just [waits on a 
promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181],
 and that promise is only set when an `ATTACH_CONTAINER_OUTPUT` call is 
made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never 
be made because the check container failed to launch, so we have to wait 5s 
for the `SIGTERM`.
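
To make the blocking behaviour concrete, here is a rough sketch of the pattern in plain C++. It is not the switchboard code: `UnblockSignal`, `runServer`, and the `timeout` parameter are names invented for illustration, and a background thread stands in for whoever sends the `SIGTERM` after the delay.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for the promise the server waits on.
// None of these names are Mesos APIs.
class UnblockSignal {
public:
  UnblockSignal() : future_(promise_.get_future().share()) {}

  // Fulfil the promise once; later calls are ignored.
  // Returns true if this call was the one that set it.
  bool set(const std::string& reason) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (done_) return false;
    done_ = true;
    promise_.set_value(reason);
    return true;
  }

  // Block until one of the triggers fires and report which one.
  std::string wait() const { return future_.get(); }

private:
  std::mutex mutex_;
  bool done_ = false;
  std::promise<std::string> promise_;
  std::shared_future<std::string> future_;
};

// Simulates the server. `attach` says whether an attach call ever
// arrives; `timeout` models the delay before the signal is sent.
std::string runServer(bool attach, std::chrono::milliseconds timeout) {
  UnblockSignal signal;

  // Stand-in for the delayed SIGTERM.
  std::thread killer([&signal, timeout] {
    std::this_thread::sleep_for(timeout);
    signal.set("SIGTERM");
  });

  // Stand-in for an ATTACH_CONTAINER_OUTPUT call arriving right away.
  if (attach) signal.set("ATTACH_CONTAINER_OUTPUT");

  std::string reason = signal.wait();
  killer.join();
  return reason;
}
```

With `attach` set to false, `runServer` can only return after the full `timeout` has elapsed, which mirrors the 5s wait we see when the check container fails to launch.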

I am not quite sure which cases the [SIGTERM & 5s 
timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818]
 is for; maybe [~jieyu] and [~klueska] have more info?


> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Before launching a nested container for a subsequent 
> check, the checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove the previous nested container. Hence, 
> the `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER`, which 
> ensures that the nested container has terminated and can be removed/cleaned up.
> In case of a launch failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the following attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
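
The ordering requested in the description can be illustrated with a small, self-contained sketch. This is not the checker library: `FakeAgent` and `cleanup` are hypothetical names, and the fake agent simply assumes the container terminates while it is being waited on.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical in-memory agent, for illustration only.
class FakeAgent {
public:
  void launch(const std::string& id) { running_.insert(id); }

  // Models WAIT_NESTED_CONTAINER: returns once the container has
  // terminated (here we just assume it terminates while we wait).
  void wait(const std::string& id) { running_.erase(id); }

  // Models REMOVE_NESTED_CONTAINER: refuses to remove a container
  // that has not terminated yet.
  void remove(const std::string& id) {
    if (running_.count(id) > 0) {
      throw std::runtime_error("Nested container has not terminated yet");
    }
    removed_.insert(id);
  }

  bool isRemoved(const std::string& id) const {
    return removed_.count(id) > 0;
  }

private:
  std::set<std::string> running_;
  std::set<std::string> removed_;
};

// Cleans up a previous check container. With `waitFirst` set, the wait
// always precedes the remove, regardless of how the launch went.
// Returns true if the container was removed.
bool cleanup(FakeAgent& agent, const std::string& id, bool waitFirst) {
  try {
    if (waitFirst) agent.wait(id);
    agent.remove(id);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Calling `cleanup` with `waitFirst` set to false on a container that is still running reproduces the "has not terminated yet" failure from the log above, while waiting first lets the removal succeed.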


