[ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656592#comment-16656592
 ] 

Qian Zhang commented on MESOS-9334:
-----------------------------------

I added some logging to the Mesos agent and libprocess, and found that the CNI 
isolator may hang in two places when isolating a container:
 # Wait for a CNI plugin (see [this 
code|https://github.com/apache/mesos/blob/1.7.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1327]
 for details). When this issue occurred, I found that the CNI plugin had 
actually finished its job and exited, but the CNI isolator hung while reading 
the CNI plugin's stdout/stderr.
 # Wait for `NetworkCniIsolatorSetup` (see [this 
code|https://github.com/apache/mesos/blob/1.7.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1154]
 for details). Similarly, when this issue occurred, I found that 
`NetworkCniIsolatorSetup` had actually finished its job and exited, but the CNI 
isolator hung while reading its stderr.

In both cases, I found that the [libevent 
poll|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L75]
 never returned for the fd used to read stdout/stderr, and there appeared to be 
no fd leak when this issue occurred.
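
To illustrate the symptom outside of Mesos, here is a minimal sketch (not Mesos or libprocess code) that drains a child's stderr with a poll loop and a deadline, so a read that never becomes ready is reported instead of blocking forever. The helper names `read_until_eof` and `run_and_capture_stderr` and the timeout values are hypothetical, chosen only for this example; it assumes a POSIX system where `select.poll` is available.

```python
import os
import select
import subprocess
import sys
import time


def read_until_eof(fd, timeout_secs):
    """Read `fd` until EOF or until the deadline passes.

    Returns (data, timed_out). Hypothetical helper for illustration only.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)  # POLLHUP is always reported as well
    deadline = time.monotonic() + timeout_secs
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The poll never reported readiness or EOF in time.
            return b"".join(chunks), True
        events = poller.poll(max(1, int(remaining * 1000)))  # milliseconds
        if not events:
            continue  # the loop re-checks the deadline
        chunk = os.read(fd, 4096)
        if not chunk:
            # EOF: every write end of the pipe has been closed.
            return b"".join(chunks), False
        chunks.append(chunk)


def run_and_capture_stderr(argv, timeout_secs=5.0):
    """Run `argv` and capture its stderr.

    Returns (data, timed_out, exited), where `exited` is a snapshot of
    whether the child had already exited when the read loop returned.
    """
    child = subprocess.Popen(argv, stderr=subprocess.PIPE)
    try:
        data, timed_out = read_until_eof(child.stderr.fileno(), timeout_secs)
        exited = child.poll() is not None
        return data, timed_out, exited
    finally:
        if child.poll() is None:
            child.kill()
        child.wait()
        child.stderr.close()


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "import sys; sys.stderr.write('done')"]
    print(run_and_capture_stderr(cmd))
```

A result with `timed_out` and `exited` both true would correspond to the situation described above: the child has finished, yet the reader never saw readiness or EOF on the pipe.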

> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
>                 Key: MESOS-9334
>                 URL: https://issues.apache.org/jira/browse/MESOS-9334
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Qian Zhang
>            Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
>  
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] 
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted 
> '/proc/5244/ns/net' to 
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' 
> for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] 
> Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
>  
> In the above logs, the state of container 
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at 
> 09:13:23, but did not transition to any other state until it was destroyed 
> due to the executor registration timeout (10 mins). The destroy can never 
> complete since it needs to wait for the container to finish isolating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
