[
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661975#comment-16661975
]
Qian Zhang commented on MESOS-9334:
-----------------------------------
After reading some libevent code and the code we use to call libevent, I think
the root cause of this issue is that, after we ask libevent to poll an fd, that
fd gets disabled inside libevent due to a race. Here is the flow:
# Container1 is launched and the cgroups memory subsystem calls
`cgroups::memory::oom::listen()` to listen for OOM events for this container.
Internally that function opens an fd, asks libevent to poll it, and returns a
future to the cgroups memory subsystem.
# Container1 exits, and when we destroy it, the cleanup method of the cgroups
memory subsystem discards the future obtained in #1. As a result,
`Listener::finalize()` is called (see [this
code|https://github.com/apache/mesos/blob/1.7.0/src/linux/cgroups.cpp#L1069:L1087]
for details), and it will
** Discard the future returned by the libevent poll, which causes
`pollDiscard()` to be called and in turn triggers `pollCallback` to be executed
*asynchronously* (see [this
code|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L66:L70]
for details).
** Close the fd opened in #1 *immediately*, which means the fd number can be
reused from now on.
# Container2 is launched, and the CNI isolator calls `io::read` to read the
stdout/stderr of the CNI plugin for this container. Internally `io::read`
*reuses* the fd number closed in #2 and asks libevent to poll it.
# Now `pollCallback` for container1 is executed. It deletes the poll object,
which triggers `event_free` to deallocate the event for this container (see
[this
code|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L50:L52]
for details). Internally `event_free` calls `event_del` ->
`event_del_internal` -> `evmap_io_del` -> `evsel->del` to *disable* the fd (see
[this
code|https://github.com/libevent/libevent/blob/release-2.0.22-stable/event-internal.h#L78:L79]
for details), but that fd is now being used to read the stdout/stderr of
container2 in #3. Since the fd is disabled inside libevent, the `io::read` we
issue in #3 never returns, so container2 is stuck in the `ISOLATING` state. A
standalone sketch of this interleaving is shown after the list.
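To make the interleaving easier to see outside of Mesos, below is a minimal,
hypothetical reproducer written directly against the libevent 2.0 API; the
helper `on_readable` and the pipe fds are illustrative and not taken from our
code base, and whether the final dispatch actually hangs depends on the
libevent version and backend. The ordering of `event_add`, `close`, fd reuse,
and `event_free` mirrors steps #1-#4 above:
{code:cpp}
// Hypothetical standalone sketch (not Mesos code): an fd is closed while its
// libevent event is still registered, the fd number is reused by a new fd,
// and only then is the stale event freed. Build with: g++ repro.cpp -levent
#include <unistd.h>
#include <sys/time.h>

#include <cstdio>

#include <event2/event.h>

static void on_readable(evutil_socket_t fd, short /*what*/, void* /*arg*/)
{
  char buf[16];
  ssize_t n = read(fd, buf, sizeof(buf));
  printf("Callback fired on fd %d, read %zd bytes\n", (int)fd, n);
}

int main()
{
  struct event_base* base = event_base_new();

  // Step #1: the "OOM listener" registers a read event on an fd.
  int pipe1[2];
  pipe(pipe1);
  struct event* oldEvent =
    event_new(base, pipe1[0], EV_READ, &on_readable, NULL);
  event_add(oldEvent, NULL);

  // Step #2: the listener is finalized; the fd is closed *immediately*, but
  // freeing the corresponding event is deferred (in Mesos this happens
  // asynchronously in pollCallback).
  close(pipe1[0]);
  close(pipe1[1]);

  // Step #3: a new pipe is opened and the kernel may hand back the fd number
  // that was just closed. The "CNI isolator" registers a read event on it
  // (this is what io::read does internally).
  int pipe2[2];
  pipe(pipe2);
  printf("Old fd %d, new fd %d\n", pipe1[0], pipe2[0]);
  struct event* newEvent =
    event_new(base, pipe2[0], EV_READ, &on_readable, NULL);
  event_add(newEvent, NULL);

  // Step #4: the deferred cleanup for the old event finally runs. The old
  // event still refers to the (now reused) fd number, so freeing it can tear
  // down the backend registration for that fd, and the new event may never
  // be delivered even after data is written to pipe2.
  event_free(oldEvent);

  write(pipe2[1], "x", 1);

  // Give the loop a 2 second budget so the demo exits even if the read
  // event is never delivered (which is the symptom of this bug).
  struct timeval timeout = {2, 0};
  event_base_loopexit(base, &timeout);
  event_base_dispatch(base);

  event_free(newEvent);
  event_base_free(base);
  return 0;
}
{code}
If the two printed fd numbers are equal, the second `event_add` happens while
the stale event from step #1 is still alive on the same fd number, which is
exactly the window in which the deferred `event_free` can leave the fd disabled
and the new read event undelivered.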
> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Priority: Critical
>
> We found that a UCR container may be stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122]
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted
> '/proc/5244/ns/net' to
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns'
> for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459]
> Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
> In the above logs, the state of container
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` was transitioned to `ISOLATING` at
> 09:13:23, but it was not transitioned to any other state until the container
> was destroyed due to the executor registration timeout (10 mins). And the
> destroy can never complete since it needs to wait for the container to finish
> isolating.