[
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658795#comment-16658795
]
Qian Zhang commented on MESOS-9334:
-----------------------------------
I added some logging to `libevent_poll.cpp` (see the diff below for details)
and reproduced this issue a couple of times.
{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
};
-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
{
Poll* poll = reinterpret_cast<Poll*>(arg);
+ LOG(INFO) << "==========pollCallback starts with fd " << fd
+ << " and with poll " << poll << "==========";
+
if (poll->promise.future().hasDiscard()) {
+ LOG(INFO) << "==========pollCallback discards with fd "
+ << fd << "==========";
+
poll->promise.discard();
} else {
// Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
short events =
((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);
+ LOG(INFO) << "==========pollCallback sets promise with fd " << fd
+ << " and with events " << events << "==========";
+
poll->promise.set(events);
}
// Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
// which makes the event non-pending.
delete poll;
+
+ LOG(INFO) << "==========pollCallback ends with fd " << fd << "==========";
}
void pollDiscard(const std::weak_ptr<event>& ev, short events)
{
+ LOG(INFO) << "==========pollDiscard is called==========";
+
// Discarding inside the event loop prevents `pollCallback()` from being
// called twice if the future is discarded.
run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)
internal::Poll* poll = new internal::Poll();
+ LOG(INFO) << "==========libevent starts polling with fd " << fd
+ << " and with poll " << poll << "==========";
+
Future<short> future = poll->promise.future();
// Convert io::READ / io::WRITE to libevent specific values of these
{code}
Here is what I found in the agent log when this issue occurred (fd 48 is the
stderr file descriptor of `NetworkCniIsolatorSetup`):
{code:java}
I1021 15:57:45.000000 2116 libevent_poll.cpp:96] ==========libevent starts polling with fd 48 and with poll 0x7f60df029eb0==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:41] ==========pollCallback starts with fd 48 and with poll 0x7f60e6e56c70==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:45] ==========pollCallback discards with fd 48==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:65] ==========pollCallback ends with fd 48==========
{code}
We can see that libevent started polling fd 48 with the poll object at address
0x7f60df029eb0, but when `pollCallback` was called for fd 48, the poll object
had a different address (0x7f60e6e56c70), and that poll was discarded (see the
third log line)!
When I searched for 0x7f60e6e56c70 in the agent log, I found:
{code:java}
I1021 15:57:22.000000 2115 memory.cpp:478] Started listening for OOM events for container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.000000 2115 libevent_poll.cpp:96] ==========libevent starts polling with fd 48 and with poll 0x7f60e6e56c70==========
{code}
So the poll object 0x7f60e6e56c70 was created 23 seconds earlier with the same
file descriptor (fd 48), which was being used to listen for OOM events for
another container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5), and that container
was destroyed at 15:57:44, right before the agent started to wait on the
stderr of `NetworkCniIsolatorSetup` (15:57:45).
{code:java}
I1021 15:57:44.000000 2114 containerizer.cpp:2459] Destroying container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}
I reproduced this issue a couple of times, and the observations from the agent
log were the same as the above each time.
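For reference, below is a standalone sketch I put together directly against
libevent to illustrate the kind of fd-reuse hazard the above logs suggest. It
is not Mesos code (the `Poll` struct, the pipes and the fd numbers are just
stand-ins): libevent keys pending I/O events by fd number and hands the
callback back whatever pointer was passed to `event_new()`, so if an event is
neither fired nor freed before its fd is closed and the fd number is reused,
activity on the unrelated new fd can be attributed to the stale event and its
old argument, or, depending on the backend, nothing may fire for the new fd at
all.
{code:java}
// Standalone illustration (NOT Mesos code): what can happen when an fd that
// still has a pending libevent event is closed and its number is reused.
//
// Build (assuming the libevent development headers are installed):
//   g++ fd_reuse_sketch.cpp -levent -o fd_reuse_sketch
#include <event2/event.h>

#include <sys/time.h>
#include <unistd.h>

#include <iostream>

struct Poll { int id; };  // Stand-in for libprocess's internal::Poll.

static void pollCallback(evutil_socket_t fd, short what, void* arg)
{
  // libevent passes back, verbatim, the pointer that was given to
  // event_new(); seeing an unexpected pointer here means a *different*
  // (stale) event registered for this fd number fired.
  Poll* poll = reinterpret_cast<Poll*>(arg);
  std::cout << "pollCallback for fd " << fd << ", events " << what
            << ", poll id " << poll->id << std::endl;
  delete poll;
}

int main()
{
  event_base* base = event_base_new();

  // First user of the fd number (think: listening for OOM events of the old
  // container). The pipe's read end gets some fd number N.
  int pipe1[2];
  if (pipe(pipe1) != 0) { return 1; }

  Poll* poll1 = new Poll{1};
  event* ev1 = event_new(base, pipe1[0], EV_READ, pollCallback, poll1);
  event_add(ev1, nullptr);

  // The fd is closed while ev1 is still pending; ev1 is never freed.
  close(pipe1[0]);
  close(pipe1[1]);

  // Second user (think: the stderr of NetworkCniIsolatorSetup). The new
  // pipe's read end very likely reuses the same fd number N.
  int pipe2[2];
  if (pipe(pipe2) != 0) { return 1; }

  Poll* poll2 = new Poll{2};
  event* ev2 = event_new(base, pipe2[0], EV_READ, pollCallback, poll2);
  event_add(ev2, nullptr);

  // Make the new fd readable. Depending on the libevent backend, the stale
  // poll1 callback may fire for this fd number, both callbacks may fire, or
  // nothing may fire at all -- in which case the promise behind poll2 (in
  // the real code) is never satisfied, matching the stuck ISOLATING symptom.
  (void)write(pipe2[1], "x", 1);

  timeval timeout{1, 0};
  event_base_loopexit(base, &timeout);  // Make sure the demo cannot hang.
  event_base_dispatch(base);

  event_base_free(base);
  return 0;
}
{code}
This is only a minimal illustration of the suspected mechanism; in the agent
the registration and discard go through `run_in_event_loop()` on the event
loop thread, but the reuse of the fd number is the common ingredient.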
> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Qian Zhang
> Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted '/proc/5244/ns/net' to '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
> In the above logs, the state of container
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` was transitioned to `ISOLATING` at
> 09:13:23, but was not transitioned to any other state until the container was
> destroyed due to the executor registration timeout (10 mins). And the destroy
> can never complete since it needs to wait for the container to finish
> isolating.