[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658795#comment-16658795 ]
Qian Zhang edited comment on MESOS-9334 at 10/22/18 9:55 AM:
-------------------------------------------------------------

I added some logging to `libevent_poll.cpp` (see the diff below for details) and reproduced this issue a couple of times.

{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
 };

-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
 {
   Poll* poll = reinterpret_cast<Poll*>(arg);

+  LOG(INFO) << "==========pollCallback starts with fd " << fd
+            << " and with poll " << poll << "==========";
+
   if (poll->promise.future().hasDiscard()) {
+    LOG(INFO) << "==========pollCallback discards with fd "
+              << fd << "==========";
+
     poll->promise.discard();
   } else {
     // Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
     short events =
       ((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);

+    LOG(INFO) << "==========pollCallback sets promise with fd " << fd
+              << " and with events " << events << "==========";
+
     poll->promise.set(events);
   }

   // Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
   // which makes the event non-pending.
   delete poll;
+
+  LOG(INFO) << "==========pollCallback ends with fd " << fd << "==========";
 }

 void pollDiscard(const std::weak_ptr<event>& ev, short events)
 {
+  LOG(INFO) << "==========pollDiscard is called==========";
+
   // Discarding inside the event loop prevents `pollCallback()` from being
   // called twice if the future is discarded.
   run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)

   internal::Poll* poll = new internal::Poll();

+  LOG(INFO) << "==========libevent starts polling with fd " << fd
+            << " and with poll " << poll << "==========";
+
   Future<short> future = poll->promise.future();

   // Convert io::READ / io::WRITE to libevent specific values of these
{code}

Here is what I found in the agent log when this issue occurred (fd 48 is the stderr file descriptor of `NetworkCniIsolatorSetup`):

{code:java}
I1021 15:57:45.000000 2116 libevent_poll.cpp:96] ==========libevent starts polling with fd 48 and with poll 0x7f60df029eb0==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:41] ==========pollCallback starts with fd 48 and with poll 0x7f60e6e56c70==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:45] ==========pollCallback discards with fd 48==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:65] ==========pollCallback ends with fd 48==========
{code}

We can see that libevent started polling fd 48 with the poll object at address 0x7f60df029eb0, but when `pollCallback` was invoked for fd 48, it received a different poll object (0x7f60e6e56c70), one whose future had already been discarded (see the third log line)!
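Before digging into where 0x7f60e6e56c70 came from, here is a minimal standalone sketch of the fd-recycling hazard itself. To be clear, this is not Mesos code and not a claim about the exact root cause: it assumes libevent 2.x with the default epoll backend on Linux, and the `Poll` struct, `owner` labels, and pipe setup are purely illustrative.

{code:java}
// Minimal standalone sketch of the fd-recycling hazard (NOT Mesos code;
// assumes libevent 2.x on Linux with the default epoll backend).
// Build with: g++ -std=c++11 sketch.cpp -levent
#include <event2/event.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>

struct Poll
{
  const char* owner;  // illustrative label, not part of libevent
};

static void pollCallback(evutil_socket_t fd, short /*what*/, void* arg)
{
  Poll* poll = static_cast<Poll*>(arg);
  std::printf("fd %d ready, handler owned by '%s'\n", (int)fd, poll->owner);
  delete poll;
}

static void onTimeout(evutil_socket_t, short, void* arg)
{
  // If we get here, the new owner's readiness was never delivered:
  // the same "poll never returns" symptom as in the issue title.
  std::puts("timed out: the new owner's poll never returned");
  event_base_loopbreak(static_cast<event_base*>(arg));
}

int main()
{
  event_base* base = event_base_new();

  // Old owner (think: the OOM listener) registers interest in fd N,
  // but its registration is never event_del()'d before the fd goes away.
  int pipe1[2];
  if (pipe(pipe1) != 0) return 1;
  event* oldEvent = event_new(
      base, pipe1[0], EV_READ, pollCallback, new Poll{"old owner"});
  event_add(oldEvent, nullptr);

  // Closing the fds silently drops the kernel-side epoll registration,
  // but libevent still believes fd N is being polled for EV_READ.
  close(pipe1[0]);
  close(pipe1[1]);

  // New owner (think: the stderr reader) gets the recycled fd number N
  // (fd allocation hands out the lowest available number) and registers
  // interest in it with the same numeric fd.
  int pipe2[2];
  if (pipe(pipe2) != 0) return 1;
  event* newEvent = event_new(
      base, pipe2[0], EV_READ, pollCallback, new Poll{"new owner"});
  event_add(newEvent, nullptr);

  write(pipe2[1], "x", 1);  // fd N really is readable now...

  // ...yet the readiness may never be delivered, because registrations
  // are keyed by fd number and the stale one can shadow the new one.
  timeval timeout{2, 0};
  event* timer = evtimer_new(base, onTimeout, base);
  evtimer_add(timer, &timeout);
  event_base_dispatch(base);

  event_free(timer);
  event_free(oldEvent);
  event_free(newEvent);
  event_base_free(base);
  return 0;
}
{code}

On a backend that keys registrations by numeric fd, the stale registration can either swallow the new owner's readiness (as sketched here) or route a callback to the old owner's state (as in the trace above); either way, the new poll never completes normally.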
When I searched for 0x7f60e6e56c70 in the agent log, I found:

{code:java}
I1021 15:57:22.000000 2115 memory.cpp:478] Started listening for OOM events for container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.000000 2115 libevent_poll.cpp:96] ==========libevent starts polling with fd 48 and with poll 0x7f60e6e56c70==========
{code}

So the poll object 0x7f60e6e56c70 had been created 23 seconds earlier for the same file descriptor (fd 48), which was then being used to listen for OOM events for another container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5). That container was destroyed at 15:57:44, right before the agent started waiting on the stderr of `NetworkCniIsolatorSetup` at 15:57:45.

{code:java}
I1021 15:57:44.000000 2114 containerizer.cpp:2459] Destroying container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}

I reproduced this issue several times, and the observations in the agent log were the same as above each time.
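For completeness, here is one conventional way to avoid this class of hazard; a hedged sketch under my own assumptions (`stopPollingAndClose` is a hypothetical helper, not the actual Mesos fix): tear the event registration down synchronously with `event_del()` before the file descriptor is closed, so a recycled fd number can never be matched against a stale registration.

{code:java}
// Hedged sketch (assumes libevent 2.x; NOT the actual Mesos fix):
// ensure the fd outlives its event registration.
#include <event2/event.h>
#include <unistd.h>

void stopPollingAndClose(event* ev, int fd)
{
  // event_del() makes the event non-pending; after it returns, the
  // event loop will no longer dispatch readiness for this registration.
  event_del(ev);
  event_free(ev);

  // Only close the fd once no registration refers to it, so the kernel
  // recycling this fd number cannot collide with a stale event.
  close(fd);
}
{code}

In the trace above the ordering is the other way around: the discard is deferred into the event loop via `run_in_event_loop`, which leaves a window where fd 48 is closed and recycled while the OOM listener's registration is still pending. In a multithreaded setting this teardown would itself need to run on the event-loop thread (or use libevent's locking), which is presumably why the discard is deferred; the sketch only illustrates the ordering constraint.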
> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
>                 Key: MESOS-9334
>                 URL: https://issues.apache.org/jira/browse/MESOS-9334
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Qian Zhang
>            Priority: Critical
>
> We found that a UCR container may be stuck at `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted '/proc/5244/ns/net' to '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
> In the above logs, the state of container `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at 09:13:23 but never transitioned to any other state until the container was destroyed due to the executor registration timeout (10 mins). And the destroy can never complete, since it needs to wait for the container to finish isolating.