[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669327#comment-16669327 ]
Qian Zhang commented on MESOS-9334:
-----------------------------------
commit 610064942d4a75f16f045480ca9e3414d37f1ecc
Author: Benjamin Mahler
Date: Fri Oct 26 10:50:54 2018 +0800
Fixed an early fd close in the cgroups event notifier.
The cgroups event notifier was closing the eventfd while an
`io::read()` operation may be in progress. This can lead to
bugs where the fd is reused and then read by a stale `io::read()`.
Review: https://reviews.apache.org/r/69123/
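The fd reuse hazard described in the commit above can be reproduced with plain POSIX calls. The following is only a minimal sketch, not Mesos code: the function name `readThroughStaleFd` is hypothetical, and ordinary pipes stand in for the eventfd and the pending `io::read()`. It relies on the POSIX guarantee that `dup()` returns the lowest available descriptor number, and it assumes a single-threaded process.

```cpp
#include <unistd.h>

// Hypothetical helper (not part of Mesos): returns the byte obtained by
// reading through a stale descriptor number, or '\0' on any error.
char readThroughStaleFd()
{
  int a[2];
  int b[2];
  if (::pipe(a) != 0) {
    return '\0';
  }
  if (::pipe(b) != 0) {
    ::close(a[0]);
    ::close(a[1]);
    return '\0';
  }

  // Pretend a pending read still remembers this descriptor number.
  const int stale = a[0];

  // Early close: the number is returned to the kernel for reuse.
  ::close(a[0]);

  // dup() hands out the lowest free number, which is now `stale`,
  // but it refers to the read end of pipe `b`, an unrelated resource.
  const int fresh = ::dup(b[0]);

  char c = '\0';
  if (fresh == stale && ::write(b[1], "x", 1) == 1) {
    // The "stale" reader silently consumes data that belongs to pipe `b`.
    if (::read(stale, &c, 1) != 1) {
      c = '\0';
    }
  }

  ::close(a[1]);
  ::close(b[0]);
  ::close(b[1]);
  if (fresh >= 0) {
    ::close(fresh);
  }
  return c;
}
```

In this sketch the read through the old number succeeds and returns data from the wrong pipe, which is why closing the descriptor only after the read has finished avoids the problem.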
commit 9bd3a32b68b165d9a1a45548cebe3b22069cecc0
Author: Qian Zhang
Date: Fri Oct 26 10:57:20 2018 +0800
Ensured failed / discarded cgroups OOM notification is logged.
Failed or discarded OOM notifications in the cgroups memory
subsystem were not being logged, due to the continuation being
accidentally set up using `onReady` rather than `onAny`.
Review: https://reviews.apache.org/r/69188
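The difference between the two continuation types named in the commit above can be illustrated with a toy future. This is a sketch under stated assumptions: `ToyFuture`, `State`, and `logsFor` are hypothetical names and this is not the libprocess implementation. It only models the behavior the commit message describes, where an `onReady` callback runs on success only while an `onAny` callback runs for every terminal state.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical terminal states of a toy future.
enum class State { READY, FAILED, DISCARDED };

// Toy stand-in for a future; NOT the libprocess implementation.
class ToyFuture
{
public:
  // Registered callbacks run only if the future becomes READY.
  void onReady(std::function<void()> f) { ready.push_back(std::move(f)); }

  // Registered callbacks run for every terminal state.
  void onAny(std::function<void(State)> f) { any.push_back(std::move(f)); }

  // Moves the future to a terminal state and runs matching callbacks.
  void complete(State s)
  {
    if (s == State::READY) {
      for (auto& f : ready) { f(); }
    }
    for (auto& f : any) { f(s); }
  }

private:
  std::vector<std::function<void()>> ready;
  std::vector<std::function<void(State)>> any;
};

// Records which callbacks fire when the future ends in state `s`.
std::vector<std::string> logsFor(State s)
{
  std::vector<std::string> logs;
  ToyFuture future;
  future.onReady([&logs]() { logs.push_back("onReady"); });
  future.onAny([&logs](State) { logs.push_back("onAny"); });
  future.complete(s);
  return logs;
}
```

With a failed or discarded toy future, only the `onAny` callback records anything, so a logging continuation attached via `onReady` would stay silent in exactly the cases the commit wants logged.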
> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122]
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted
> '/proc/5244/ns/net' to
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns'
> for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459]
> Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
> In the above logs, container
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at
> 09:13:23 but did not transition to any other state until it was destroyed
> due to the executor registration timeout (10 mins). The destroy can never
> complete because it must wait for the container to finish isolating.