[
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586013#comment-16586013
]
Qian Zhang commented on MESOS-9131:
-----------------------------------
The root cause of this issue is that the I/O switchboard server process never
exits, so `IOSwitchboard::cleanup` never returns. As a result, the nested
container launched by the `dcos` command is stuck in the `DESTROYING` state.
It seems this issue was introduced by [https://reviews.apache.org/r/65122], in
particular by this code diff:
{code:java}
@@ -1217,3 +1219,8 @@
     .then(defer(self(), [this]() {
-      terminate(self(), false);
+      redirectFinished = true;
+      // If IO redirect is finished, we need to give a chance
+      // to send a http response for an input connection.
+      if (!inputConnected) {
+        terminate(self(), false);
+      }
       return Nothing();
{code}
After I reverted the above change, the issue went away, i.e., the
containers could be destroyed successfully.
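
To make the effect of that condition easier to see, here is a minimal sketch of the
termination decision before and after the change. It is a toy model, not the actual
Mesos code: `SwitchboardModel`, `onRedirectFinishedOld`, `onRedirectFinishedNew` and
`terminatesAfterRedirect` are hypothetical names, and `terminate(self(), false)` is
modelled as simply setting a `terminated` flag. The sketch only covers the callback
shown in the diff and says nothing about other code paths that might terminate the
process later.

```cpp
// Toy model of the I/O switchboard server state (hypothetical, not the
// real Mesos class).
struct SwitchboardModel {
  bool inputConnected = false;    // an input connection is currently open
  bool redirectFinished = false;  // the IO redirect has completed
  bool terminated = false;        // stands in for terminate(self(), false)

  // Logic before the change: terminate as soon as the redirect finishes.
  void onRedirectFinishedOld() {
    terminated = true;
  }

  // Logic after the change, following the diff above: terminate only if
  // no input connection is open at that moment.
  void onRedirectFinishedNew() {
    redirectFinished = true;
    if (!inputConnected) {
      terminated = true;
    }
  }
};

// Returns whether the modelled server has terminated right after the
// redirect-finished callback ran. `patched` selects the new logic.
bool terminatesAfterRedirect(bool patched, bool inputConnected) {
  SwitchboardModel model;
  model.inputConnected = inputConnected;
  if (patched) {
    model.onRedirectFinishedNew();
  } else {
    model.onRedirectFinishedOld();
  }
  return model.terminated;
}
```

In this model, `terminatesAfterRedirect(true, true)` is false: with an input
connection open, the new logic leaves the server running once the redirect
finishes, which would match the observed behaviour if nothing else terminates it.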
> Health checks launching nested containers while a container is being
> destroyed lead to unkillable tasks
> -------------------------------------------------------------------------------------------------------
>
> Key: MESOS-9131
> URL: https://issues.apache.org/jira/browse/MESOS-9131
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization
> Affects Versions: 1.5.1
> Reporter: Jan Schlicht
> Assignee: Qian Zhang
> Priority: Blocker
> Labels: container-stuck
>
> A container might get stuck in {{DESTROYING}} state if there's a command
> health check that starts new nested containers while its parent container is
> getting destroyed.
> Here are some logs with unrelated lines removed. The
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep
> looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807]
> Container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354]
> Destroying container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968]
> Transitioning the state of container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514]
> Asked to destroy container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560]
> Using freezer to destroy cgroup
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327 3852 cgroups.cpp:3060] Freezing
> cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415]
> Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing
> cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599 3853 cgroups.cpp:1444]
> Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117 3837 http.cpp:3502] Processing
> LAUNCH_NESTED_CONTAINER_SESSION call for container
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692 3842 http.cpp:2758] Failed to
> launch container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
> Parent container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is
> in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826 3840 containerizer.cpp:2337]
> Attempted to destroy unknown container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456 3856 http.cpp:3078] Processing
> REMOVE_NESTED_CONTAINER call for container
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367 3857 http.cpp:3502] Processing
> LAUNCH_NESTED_CONTAINER_SESSION call for container
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.582137 3850 http.cpp:2758] Failed to
> launch container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211:
> Parent container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is
> in 'DESTROYING' state
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.583330 3844 containerizer.cpp:2337]
> Attempted to destroy unknown container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211
> ...
> {noformat}
> This stops when the framework reconciles and instructs Mesos to kill the
> task, which also results in the following call:
> {noformat}
> 2018-04-16 13:06:04: I0416 13:06:04.161623 3843 http.cpp:2966] Processing
> KILL_NESTED_CONTAINER call for container
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133'
> {noformat}
> Nothing else related to this container is logged following this line.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)