[
https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588881#comment-16588881
]
Stephan Erb commented on MESOS-9174:
------------------------------------
[~jieyu], we have found something interesting related to
[https://reviews.apache.org/r/62800].
I have not checked the entire cluster, but at first sight it looks as if
the problems are related to systemd.
*Nodes without recovery issues*
These nodes are running Mesos 1.6.1 and systemd 232-25+deb9u4 (Debian Stretch).
{code:java}
$ systemd-cgls
Control group /:
-.slice
├─mesos
│ ├─9b70ff19-238c-4520-978c-688b83e705ce
│ │ ├─ 5129 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch
...
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─mesos-agent.service
│ └─1472 /usr/sbin/mesos-agent --master=file:///etc/mesos-agent/zk ...
{code}
*Nodes with recovery issues*
These nodes are running Mesos 1.6.1 and systemd 237-3~bpo9+1 (Debian Stretch
backports).
{code:java}
$ systemd-cgls
Control group /:
-.slice
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─mesos-agent.service
│ ├─ 19151 haproxy -f haproxy.cfg -p haproxy.pid -sf 149
│ ├─ 39633 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch
│ ├─ 39638 sh -c ${MESOS_SANDBOX=.}/thermos_executor_wrapper.sh ....
│ ├─ 39639 python2.7 ...
│ ├─ 39684 /usr/bin/python2.7...
│ ├─ 39710 /usr/bin/python2.7 ...
│ ├─ 39714 /usr/bin/ruby /usr/bin/synapse -c synapse.conf
│ ├─ 39775 haproxy -f haproxy.cfg -p haproxy.pid -sf
│ ├─ 39837 /usr/bin/python2.7 ...
{code}
In particular, there is no {{mesos}} group/section, even though the hierarchy
clearly shows up in systemd-cgtop:
{code:java}
Control Group                                  Tasks   %CPU   Memory  Input/s Output/s
/                                               1700      -    45.2G        -        -
/mesos                                             -      -    38.4G        -        -
/mesos/144dc11d-dbd0-42e8-89e4-a72384e777df        -      -     1.6G        -        -
/mesos/15a8e488-495c-4db8-a11b-7e8277ec4c93        -      -     3.1G        -        -
/mesos/2a2b5913-2445-4111-9d18-71abc9f1f8cd        -      -  1021.2M        -        -
/mesos/2e1c5c91-6a80-4242-b105-023c1eb2c89d        -      -     2.6G        -        -
/mesos/356c5c0f-2ae0-4dfc-9415-d1dbeb172542        -      -   898.4M        -        -
/mesos/3baf4930-4332-4206-91d5-d39ea6bb3389        -      -     3.1G        -        -
/mesos/3d1b9554-911d-44ee-b204-fe622f02ef7a        -      -   845.0M        -        -
/mesos/431aa2a0-11e4-4bf3-b888-ee10cf689326        -      -     1.3G        -        -
/mesos/94f8e3bb-360a-4694-9359-4da10cb4e5df        -      -     1.2G        -        -
/mesos/9d1b3251-6c61-404e-88d0-03319d1a508c        -      -     3.2G        -        -
/mesos/b5bb9133-4093-4bc6-90c1-3656b20559bf        -      -   417.6M        -        -
/mesos/b89095dd-21bc-4255-86c8-14bd7cd0ac2a        -      -     1.5G        -        -
/system.slice                                   1137      -     8.7G        -        -
{code}
The systemd-cgtop output above is from the same faulty node from which I pulled
the attached mesos-agent.log.
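To cross-check that the {{/mesos}} hierarchy really exists outside of systemd's
own bookkeeping, one could also look at the cgroup filesystem directly. A quick
sketch, assuming the default cgroups v1 hierarchies mounted under
{{/sys/fs/cgroup}} (exact paths may differ per distribution):
{code:java}
# Illustrative cross-check only, assuming cgroups v1 under /sys/fs/cgroup.
# The per-container cgroups created by the agent should still be visible in
# the freezer hierarchy used by the linux launcher:
ls /sys/fs/cgroup/freezer/mesos/

# And the kernel's view of which cgroups the agent process itself belongs to:
cat /proc/$(pidof mesos-agent)/cgroup
{code}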
I will try to reproduce the issue by upgrading systemd in another test
environment and then report back. Newer systemd versions have changed the
behaviour of {{Delegate=}}, which could indeed be related to the observed issue.
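For reference, a minimal way to experiment with that hypothesis would be an
explicit delegation drop-in for the agent unit. This is a sketch only, assuming
the unit is named {{mesos-agent.service}}; it just illustrates the
{{Delegate=}} knob and is not a verified fix:
{code:java}
# /etc/systemd/system/mesos-agent.service.d/delegate.conf
# Illustrative drop-in: ask systemd to delegate the cgroup subtree to the
# agent, so that systemd does not clean up cgroups the agent created itself.
[Service]
Delegate=yes
{code}
The current value can be checked with {{systemctl show -p Delegate
mesos-agent.service}}; the drop-in would require a {{systemctl daemon-reload}}
and an agent restart to take effect.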
> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---------------------------------------------------------------------------
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.5.0, 1.6.1
> Reporter: Stephan Erb
> Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos
> agent takes down all Mesos containers. The containers die without an apparent
> cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state
> of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs.
> Everything just stops abruptly, as if the container were terminated externally
> without the executor being notified first. For further details, please see the
> attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea of what I
> should be looking at to narrow down the issue?