[ https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588881#comment-16588881 ]

Stephan Erb commented on MESOS-9174:
------------------------------------

[~jieyu], we have found something interesting related to 
[https://reviews.apache.org/r/62800]

I have not checked the entire cluster, but at first sight it seems as if 
there are problems related to systemd.

*Nodes without recovery issues*
 Those are running Mesos 1.6.1 and systemd 232-25+deb9u4 (Debian Stretch).
{code:java}
$ systemd-cgls
Control group /:
-.slice
├─mesos
│ ├─9b70ff19-238c-4520-978c-688b83e705ce
│ │ ├─ 5129 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch
...
├─init.scope
│ └─1 /sbin/init
└─system.slice
  ├─mesos-agent.service
  │ └─1472 /usr/sbin/mesos-agent --master=file:///etc/mesos-agent/zk ...
{code}
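For completeness, the same hierarchy can also be checked directly in the cgroup filesystem. This is only a sketch and assumes cgroup v1 mounted at {{/sys/fs/cgroup}} and the default agent flag {{--cgroups_root=mesos}}:
{code}
# Sketch: list the container cgroups directly in the freezer hierarchy
# (assumes cgroup v1 at /sys/fs/cgroup and the default --cgroups_root=mesos).
# On a healthy node this mirrors the /mesos section shown by systemd-cgls above.
$ ls /sys/fs/cgroup/freezer/mesos/
{code}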
*Nodes with recovery issues*
 Those are running Mesos 1.6.1 and systemd 237-3~bpo9+1 (Debian Stretch 
backports).
{code:java}
$ systemd-cgls
Control group /:
-.slice
├─init.scope
│ └─1 /sbin/init
└─system.slice
  ├─mesos-agent.service
  │ ├─ 19151 haproxy -f haproxy.cfg -p haproxy.pid -sf 149
  │ ├─ 39633 /usr/lib/x86_64-linux-gnu/mesos/mesos-containerizer launch
  │ ├─ 39638 sh -c ${MESOS_SANDBOX=.}/thermos_executor_wrapper.sh ....
  │ ├─ 39639 python2.7 ...
  │ ├─ 39684 /usr/bin/python2.7...
  │ ├─ 39710 /usr/bin/python2.7 ...
  │ ├─ 39714 /usr/bin/ruby /usr/bin/synapse -c synapse.conf
  │ ├─ 39775 haproxy -f haproxy.cfg -p haproxy.pid -sf
  │ ├─ 39837 /usr/bin/python2.7 ...
{code}
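To double-check where a given executor process actually ended up, one can read its {{/proc/<pid>/cgroup}} file. A sketch, using PID 39633 from the listing above:
{code}
# Sketch: print the cgroup membership of the mesos-containerizer process from
# the listing above, one line per hierarchy (<id>:<controllers>:<path>).
# On an affected node the name=systemd entry would be expected to point at
# system.slice/mesos-agent.service rather than a dedicated /mesos/<id> group.
$ cat /proc/39633/cgroup
{code}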
In particular, there is no {{mesos}} group/section, even though it shows up 
perfectly fine in systemd-cgtop:
{code:java}
Control Group                                 Tasks   %CPU   Memory  Input/s Output/s
/                                              1700      -    45.2G        -        -
/mesos                                            -      -    38.4G        -        -
/mesos/144dc11d-dbd0-42e8-89e4-a72384e777df       -      -     1.6G        -        -
/mesos/15a8e488-495c-4db8-a11b-7e8277ec4c93       -      -     3.1G        -        -
/mesos/2a2b5913-2445-4111-9d18-71abc9f1f8cd       -      -  1021.2M        -        -
/mesos/2e1c5c91-6a80-4242-b105-023c1eb2c89d       -      -     2.6G        -        -
/mesos/356c5c0f-2ae0-4dfc-9415-d1dbeb172542       -      -   898.4M        -        -
/mesos/3baf4930-4332-4206-91d5-d39ea6bb3389       -      -     3.1G        -        -
/mesos/3d1b9554-911d-44ee-b204-fe622f02ef7a       -      -   845.0M        -        -
/mesos/431aa2a0-11e4-4bf3-b888-ee10cf689326       -      -     1.3G        -        -
/mesos/94f8e3bb-360a-4694-9359-4da10cb4e5df       -      -     1.2G        -        -
/mesos/9d1b3251-6c61-404e-88d0-03319d1a508c       -      -     3.2G        -        -
/mesos/b5bb9133-4093-4bc6-90c1-3656b20559bf       -      -   417.6M        -        -
/mesos/b89095dd-21bc-4255-86c8-14bd7cd0ac2a       -      -     1.5G        -        -
/system.slice                                  1137      -     8.7G        -        -
{code}
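Presumably systemd-cgtop still picks the {{/mesos}} groups up from the controller hierarchies (memory, cpu), even though they no longer appear in systemd's own view. They can also be listed directly; a sketch, again assuming cgroup v1 at {{/sys/fs/cgroup}}:
{code}
# Sketch: the /mesos groups reported by systemd-cgtop above should also be
# visible directly in the memory controller hierarchy (assumes cgroup v1).
$ ls -d /sys/fs/cgroup/memory/mesos/*/
{code}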
The faulty-node output above is from the same node that I used to pull the 
attached mesos-agent.log.

I will try to reproduce the issue by upgrading systemd in another test 
environment and then report back. Newer systemd versions have changed the 
behaviour of {{Delegate=}}, which could indeed be related to the observed issue.
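
For the record, this is the kind of drop-in I plan to experiment with while reproducing. It is only a sketch, not a verified fix, and simply pins {{Delegate=}} on the agent unit shown in the listings above:
{code}
# Sketch of a systemd drop-in to experiment with Delegate= on the agent unit
# (mesos-agent.service as shown above); not a verified fix.
$ sudo mkdir -p /etc/systemd/system/mesos-agent.service.d
$ cat <<'EOF' | sudo tee /etc/systemd/system/mesos-agent.service.d/delegate.conf
[Service]
Delegate=yes
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart mesos-agent.service
{code}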

> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-9174
>                 URL: https://issues.apache.org/jira/browse/MESOS-9174
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.5.0, 1.6.1
>            Reporter: Stephan Erb
>            Priority: Major
>         Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where restarting a Mesos agent 
> sometimes takes down all Mesos containers. The containers die without an 
> apparent cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state 
> of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs. 
> Everything just stops abruptly, as if the container were terminated externally 
> without the executor being notified first. For further details, please see the 
> attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea what I 
> should be looking at to narrow down the issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
