[ 
https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589939#comment-16589939
 ] 

Stephan Erb commented on MESOS-9174:
------------------------------------

I have run a few more experiments:

*Broken setup*: Containers get terminated on agent restarts

* Default setup with new systemd: 
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and 
{{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}} 

*Working setups*: Containers survive agent restarts

* Default setup with old systemd:
** systemd 232-25+deb9u4 with options {{Delegate=true}} and 
{{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}} 

* New systemd with disabled cgroup interference:
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and {{KillMode=process}}
** Mesos 1.6.1 with option {{--no-systemd_enable_support}}
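
For reference, the systemd side of that last working setup could look roughly 
like the drop-in below. This is only a sketch: the unit name 
{{mesos-slave.service}} and the drop-in path are assumptions about the 
packaging, and only the two directives are taken from the experiments above.

{code}
# /etc/systemd/system/mesos-slave.service.d/override.conf
# (hypothetical path and unit name, adjust to your packaging)
[Service]
# Hand the agent's cgroup subtree over to the agent itself.
Delegate=true
# On stop/restart, kill only the agent's main process rather than
# every process in the unit's control group.
KillMode=process
{code}

In that setup the agent itself was started with 
{{--no-systemd_enable_support}}.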

As a workaround, we will for now make sure that our fleets run only the older 
systemd version (232-25+deb9u4).

> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-9174
>                 URL: https://issues.apache.org/jira/browse/MESOS-9174
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.5.0, 1.6.1
>            Reporter: Stephan Erb
>            Priority: Major
>         Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos 
> agent takes down all Mesos containers. The containers die without an apparent 
> cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state 
> of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs. 
> Everything just stops directly as if the container gets terminated externally 
> without notifying the executor first. For further details, please see the 
> attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea what I 
> should be looking at to narrow down the issue?



