[ https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589939#comment-16589939 ]
Stephan Erb commented on MESOS-9174:
------------------------------------

I have run a few more experiments:

*Broken setup*: Containers get terminated on agent restarts
* Default setup with new systemd:
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and {{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}}

*Working setups*: Containers survive agent restarts
* Default setup with old systemd:
** systemd 232-25+deb9u4 with options {{Delegate=true}} and {{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}}
* New systemd with disabled cgroup interference:
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and {{KillMode=process}}
** Mesos 1.6.1 with option {{--no-systemd_enable_support}}

For now, as a workaround, we will make sure that we only run the older systemd version across our fleets.

> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---------------------------------------------------------------------------
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.5.0, 1.6.1
> Reporter: Stephan Erb
> Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos
> agent takes down all Mesos containers.
> The containers die without an apparent cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs.
> Everything just stops abruptly, as if the container were terminated
> externally without the executor being notified first. For further details,
> please see the attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea what I
> should be looking at to narrow down the issue?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
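
To check how many containers an agent restart took down, the log pattern quoted above can be searched for mechanically. The following is only a minimal Python sketch, not part of Mesos: the function name {{containers_lost_during_recovery}} is hypothetical, and it assumes that each log entry sits on a single (unwrapped) line and that the message wording matches the excerpt from Mesos 1.6.1 shown in this issue.

```python
import re

# Container IDs in the excerpt are UUIDs.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# Message wording taken from the log excerpt in this issue (assumed stable).
RECOVERED_RE = re.compile(rf"Recovered container\s+({UUID})")
DESTROYING_RE = re.compile(rf"container ({UUID}) from RUNNING to DESTROYING")


def containers_lost_during_recovery(lines):
    """Return IDs of containers that were recovered and later moved
    from RUNNING to DESTROYING, in order of first occurrence."""
    recovered = set()
    lost = []
    for line in lines:
        match = RECOVERED_RE.search(line)
        if match:
            recovered.add(match.group(1))
            continue
        match = DESTROYING_RE.search(line)
        if match:
            container_id = match.group(1)
            # Only report containers the agent had recovered earlier in this log.
            if container_id in recovered and container_id not in lost:
                lost.append(container_id)
    return lost
```

It accepts any iterable of lines, so an open file handle on the attached {{mesos-agent.log}} can be passed in directly. The sketch only counts affected containers; it does not explain why they exited.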