[
https://issues.apache.org/jira/browse/MESOS-7777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chun-Hung Hsiao updated MESOS-7777:
-----------------------------------
Description:
Since 1.12, Docker has used "shared" as its default mount propagation to enable
persistent volume plugins. However, Docker has a known issue
(https://github.com/moby/moby/issues/25718) in which it sometimes leaks its
mount namespace to other processes, which can cause Mesos agents to fail to
remove Docker containers during recovery. The following logs show such a
failure:
{noformat}
I0615 09:39:11.083787 4573 docker.cpp:1002] Skipping recovery of executor
'kafka__7e49099d-7ab4-4435-a94a-1e849b8f2b70' of framework
44cbe3e9-984d-4073-b523-0023b427f54d-0011 because its executor is not marked as
docker and the docker container doesn't exist
Failed to perform recovery: Collect failed: Collect failed: Failed to run
'docker -H unix:///var/run/docker.sock rm -v
2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d': exited with
status 1; stderr='Error response from daemon: Driver overlay failed to remove
root filesystem
2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d: remove
/var/lib/docker/overlay/221725ec545d60492b5431bb49380d868f7a949aaa3acff49f7ffb5bddeb3385/merged:
device or resource busy
'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
{noformat}
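One way to check for the leak described above is to look for processes whose
mount namespace still lists the container's overlay mount. The sketch below is
only an illustration and is not part of Mesos or Docker: the helper names
mount_points and pids_holding are hypothetical, and it assumes a Linux host
where /proc/<pid>/mountinfo is readable (typically as root).

```python
import glob


def mount_points(mountinfo_text):
    """Return the mount points listed in the text of a mountinfo file."""
    points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Per proc(5), the fifth field (index 4) is the mount point.
        if len(fields) > 4:
            points.append(fields[4])
    return points


def pids_holding(path, proc_root="/proc"):
    """Return PIDs whose mount namespace still contains a mount at `path`."""
    holders = []
    for info in glob.glob(f"{proc_root}/[0-9]*/mountinfo"):
        try:
            with open(info) as f:
                text = f.read()
        except OSError:
            # The process exited or we lack permission; skip it.
            continue
        if path in mount_points(text):
            # The PID is the directory name just above "mountinfo".
            holders.append(int(info.split("/")[-2]))
    return sorted(holders)
```

Calling pids_holding() with the ".../merged" path from the error message would
list the processes that may be keeping the mount busy. Mount points containing
spaces are octal-escaped in mountinfo, which this sketch does not decode.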
was:Docker changed its default mount propagation to "shared" since 1.12 to
enable persistent volume plugins. However, Docker has a known issue
(https://github.com/moby/moby/issues/25718) that it sometimes leaks its mount
namespace to other processes, which could make Mesos agents fail to remove
Docker containers during recovery.
> Agent failed to recover due to mount namespace leakage in Docker 1.12/1.13
> --------------------------------------------------------------------------
>
> Key: MESOS-7777
> URL: https://issues.apache.org/jira/browse/MESOS-7777
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Reporter: Chun-Hung Hsiao
> Assignee: Chun-Hung Hsiao
> Fix For: 1.4.0