[ https://issues.apache.org/jira/browse/MESOS-7777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun-Hung Hsiao updated MESOS-7777:
-----------------------------------
    Description: 
Since version 1.12, Docker has used "shared" as its default mount propagation 
to support persistent volume plugins. However, Docker has a known issue 
(https://github.com/moby/moby/issues/25718) in which it sometimes leaks its 
mount namespace to other processes. This can prevent Mesos agents from removing 
Docker containers during recovery. The following log shows such a failure:
{noformat}
I0615 09:39:11.083787  4573 docker.cpp:1002] Skipping recovery of executor 'kafka__7e49099d-7ab4-4435-a94a-1e849b8f2b70' of framework 44cbe3e9-984d-4073-b523-0023b427f54d-0011 because its executor is not marked as docker and the docker container doesn't exist
Failed to perform recovery: Collect failed: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock rm -v 2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d': exited with status 1; stderr='Error response from daemon: Driver overlay failed to remove root filesystem 2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d: remove /var/lib/docker/overlay/221725ec545d60492b5431bb49380d868f7a949aaa3acff49f7ffb5bddeb3385/merged: device or resource busy
'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
{noformat}
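One way to confirm that a leaked mount is what keeps the overlay directory busy is to look for processes whose mount table still lists the container's "merged" directory. The Python sketch below is an illustrative diagnostic only, not part of Mesos or Docker. The function name find_mount_holders is hypothetical, and the approach assumes the leak shows up as the overlay mount point appearing in /proc/<pid>/mountinfo of some process other than the Docker daemon.

```python
import glob
import os


def find_mount_holders(path_fragment, proc_root="/proc"):
    """Return sorted (pid, mount_point) pairs for every process whose
    /proc/<pid>/mountinfo lists a mount point containing path_fragment.

    Hypothetical diagnostic helper; it is not part of Mesos or Docker.
    """
    holders = []
    pattern = os.path.join(proc_root, "[0-9]*", "mountinfo")
    for mountinfo in glob.glob(pattern):
        pid_dir = os.path.basename(os.path.dirname(mountinfo))
        if not pid_dir.isdigit():
            continue  # Skip entries that start with a digit but are not PIDs.
        try:
            with open(mountinfo) as f:
                lines = f.readlines()
        except OSError:
            # The process may have exited, or we may lack permission to read it.
            continue
        for line in lines:
            fields = line.split()
            # In mountinfo, the fifth field (index 4) is the mount point
            # relative to the process's root directory.
            if len(fields) > 4 and path_fragment in fields[4]:
                holders.append((int(pid_dir), fields[4]))
    return sorted(holders)
```

For the failure above, calling find_mount_holders("221725ec545d60492b5431bb49380d868f7a949aaa3acff49f7ffb5bddeb3385") would list the PIDs whose mount tables still contain that overlay's merged directory. Some mountinfo files may be unreadable without root privileges; the sketch silently skips those.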


  was:Since version 1.12, Docker has used "shared" as its default mount 
propagation to support persistent volume plugins. However, Docker has a known 
issue (https://github.com/moby/moby/issues/25718) in which it sometimes leaks 
its mount namespace to other processes. This can prevent Mesos agents from 
removing Docker containers during recovery.


> Agent failed to recover due to mount namespace leakage in Docker 1.12/1.13
> --------------------------------------------------------------------------
>
>                 Key: MESOS-7777
>                 URL: https://issues.apache.org/jira/browse/MESOS-7777
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>             Fix For: 1.4.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
