[ 
https://issues.apache.org/jira/browse/MESOS-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264874#comment-15264874
 ] 

Jie Yu commented on MESOS-5239:
-------------------------------

The following patch allows the filesystem/linux isolator to skip the bind mount 
for the agent's work_dir if possible:
https://reviews.apache.org/r/46858/

The above patch will solve this problem on CentOS 7, Ubuntu 16.04, and CoreOS, 
where mounts are 'shared' by default.
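
For illustration only, here is a minimal Python sketch of how one might check 
whether the mount containing a directory is already shared, which is the kind 
of condition under which a self bind mount could be skipped. It is not the 
code from the review above. The helper names (parse_mountinfo, 
containing_mount, is_shared) and the second sample entry (/var/lib/mesos) are 
hypothetical; the first sample entry is taken from the issue description.

```python
def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # A valid line has at least 10 fields and a "-" separator that
        # follows the (possibly empty) list of optional fields.
        if len(fields) < 10 or "-" not in fields[6:]:
            continue
        sep = fields.index("-", 6)
        entries.append({
            "id": int(fields[0]),
            "parent": int(fields[1]),
            "mount_point": fields[4],
            "optional": fields[6:sep],  # e.g. ["shared:22"] or []
        })
    return entries


def containing_mount(entries, path):
    """Return the entry whose mount point is the longest prefix of path."""
    best = None
    for entry in entries:
        mp = entry["mount_point"]
        covers = path == mp or mp == "/" or path.startswith(mp.rstrip("/") + "/")
        # ">=" prefers later entries, which are stacked on top of earlier ones.
        if covers and (best is None or len(mp) >= len(best["mount_point"])):
            best = entry
    return best


def is_shared(entry):
    """True if the mount carries a 'shared:N' propagation tag."""
    return any(field.startswith("shared:") for field in entry["optional"])


# First line: from the issue description. Second line: hypothetical private mount.
SAMPLE = """\
24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
70 60 8:1 / /var/lib/mesos rw,relatime - ext4 /dev/sda1 rw
"""

entries = parse_mountinfo(SAMPLE)
run_mount = containing_mount(entries, "/run/netns")
work_dir_mount = containing_mount(entries, "/var/lib/mesos/slaves")
```

With this sample, the mount containing /run/netns is shared, while the 
hypothetical /var/lib/mesos mount is private and would still need a self bind 
mount before it could be marked shared.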

> Persistent volume DockerContainerizer support assumes proper mount 
> propagation setup on the host.
> -------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-5239
>                 URL: https://issues.apache.org/jira/browse/MESOS-5239
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Jie Yu
>              Labels: mesosphere
>             Fix For: 0.29.0, 0.28.2
>
>
> We recently added persistent volume support in DockerContainerizer 
> (MESOS-3413). To understand the problem, we first need to understand how 
> persistent volumes are supported in DockerContainerizer.
> To support persistent volumes in DockerContainerizer, we bind mount 
> persistent volumes under a container's sandbox ('container_path' has to be 
> relative for persistent volumes). When the Docker container is launched, 
> since we always add a volume (-v) for the sandbox, the persistent volumes 
> will be bind mounted into the container as well (since Docker does a 'rbind').
> For the above to work, the Docker daemon must see the persistent volume 
> mounts that Mesos creates in the host mount table. This is not a problem if 
> the Docker daemon itself uses the host mount namespace. However, on 
> systemd-enabled systems, the Docker daemon runs in a separate mount 
> namespace, and all mounts in that namespace are marked as slave mounts due 
> to this 
> [patch|https://github.com/docker/docker/commit/eb76cb2301fc883941bc4ca2d9ebc3a486ab8e0a].
> This means that, for persistent volumes to work, the parent mount of the 
> agent's work_dir must be a shared mount when the Docker daemon starts. This 
> is typically true on CentOS 7 and CoreOS, where all mounts are shared by 
> default.
> However, this causes an issue with the 'filesystem/linux' isolator. To 
> understand why, first consider a typical problem with shared mounts, shown 
> by the following commands on a CentOS 7 machine:
> {noformat}
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> [root@core-dev run]# mkdir /run/netns
> [root@core-dev run]# mount --bind /run/netns /run/netns
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> [root@core-dev run]# ip netns add test
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> 163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> {noformat}
> As you can see above, there are two entries for /run/netns/test in the 
> mount table, which is unexpected and can confuse some systems. The reason is 
> that when we create a self bind mount (/run/netns -> /run/netns), the mount 
> is put into the same shared mount peer group (shared:22) as its parent 
> (/run). Then, when another mount is created underneath it 
> (/run/netns/test), that mount operation is propagated to all mounts in the 
> same peer group (shared:22), resulting in an unexpected additional mount.
> The reason we need to do a self bind mount in Mesos is that sometimes we 
> need to make sure some mounts are shared so that they do not get copied when 
> a new mount namespace is created. However, on some systems, mounts are 
> private by default (e.g., Ubuntu 14.04). In those cases, since we cannot 
> change the system mounts, we have to do a self bind mount so that we can set 
> mount propagation to shared. For instance, in the filesystem/linux isolator, 
> we do a self bind mount on the agent's work_dir.
> To avoid the self bind mount pitfall mentioned above, in the 
> filesystem/linux isolator, after we create the mount, we do a make-slave + 
> make-shared so that the mount is in its own shared mount peer group. That 
> way, mounts created underneath it are not propagated back to the parent's 
> peer group.
> However, that operation breaks the assumption that the persistent volume 
> support in DockerContainerizer relies on. As a result, we are seeing 
> problems with persistent volumes in DockerContainerizer when the 
> filesystem/linux isolator is turned on.
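
The duplicate entry in the quoted mount table can also be spotted 
programmatically. The following is a minimal Python sketch, not part of Mesos, 
that groups mountinfo entries by mount point and by shared peer group. The 
function name summarize is hypothetical; the input is the final mountinfo 
output from the description, with the wrapped lines joined.

```python
from collections import defaultdict

# Final /proc/self/mountinfo output from the issue description.
MOUNTINFO = """\
24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
"""


def summarize(text):
    """Return (duplicate mount points, peer groups) as dicts of mount IDs."""
    by_target = defaultdict(list)
    by_peer_group = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines; "-" ends the optional fields.
        if len(fields) < 10 or "-" not in fields[6:]:
            continue
        sep = fields.index("-", 6)
        mount_id = int(fields[0])
        by_target[fields[4]].append(mount_id)
        for opt in fields[6:sep]:
            if opt.startswith("shared:"):
                group = int(opt.split(":", 1)[1])
                by_peer_group[group].append(mount_id)
    duplicates = {t: ids for t, ids in by_target.items() if len(ids) > 1}
    return duplicates, dict(by_peer_group)


duplicates, peer_groups = summarize(MOUNTINFO)
```

Here duplicates reports /run/netns/test twice (mount IDs 162 and 163), and 
peer_groups shows that /run (24) and the self bind mount /run/netns (121) sit 
in the same peer group 22, which is why the new mount was propagated.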



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
