[
https://issues.apache.org/jira/browse/MESOS-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264874#comment-15264874
]
Jie Yu commented on MESOS-5239:
-------------------------------
The following patch allows the filesystem/linux isolator to skip the bind mount
for the agent's work_dir if possible:
https://reviews.apache.org/r/46858/
The above patch will solve this problem on CentOS 7, Ubuntu 16.04, and CoreOS,
where mounts are 'shared' by default.
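To check whether a given mount is 'shared' on a host, one option is to read the
optional fields in /proc/self/mountinfo. The following is a minimal sketch, not
part of the patch: the helper name propagation_of is hypothetical, and it
assumes the mount point is passed exactly as it appears in mountinfo.

```shell
# Hypothetical helper (not part of Mesos): read mountinfo lines on stdin and
# print the propagation tags of the given mount point, e.g. "shared:22" or
# "shared:30 master:22". Prints "private" when the mount has no optional fields.
propagation_of() {
  awk -v mp="$1" '
    $5 == mp {
      tags = ""
      # Optional fields start at field 7 and end at the "-" separator.
      for (i = 7; i <= NF && $i != "-"; i++)
        tags = tags (tags ? " " : "") $i
      print (tags ? tags : "private")
    }'
}

# Usage (on a live system):
#   propagation_of /run < /proc/self/mountinfo
```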
> Persistent volume DockerContainerizer support assumes proper mount
> propagation setup on the host.
> -------------------------------------------------------------------------------------------------
>
> Key: MESOS-5239
> URL: https://issues.apache.org/jira/browse/MESOS-5239
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 0.28.0, 0.28.1
> Reporter: Jie Yu
> Assignee: Jie Yu
> Labels: mesosphere
> Fix For: 0.29.0, 0.28.2
>
>
> We recently added persistent volume support in DockerContainerizer
> (MESOS-3413). To understand the problem, we first need to understand how
> persistent volumes are supported in DockerContainerizer.
> To support persistent volumes in DockerContainerizer, we bind mount
> persistent volumes under a container's sandbox ('container_path' has to be
> relative for persistent volumes). When the Docker container is launched,
> since we always add a volume (-v) for the sandbox, the persistent volumes
> will be bind mounted into the container as well (since Docker does a 'rbind').
> For the above to work, the Docker daemon must see the persistent volume
> mounts that Mesos creates in the host mount table.
> This is not a problem if the Docker daemon itself uses the host mount
> namespace. However, on systemd-enabled systems, the Docker daemon runs in a
> separate mount namespace, and all mounts in that mount namespace are marked
> as slave mounts due to this
> [patch|https://github.com/docker/docker/commit/eb76cb2301fc883941bc4ca2d9ebc3a486ab8e0a].
> In other words, for this to work, the parent mount of the agent's work_dir
> must be a shared mount when the Docker daemon starts. This is typically true
> on CentOS 7 and CoreOS, where all mounts are shared by default.
> However, this causes an issue with the 'filesystem/linux' isolator. To
> understand why, consider a typical problem that arises with shared mounts,
> illustrated by the following commands on a CentOS 7 machine:
> {noformat}
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> [root@core-dev run]# mkdir /run/netns
> [root@core-dev run]# mount --bind /run/netns /run/netns
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> [root@core-dev run]# ip netns add test
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> 163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> {noformat}
> As shown above, the mount table contains two entries for /run/netns/test,
> which is unexpected and can confuse some systems. The reason is that when we
> create a self bind mount (/run/netns -> /run/netns), the new mount is put
> into the same shared mount peer group (shared:22) as its parent (/run).
> Then, when another mount is created underneath it (/run/netns/test), that
> mount operation is propagated to all mounts in the same peer group
> (shared:22), resulting in an unexpected additional mount.
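One way to spot this symptom is to look for mount points listed more than once
in mountinfo. The following is a rough sketch only: duplicate_mount_points is a
hypothetical helper name, and a repeated mount point can also be a deliberate
stacked mount, so a hit is a hint rather than proof of this pitfall.

```shell
# Hypothetical helper: read mountinfo lines on stdin and print every mount
# point (field 5) that appears more than once, in sorted order.
duplicate_mount_points() {
  awk '{ seen[$5]++ }
       END { for (mp in seen) if (seen[mp] > 1) print mp }' | sort
}

# Usage (on a live system):
#   duplicate_mount_points < /proc/self/mountinfo
```

Fed the last mountinfo listing above, this prints /run/netns/test.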
> The reason we need a self bind mount in Mesos is that we sometimes need to
> make sure certain mounts are shared so that they do not get copied when a
> new mount namespace is created. However, on some systems, mounts are
> private by default (e.g., Ubuntu 14.04). In those cases, since we cannot
> change the system mounts, we have to do a self bind mount so that we can set
> its mount propagation to shared. For instance, in the filesystem/linux
> isolator, we do a self bind mount on the agent's work_dir.
> To avoid the self bind mount pitfall described above, the filesystem/linux
> isolator does a make-slave + make-shared after creating the mount, so that
> the mount is in its own shared mount peer group. That way, any mounts
> created underneath it are not propagated back.
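Expressed as mount(8) commands, the sequence described above might look roughly
like the sketch below. It only illustrates the steps and is not the isolator's
actual code; the helper names run and self_bind_own_peer_group and the APPLY
switch are assumptions made for this example.

```shell
# Print each command by default; set APPLY=1 (as root) to really execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Hypothetical helper: self bind mount a directory, then move the new mount
# out of its parent's peer group and into a peer group of its own.
self_bind_own_peer_group() {
  run mount --bind "$1" "$1"     # self bind mount; joins the parent's peer group if the parent is shared
  run mount --make-slave "$1"    # leave that peer group, keep receiving from it
  run mount --make-shared "$1"   # become shared again, in a new peer group
}

# Usage (dry run):
#   self_bind_own_peer_group /path/to/work_dir
```

On a host where the parent mount is shared, the mountinfo entry for the
directory would afterwards be expected to carry both a new shared:N tag and a
master:M tag pointing at the parent's peer group.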
> However, that operation breaks the assumption that the persistent volume
> support in DockerContainerizer relies on. As a result, we are seeing problems
> with persistent volumes in DockerContainerizer when the filesystem/linux
> isolator is turned on.