[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652677#comment-16652677
 ] 

James Peach commented on MESOS-9319:
------------------------------------

Prototype code looks promising. Currently, /dev is a tmpfs, but in this 
proposal it would be a bind mount from a real filesystem. I'm bind-mounting it 
read-only to prevent disk quota escapes, which seems to work OK.
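
As a rough illustration of the idea (not the prototype itself), the sketch below 
only builds the list of commands that would populate a per-container staging 
directory with device nodes and then bind-mount it read-only over the 
container's /dev. The paths, the function name plan_dev_mount, and the choice 
of devices are assumptions made for this example; the major/minor numbers are 
the standard Linux ones. Nothing is executed, since mknod and mount need root.

```python
import shlex

# Standard Linux character devices as (name, major, minor, mode).
# The numbers are the fixed kernel assignments; which devices a container
# actually needs is an assumption made for this illustration.
DEVICES = [
    ("null", 1, 3, "666"),
    ("zero", 1, 5, "666"),
    ("full", 1, 7, "666"),
    ("random", 1, 8, "666"),
    ("urandom", 1, 9, "666"),
    ("tty", 5, 0, "666"),
]


def plan_dev_mount(staging_dir, rootfs):
    """Return the commands (as argv lists) that would create the device
    nodes in staging_dir and bind-mount it read-only over <rootfs>/dev.
    This only plans the work; it does not run anything."""
    commands = [["mkdir", "-p", staging_dir]]
    for name, major, minor, mode in DEVICES:
        commands.append(
            ["mknod", "-m", mode, f"{staging_dir}/{name}",
             "c", str(major), str(minor)]
        )
    target = f"{rootfs}/dev"
    # A bind mount is made read-only with a second, remounting step.
    commands.append(["mount", "--bind", staging_dir, target])
    commands.append(["mount", "-o", "remount,bind,ro", target])
    return commands


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for argv in plan_dev_mount("/var/lib/mesos/example/dev", "/tmp/rootfs"):
        print(shlex.join(argv))
```

Because the nodes would be created from the host namespace on a regular 
filesystem, the mknod calls would not run against a tmpfs mounted inside the 
container's user namespace.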

> Create all container devices at isolation time.
> -----------------------------------------------
>
>                 Key: MESOS-9319
>                 URL: https://issues.apache.org/jira/browse/MESOS-9319
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: James Peach
>            Assignee: James Peach
>            Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in [this 
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this 
> lxd|https://github.com/lxc/lxd/issues/4950] issue.
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxd issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new behavior in Linux 4.18, the 
> mknod in (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the {{linux/devices}} isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.
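
One way to check for this failure from inside a container is to try opening 
the device and report the errno by name. The helper below is a minimal sketch; 
the name probe_open is made up for this example. It separates a node that 
exists but cannot be opened (the issue reports EPERM) from a node that is 
simply missing (ENOENT).

```python
import errno
import os


def probe_open(path):
    """Try to open path read-only. Return None on success, or the symbolic
    errno name (for example "EPERM" or "ENOENT") on failure."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as e:
        # Fall back to the numeric value if the errno has no known name.
        return errno.errorcode.get(e.errno, str(e.errno))
    os.close(fd)
    return None


if __name__ == "__main__":
    print("/dev/null:", probe_open("/dev/null") or "ok")
```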



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
