On 5/18/19 5:49 PM, Rich Freeman wrote:
I'd be interested if there are other scripts people have put out there, but I agree that most of the container solutions on Linux are overly-complex.

Here's what I use for some networking, which probably qualifies as extremely lightweight "containers".

Prerequisite:  Create a place for the name spaces to anchor:

   # Create the directories to contain the *NS mount points.
   sudo mkdir -p /run/{mount,net,uts}ns

You can use any path that you want. I do a lot with iproute2's network namespaces (which is where this evolved from), and those use /run/netns/$NetNSname, so I used that as a pattern for the other types of namespaces. Adjust as you want. What I'm doing is interoperable with iproute2's netns command.

Per "container": Create the "container's" mount points:

   # Create the *NS mount points
   sudo touch /run/{mount,net,uts}ns/$ContainerName

Start the actual namespaces:

   # Spawn the lab# NetNSs.
   sudo unshare --mount=/run/mountns/$ContainerName --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName /bin/true

Note: The namespaces don't die when true exits because they are associated with a mount point.
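
A minimal sketch of how one might check that a namespace really is pinned to its mount point: it looks for an nsfs mount at the given path in /proc/self/mountinfo. The function name ns_is_pinned and the optional second argument (an alternate mountinfo file, handy for testing) are assumptions made for illustration, not part of the original setup.

```shell
#!/bin/sh
# ns_is_pinned PATH [MOUNTINFO]
# Succeeds if PATH is the mount point of an nsfs mount, i.e. a namespace
# is being kept alive there. MOUNTINFO defaults to /proc/self/mountinfo.
ns_is_pinned() {
    awk -v p="$1" '
        $5 == p {
            # Field 5 is the mount point; the filesystem type follows
            # the "-" separator that ends the optional fields.
            for (i = 7; i <= NF; i++)
                if ($i == "-" && $(i + 1) == "nsfs") found = 1
        }
        END { exit !found }
    ' "${2:-/proc/self/mountinfo}"
}
```

For example, `ns_is_pinned /run/netns/$ContainerName && echo pinned` should print "pinned" once the unshare command has run, and nothing while the path is still just an empty file.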

Tweak the namespaces:

   # Set the lab# NetNS's hostname.
   sudo nsenter --mount=/run/mountns/$ContainerName --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName /bin/hostname $ContainerName
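
The steps so far could be gathered into one function, roughly as sketched here. RUN is a hypothetical hook that is not in the original commands: RUN=echo prints the commands as a dry run, RUN=sudo runs them through sudo. The comment about mount propagation is an assumption based on the unshare(1) man page and may not apply to every system.

```shell
#!/bin/sh
# Sketch only: the setup steps gathered into one function.
# RUN is a hypothetical hook: RUN=echo for a dry run, RUN=sudo to use sudo.

create_container() {
    name=$1
    mnt=/run/mountns/$name net=/run/netns/$name uts=/run/utsns/$name

    # Anchor directories and per-container mount points.
    $RUN mkdir -p /run/mountns /run/netns /run/utsns
    $RUN touch "$mnt" "$net" "$uts"

    # Assumption: unshare(1) says a --mount=FILE target must live on a
    # mount whose propagation is not shared. If unshare refuses, try:
    #   mount --bind /run/mountns /run/mountns
    #   mount --make-private /run/mountns

    # Create the namespaces; they persist after /bin/true exits.
    $RUN unshare --mount="$mnt" --net="$net" --uts="$uts" /bin/true

    # Give the UTS namespace its own hostname.
    $RUN nsenter --mount="$mnt" --net="$net" --uts="$uts" /bin/hostname "$name"
}
```

Running `RUN=echo create_container lab1` (lab1 being a made-up name) shows the four commands without touching the system.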

I reuse this command, calling different binaries, any time I want to do something in the "container". Calling /bin/bash (et al.) enters the container.

I've created a wrapper script (nsenter.wrapper) that passes the proper parameters to nsenter. I've then symlinked the container name to the nsenter.wrapper script. This means that I can run "$ContainerName $Command" or simply enter the container with $ContainerName. (The script checks the number of parameters and assumes /bin/bash if no command is specified.)
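
The actual nsenter.wrapper isn't shown here, so the following is only a guess at its shape. The function name ns_run and the NSENTER override (useful for a dry run with NSENTER=echo) are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical nsenter.wrapper sketch; not the actual script.

# ns_run NAME [COMMAND...]
# Run COMMAND inside the namespaces anchored under /run/*ns/NAME.
# NSENTER may be overridden (e.g. NSENTER=echo) for a dry run.
ns_run() {
    name=$1
    shift
    # No command given: assume an interactive shell.
    if [ "$#" -eq 0 ]; then
        set -- /bin/bash
    fi
    "${NSENTER:-nsenter}" \
        --mount="/run/mountns/$name" \
        --net="/run/netns/$name" \
        --uts="/run/utsns/$name" \
        "$@"
}

# When installed as a script, the symlink name selects the container:
#   ns_run "$(basename "$0")" "$@"
```

With that last line enabled and a symlink such as `ln -s nsenter.wrapper lab1` (lab1 is a made-up name), `lab1 ip addr` would run ip inside the container and plain `lab1` would drop into a shell.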

I think it's ultimately extremely trivial to have a "container" (glorified collection of name spaces) to do things I want with virtually zero disk space. Ok, ok, maybe 1 or 2 kB for the script & links.

Note: Since I'm using the mount name space, I can have a completely different mount tree inside the "container" than I have outside the container / on the host. I'm not currently doing that, but it's possible to change things as desired.

I personally use nspawn, which is actually pretty minimal, but it depends on systemd, which I'm sure many would argue is overly complex. :) However, if you are running systemd you can basically do a one-liner that requires zero setup to turn a chroot into a container.

As much as I might not like systemd, if you have it, and it reliably does what you want, then I see no reason to /not/ use it. Just acknowledge it as a dependency on your solution, which you have done. So I think we're cool.

On to the original questions about mounts:

In general you can mount stuff in containers without issue. There are two ways to go about it. One is to mount something on the host and bind-mount it into the container, typically at launch time. The other is to give the container the necessary capabilities so that it can do its own mounting (typically containers are not given the necessary capabilities, so mounting will fail even as root inside the container).

Given that one of the uses of containers is security isolation (such as it is), I feel like giving the container the ability to mount things is less than a stellar idea. But to each his / her own.

I believe the reason the wiki says to be careful with mounts has more to do with UID/GID mapping. As you are using nfs this is already an issue you're probably dealing with. You're probably aware that running nfs with multiple hosts with unsynchronized passwd/group files can be tricky, because linux (and unix in general) works with UIDs/GIDs, and not really directly with names,

That's true for NFS v2 and v3. But NFS v4 changes that. NFS v4 actually uses user names & group names and has a daemon that runs on the client & server to translate things as necessary.

so if you're doing something with one UID on one host and with a different UID on another host you might get unexpected permissions behavior.

Yep. You need to do /something/ to account for this, whether that's manually managing UIDs & GIDs across hosts or using something like NFSv4's translation mechanism.

In a nutshell the same thing can happen with containers, or for that matter with chroots.

I mostly agree.  However, user namespaces can nullify this.

I've not dabbled with user namespaces yet, but my understanding is that they can have completely different UIDs & GIDs inside the user namespace than outside of it. It's my understanding that UID 0 / GID 0 inside a user namespace can be mapped to UID 12345 / GID 23456 outside of the user namespace. Refer to nsenter / unshare man pages for more details.
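
To make the mapping concrete: the kernel exposes it in /proc/PID/uid_map and /proc/PID/gid_map as lines of "inside-start outside-start length". A small sketch that translates an inside ID to the outside ID from such lines follows; the function name outside_id is made up for illustration.

```shell
#!/bin/sh
# outside_id ID
# Read uid_map/gid_map style lines ("inside-start outside-start length")
# on stdin and print the outside ID that inside ID maps to, or "unmapped"
# if no range covers it.
outside_id() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 {
            print $2 + (id - $1)
            found = 1
            exit
        }
        END { if (!found) print "unmapped" }
    '
}
```

So `echo "0 12345 1" | outside_id 0` prints 12345, matching the UID 0 to UID 12345 case described above, and the map of a live process can be fed in with `outside_id 0 < /proc/$pid/uid_map`.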

If you have identical passwd/group files it should be a non-issue.

Point of order: The files don't need to be identical. The UIDs & GIDs need to be managed if you aren't using something like user namespaces. So it's perfectly valid to have a text file somewhere that is used to coordinate UIDs & GIDs and then use those in the passwd/shadow and group/gshadow files.
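
One way to check that kind of coordination is to compare two passwd-format files and report names whose UIDs disagree. This is only a sketch; the function name uid_conflicts and the output format are made up for illustration.

```shell
#!/bin/sh
# uid_conflicts FILE_A FILE_B
# Compare two passwd-format files and print every user name that exists
# in both but with a different UID. FILE_A is assumed to be non-empty.
uid_conflicts() {
    awk -F: '
        NR == FNR { uid[$1] = $3; next }     # first file: remember name -> UID
        ($1 in uid) && uid[$1] != $3 {       # second file: same name, other UID
            print $1 ": " uid[$1] " vs " $3
        }
    ' "$1" "$2"
}
```

Something like `uid_conflicts /etc/passwd /path/to/container/etc/passwd` (the second path is a placeholder) prints nothing when the shared names agree. The same approach works for group files, since the GID is also the third field there.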

However, if you want to do mapping with unprivileged containers you have to be careful with mounts as they might not get translated properly. Using completely different UIDs in a container is their suggested solution, which is fine as long as the actual container filesystem isn't shared with anything else.

I conceptually agree. However I think mount namespaces combined with user namespaces muddy the water. Again, refer to the nsenter / unshare man pages and what they refer to.

nsenter has an option for sharing something between mount namespaces. I have no idea what it does, much less how it does it. I suspect that the kernel mounts it once (maybe not visible from anywhere else) and then bind-mounts it to multiple locations for visibility / access.

That tends to be the case anyway when you're using container implementations that do a lot of fancy image management. If you're doing something very minimal and just using a path/chroot on the host as your container then you need to be mindful of your UIDs/GIDs if you go accessing anything from the host directly.

UID & GID management is important.  /Something/ should be doing it.

The other thing I'd be careful with is mounting physical devices in more than one place. Since you're actually sharing a kernel I suspect linux will "do the right thing" if you mount an ext4 on /dev/sda2 on two different containers, but I've never tried it (and again doing that requires giving containers access to even see sda2 because they probably won't see physical devices by default).

Seeing as how the containers are running under the same kernel, there is no actual need for the file system to be mounted multiple times. Instead the kernel would mount it and present it, much like a bind mount, to multiple containers for access.

Think along the lines of opening and working with a file system as a separate process from where it's presented for access. Conceptually not that dissimilar to a hard link that has multiple representations of a file in multiple locations on the same file system. (It's not a perfect analogy, but I hope that makes sense.)

In a VM environment you definitely can't do this, because the VMs are completely isolated at the kernel level and having two different kernels having dirty buffers on the same physical device is going to kill any filesystem that isn't designed to be clustered.

Technically, you can usually get away with doing this, as long as the mounts are read-only. But I STRONGLY suggest that you NOT do this to a non-cluster aware file system.

I have colleagues that supported systems RO mounting an Ext file system this way. It worked okay when it was used as a RO library. The problem was when they made changes from the one system with RW access. They needed to unmount and remount all the RO clients to see the updates. It was not graceful and we advised that they stop doing that. But it did work for their needs. They used it akin to a big (~TB) CD-ROM.

In a container environment the two containers aren't really isolated at the actual physical filesystem level since they share the kernel,

I think mount namespaces muddy this water. Yes, it's the same kernel, but the containers don't have the same file systems exposed to the container.

so I think you'd be fine but I'd really want to test or do some research before relying on it.

Yes, test.

But make sure you have a vague understanding of what's actually happening behind the scenes. I find that tremendously helpful in knowing what can and can't be done, as well as why.

In any case, the more typical solution is to just mount everything on the host and then bind-mount it into the container. So, you could mount the nfs in /mnt and then bind-mount that into your container. There is really no performance hit and it should work fine without giving the container a bunch of capabilities.
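
A rough sketch of that host-side approach for a container that is just a directory tree on the host. RUN is a hypothetical hook (RUN=echo for a dry run, RUN=sudo to use sudo), and the export and container root passed in are placeholders.

```shell
#!/bin/sh
# Sketch: mount NFS once on the host, then bind-mount it into the
# container's tree before the container is started.
# RUN is a hypothetical hook: RUN=echo for a dry run, RUN=sudo to use sudo.

share_into_container() {
    export_src=$1     # NFS export, e.g. server:/export (placeholder)
    root=$2           # container root directory on the host (placeholder)

    # The host does the real mount, so the container needs no extra
    # capabilities.
    $RUN mount -t nfs "$export_src" /mnt

    # Expose the same mount inside the container's tree.
    $RUN mkdir -p "$root/mnt"
    $RUN mount --bind /mnt "$root/mnt"
}
```

`RUN=echo share_into_container server:/export /path/to/container` prints the three commands without running them.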

I think there /is/ a performance hit. It's just so /minimal/ that it's effectively non-existent. Every additional line of code in the path that must be traversed does take CPU cycles.
