On 5/18/19 5:49 PM, Rich Freeman wrote:
I'd be interested if there are other scripts people have put out
there, but I agree that most of the container solutions on Linux
are overly-complex.
Here's what I use for some networking, which probably qualifies as
extremely lightweight "containers".
Prerequisite: Create a place for the name spaces to anchor:
# Create the directories to contain the *NS mount points.
sudo mkdir -p /run/{mount,net,uts}ns
You can use any path that you want. I do a lot with iproute2's
network namespaces (which is where this evolved from), which use
/run/netns/$NetNSname. So I used that as a pattern for the other types
of namespaces. Adjust as you want. What I'm doing is interoperable
with iproute2's netns command.
Per "container": Create the "container's" mount points:
# Create the *NS mount points
sudo touch /run/{mount,net,uts}ns/$ContainerName
Start the actual namespaces:
# Spawn the lab# NetNSs.
unshare --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName \
    --uts=/run/utsns/$ContainerName /bin/true
Note: The namespaces don't die when true exits because they are
associated with a mount point.
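One way to check that the namespaces are still anchored after /bin/true has exited is to look for nsfs entries in the mount table. The following is a minimal sketch, assuming the namespaces were bind-mounted by unshare as above; list_ns_mounts is a hypothetical helper name, and it only parses the documented /proc/PID/mountinfo layout (mount point in field 5, file system type right after the "-" separator).

```shell
#!/bin/bash
# Hypothetical helper: list namespace anchor points found in a
# mountinfo-style file (defaults to the current process's view).
list_ns_mounts() {
    awk '{
        # Optional fields start at field 7 and end at a lone "-".
        for (i = 7; i <= NF; i++) {
            if ($i == "-") {
                # The field after "-" is the file system type.
                if ($(i + 1) == "nsfs")
                    print $5, $4    # mount point, namespace identifier
                break
            }
        }
    }' "${1:-/proc/self/mountinfo}"
}
```

Running list_ns_mounts with no argument on the host should print one line per anchored namespace, such as the /run/netns/$ContainerName path next to its net:[...] identifier. findmnt -t nsfs should give a similar view if your util-linux is recent enough.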
Tweak the namespaces:
# Set the lab# NetNS's hostname.
nsenter --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName \
    --uts=/run/utsns/$ContainerName \
    /bin/hostname $ContainerName
I reuse this command calling different binaries any time I want to do
something in the "container". Calling /bin/bash (et al.) enters the
container.
I've created a wrapper script (nsenter.wrapper) that passes the proper
parameters to nsenter. I've then symlinked the container name to the
nsenter.wrapper script. This means that I can run "$ContainerName
$Command" or simply enter the container with $ContainerName. (The
script checks the number of parameters and assumes /bin/bash if no
command is specified.)
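For illustration, here is a minimal sketch of what such a wrapper could look like. This is not the actual nsenter.wrapper script, only a guess at the behavior described above. The ns_wrapper function name and the NSENTER override variable are assumptions added for the example; the /run/{mount,net,uts}ns paths are the ones used earlier.

```shell
#!/bin/bash
# Hypothetical sketch of an nsenter wrapper; not the original script.
# NSENTER can be overridden (e.g. NSENTER=echo) to preview the command.
ns_wrapper() {
    local name=$1
    shift

    # No command given: default to an interactive shell.
    if [ "$#" -eq 0 ]; then
        set -- /bin/bash
    fi

    "${NSENTER:-nsenter}" \
        --mount="/run/mountns/$name" \
        --net="/run/netns/$name" \
        --uts="/run/utsns/$name" \
        "$@"
}

# When installed as a script and symlinked to the container name, the
# name would come from how the script was invoked. Left commented out
# here so the function can be sourced and tried safely:
# ns_wrapper "$(basename "$0")" "$@"
```

With the last line enabled and a symlink named lab1 pointing at the script, "lab1 ip addr" would run ip addr inside the namespaces and a bare "lab1" would drop into a shell. Entering these namespaces normally requires root.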
I think it's ultimately extremely trivial to have a "container"
(glorified collection of name spaces) to do things I want with virtually
zero disk space. Ok, ok, maybe 1 or 2 kB for the script & links.
Note: Since I'm using the mount name space, I can have a completely
different mount tree inside the "container" than I have outside the
container / on the host. I'm not currently doing that, but it's
possible to change things as desired.
I personally use nspawn, which is actually pretty minimal, but it
depends on systemd, which I'm sure many would argue is overly complex.
:) However, if you are running systemd you can basically do a
one-liner that requires zero setup to turn a chroot into a container.
As much as I might not like systemd, if you have it, and it reliably
does what you want, then I see no reason to /not/ use it. Just
acknowledge it as a dependency on your solution, which you have done.
So I think we're cool.
On to the original questions about mounts:
In general you can mount stuff in containers without issue. There are
two ways to go about it. One is to mount something on the host and
bind-mount it into the container, typically at launch time. The other
is to give the container the necessary capabilities so that it can
do its own mounting (typically containers are not given the necessary
capabilities, so mounting will fail even as root inside the container).
Given that one of the uses of containers is security isolation (such as
it is), I feel like giving the container the ability to mount things is
less than a stellar idea. But to each his / her own.
I believe the reason the wiki says to be careful with mounts has more
to do with UID/GID mapping. As you are using nfs this is already an
issue you're probably dealing with. You're probably aware that running
nfs with multiple hosts with unsynchronized passwd/group files can
be tricky, because linux (and unix in general) works with UIDs/GIDs,
and not really directly with names,
That's true for NFS v1-3. But NFS v4 changes that. NFS v4 actually
uses user names & group names and has a daemon that runs on the client &
server to translate things as necessary.
so if you're doing something with one UID on one host and with a
different UID on another host you might get unexpected permissions
behavior.
Yep. You need to do /something/ to account for this. Be it manually
managing UIDs & GIDs across things, or using something like NFSv4's
synchronization mechanism.
In a nutshell the same thing can happen with containers, or for
that matter with chroots.
I mostly agree. However, user namespaces can nullify this.
I've not dabbled with user namespaces yet, but my understanding is that
they can have completely different UIDs & GIDs inside the user namespace
than outside of it. It's my understanding that UID 0 / GID 0 inside a
user namespace can be mapped to UID 12345 / GID 23456 outside of the
user namespace. Refer to nsenter / unshare man pages for more details.
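To make the mapping idea concrete: the kernel exposes a user namespace's mapping in /proc/PID/uid_map as lines of three numbers (first ID inside the namespace, first ID outside, length of the range). The following is a small sketch, assuming that format; map_uid is a hypothetical helper that only reads such a file and translates an inside UID to the outside UID. It does not create a namespace itself.

```shell
#!/bin/bash
# Hypothetical helper: translate a UID as seen inside a user namespace
# to the UID seen outside, using a uid_map-style file.
# Each line: <first-inside-id> <first-outside-id> <range-length>
map_uid() {
    local inside=$1
    local map=${2:-/proc/self/uid_map}

    awk -v id="$inside" '
        (id + 0) >= $1 && (id + 0) < $1 + $3 {
            print $2 + (id - $1)
            found = 1
            exit
        }
        # No range covers this ID: report failure via the exit status.
        END { if (!found) exit 1 }
    ' "$map"
}
```

With a map file containing "0 12345 1", map_uid 0 prints 12345, which is the UID 0 inside / UID 12345 outside case described above. For a quick experiment without writing maps by hand, unshare --user --map-root-user maps the calling user to root inside a new user namespace.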
If you have identical passwd/group files it should be a non-issue.
Point of order: The files don't need to be identical. The UIDs & GIDs
need to be managed if you aren't using something like user namespaces.
So it's perfectly valid to have a text file that is used to coordinate
UIDs & GIDs somewhere and then use those in the passwd/shadow and
group/gshadow files.
However, if you want to do mapping with unprivileged containers
you have to be careful with mounts as they might not get translated
properly. Using completely different UIDs in a container is their
suggested solution, which is fine as long as the actual container
filesystem isn't shared with anything else.
I conceptually agree. However I think mount namespaces combined with
user namespaces muddy the water. Again, refer to the nsenter / unshare
man pages and what they refer to.
nsenter has an option for sharing something between mount namespaces. I
have no idea what it does, much less how it does it. I suspect that the
kernel mounts it once (maybe not visible from anywhere else) and then
bind-mounts it to multiple locations for visibility / access.
That tends to be the case anyway when you're using container
implementations that do a lot of fancy image management. If you're
doing something very minimal and just using a path/chroot on the host
as your container then you need to be mindful of your UIDs/GIDs if
you go accessing anything from the host directly.
UID & GID management is important. /Something/ should be doing it.
The other thing I'd be careful with is mounting physical devices in
more than one place. Since you're actually sharing a kernel I suspect
linux will "do the right thing" if you mount an ext4 on /dev/sda2 on
two different containers, but I've never tried it (and again doing
that requires giving containers access to even see sda2 because they
probably won't see physical devices by default).
Seeing as how the containers are running under the same kernel, there is
no actual need for the file system to be mounted multiple times.
Instead the kernel would mount it and present it, much like a bind
mount, to multiple containers for access.
Think along the lines of opening and working with a file system as a
separate process from where it's presented for access. Conceptually not
that dissimilar to a hard link that has multiple representations of a
file in multiple locations on the same file system. (It's not a perfect
analogy, but I hope that makes sense.)
In a VM environment you definitely can't do this, because the VMs
are completely isolated at the kernel level and having two different
kernels having dirty buffers on the same physical device is going
to kill any filesystem that isn't designed to be clustered.
Technically, you can usually get away with doing this. But the mounts
need to be read-only. But I STRONGLY suggest that you NOT do this to a
non-cluster aware file system.
I have colleagues who supported systems that mounted an ext file system
read-only this way. It worked okay when it was used as a RO library. The
problem was when they made changes from the one system with RW access.
They needed to unmount and remount all the RO clients to see the updates.
It was not graceful and we advised that they stop doing that. But it did
work for their needs. They used it akin to a big (~TB) CD-ROM.
In a container environment the two containers aren't really isolated
at the actual physical filesystem level since they share the kernel,
I think mount namespaces muddy this water. Yes, it's the same kernel,
but the containers don't have the same file systems exposed to the
container.
so I think you'd be fine but I'd really want to test or do some
research before relying on it.
Yes, test.
But make sure you have a vague understanding of what's actually
happening behind the scenes. I find that tremendously helpful in
knowing what can and can't be done, as well as why.
In any case, the more typical solution is to just mount everything on
the host and then bind-mount it into the container. So, you could
mount the nfs in /mnt and then bind-mount that into your container.
There is really no performance hit and it should work fine without
giving the container a bunch of capabilities.
I think there /is/ a performance hit. It's just so /minimal/ that it's
effectively non-existent. Every additional line of code in the path
that must be traversed does take CPU cycles.
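The mount-on-the-host-then-bind-mount approach discussed above can be sketched as follows for a container that is just a directory tree on the host (chroot style). This is a minimal sketch under stated assumptions: bind_into_container is a hypothetical helper, /srv/lab1 in the usage note is a made-up container root, and the RUN prefix variable is an addition so the commands can be previewed instead of executed.

```shell
#!/bin/bash
# Hypothetical helper: bind-mount a host directory into a container's
# root directory. RUN defaults to sudo; set RUN=echo for a dry run.
bind_into_container() {
    local src=$1             # directory already mounted on the host
    local root=$2            # container root directory on the host
    local dst=${3:-$1}       # path inside the container (default: same)

    # Create the target directory, then bind the host path onto it.
    ${RUN:-sudo} mkdir -p "$root$dst" &&
        ${RUN:-sudo} mount --bind "$src" "$root$dst"
}
```

For example, after mounting the NFS export on the host at /mnt/nfs, "RUN=echo bind_into_container /mnt/nfs /srv/lab1" prints the mkdir and mount --bind commands that would make it visible at /mnt/nfs inside the container. If you use nspawn instead, its --bind= option does the equivalent at launch time.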