On Tue, 23 Aug 2022 at 11:35:22 +0100, Simon McVittie wrote:
> The failure is that in test-unshare (which tests the unshare backend),
> several test-cases run autopkgtest against a mock package in which a
> test script "debian/tests/$test_name" emits known strings on its stdout,
> and assert that autopkgtest's stdout contains those strings.
> In fact [the required string] does not always appear, causing
> a test failure.
The best theory I have so far is that when autopkgtest runs a
dynamically-generated shell script (!) in the testbed, one of the things
it does is (simplifying a bit):

    touch /scratch/mytest-stdout /scratch/mytest-stderr
    debian/tests/mytest \
        2> >(tee -a /scratch/mytest-stderr >&2) \
        > >(tee -a /scratch/mytest-stdout)

After mytest exits, we have a race condition. If we're lucky, which in
practice most of the time we are, then the tee processes get scheduled
before this shell script exits, read the remaining output from mytest,
write it to their respective files and to the stdout/stderr of the
"auxverb" (which end up being autopkgtest's own stdout/stderr), and exit.
However, if we're unlucky, then that script exits as a result of
mytest exiting, and the su or bash command that is running the script
also exits. That su or bash command is pid 1 in the container (because
we're running in a new pid namespace due to unshare --pid in the last
line of lib/unshare-helper), so when it exits, the kernel unceremoniously
kills every process that remains inside the container with the equivalent
of kill -9, notably including the two tee processes. Anything buffered
inside those tee processes or in the pipes used for the process
substitutions is lost.
The lxc, lxd and Docker/Podman container backends do not have this
problem, because they either run a full init system as pid 1 (virt-lxc,
virt-lxd, virt-podman --init) or run a dummy 'sleep infinity' process as
pid 1 (virt-docker, virt-podman). Their containers are therefore not shut
down until we are ready to shut them down, which is not until after the
test itself has finished, with enough arbitrary delays that we do not lose
the race in practice. What unshare does differently, and what makes it
more likely to lose this race, is that it repeatedly creates and destroys
containers (in the sense of a collection of linked namespaces) for the
same root filesystem, once per test-case.
Does that seem plausible?
If this theory is correct, then the ideal thing to do here would be to
run the two tee processes as explicitly background processes (with the &
operator), and then after debian/tests/mytest exits, explicitly wait(1)
for them. However, debian/tests/mytest might have leaked background
processes that still have the test's stdout/stderr open for writing.
If it has, then a wait(1) for the two tee processes will wait forever,
unless we explicitly iterate through processes that (might) have the fd
open and kill them (similar to the rather unpleasant code in virt-lxc
and virt-lxd).
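
A minimal sketch of the "background the tee processes, then wait for
them" idea, in POSIX sh. This is not autopkgtest's actual code: the
FIFOs, the temporary scratch directory and the mytest function are
hypothetical stand-ins for /scratch and debian/tests/mytest, and POSIX
sh has no process substitution, so named pipes are used instead.

```shell
#!/bin/sh
# Hypothetical sketch: run tee as explicit background jobs fed by FIFOs,
# then wait for them once the test has exited.
set -eu

scratch=$(mktemp -d)    # stand-in for /scratch in the testbed
mkfifo "$scratch/out.fifo" "$scratch/err.fifo"
touch "$scratch/mytest-stdout" "$scratch/mytest-stderr"

# Each tee copies its FIFO to a log file and to our own stdout/stderr.
tee -a "$scratch/mytest-stdout" < "$scratch/out.fifo" &
out_pid=$!
tee -a "$scratch/mytest-stderr" < "$scratch/err.fifo" >&2 &
err_pid=$!

# Stand-in for debian/tests/mytest.
mytest() {
    echo "known string on stdout"
    echo "known string on stderr" >&2
}

rc=0
mytest > "$scratch/out.fifo" 2> "$scratch/err.fifo" || rc=$?

# Both tee processes see EOF once every writer has closed the FIFOs.
# This blocks forever if the test leaked a process that still holds one
# of them open for writing.
wait "$out_pid" "$err_pid"
echo "test exit status: $rc"
```

Because the shell waits for both tee processes before it exits, pid 1
cannot go away while output is still in flight. The hang described
above is the price: the wait only returns when the last writer closes
its end.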
If we were able to rely on running arbitrary native code in the testbed,
then we would be able to use a subreaper to capture the entire process
hierarchy below the test script, then iterate through /proc killing
direct children of the subreaper until there are no more processes. I even
have tested C code for this, in an unrelated project from my day job
(steam-runtime-tools). Unfortunately, the testbed is (in principle) a
minbase tarball, from an arbitrary Debian suite which can be much older
than autopkgtest itself, and potentially even a different architecture,
so we cannot rely on being able to inject arbitrary binaries into it;
and we also can't have arbitrary dependencies. So we probably have to
implement this in shell script with one hand tied behind our backs.
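
As one illustration of what a shell-only approach might look like, here
is a sketch that finds processes still holding a given file open by
walking /proc/*/fd. It is not the code from virt-lxc or virt-lxd, and it
is not a subreaper: it assumes a Linux /proc, readlink from coreutils,
and that we are allowed to read the fd directories of the processes
concerned. The holders function and the leaked sleep are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: list (and then kill) processes that still have a
# given file open, using only /proc and basic tools.
set -eu

# holders PATH: print the pid of each process with an fd open on PATH.
holders() {
    target=$(readlink -f "$1")
    for fd in /proc/[0-9]*/fd/*; do
        # Processes may exit while we iterate, so ignore readlink errors.
        [ "$(readlink "$fd" 2>/dev/null)" = "$target" ] || continue
        pid=${fd#/proc/}
        echo "${pid%%/*}"
    done | sort -u
}

tmp=$(mktemp -d)

# Simulate a test that leaked a background process holding its stdout.
# The file is opened in this shell first so that the child is guaranteed
# to have it open by the time we look.
exec 3> "$tmp/mytest-stdout"
sleep 30 >&3 &
leaked=$!
exec 3>&-

found=$(holders "$tmp/mytest-stdout")
echo "holders: $found (leaked pid was $leaked)"

# Unquoted on purpose: there may be more than one pid.
kill $found
wait "$leaked" 2>/dev/null || true
```

This only catches processes that still have the file itself open; a
process that has closed it but is still running would need the
subreaper-style approach described above.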
I'm not treating this as a release blocker right now, but I think it
does need to be solved if we are going to consider virt-unshare to be
a first-class-citizen autopkgtest backend, because at the moment we have
to ignore its GitLab CI failures, and that's not great for having
confidence that it works in practice.
smcv