Thanks Uroš for reporting and Faidon for the analysis!

On Sun, 18 Aug 2024 21:15:23 +0300
Faidon Liambotis <[email protected]> wrote:
> On Sun, Aug 18, 2024 at 04:47:39PM +0200, Uroš Knupleš wrote:
> > [...]
> >
> > Interestingly, this kernel message pops up every time a container
> > is brought up as a non-root user:
> >
> > [ 361.611472] audit: type=1326 audit(1723988353.266:23): auid=1000
> > uid=1000 gid=1000 ses=1 subj=pasta pid=1394 comm="pasta"
> > exe="/usr/bin/pasta" sig=31 arch=40000003 syscall=403 compat=0
> > ip=0xb7fb0579 code=0x80000000
>
> This is indeed the smoking gun. You can parse these messages manually
> (by looking at audit.h, syscall etc. values in headers), or just install
> auditd (apt install auditd), and tail /var/log/audit/audit.log.
>
> In this case, 1326 is type=SECCOMP, and syscall 403 is
> "clock_gettime64".

Right, and this terminates the process right away, as clock_gettime() is
called at every event (inbound data or outbound packets received), in
passt.c, after the seccomp filter is in place (isolate_postfork(),
isolation.c).

> It looks like the passt source code includes a shell script, that parses
> "syscall:" comments and generates seccomp filters for them. (It does not
> use libseccomp).

Correct, we don't use libseccomp for several reasons, including the
advantage of this mechanism based on comments, but also to avoid a
dependency, and to optimise the system call lookup tree in the BPF
program to our (simple) needs.

> In this case, there is a comment that states:
>  * #syscalls clock_gettime arm:clock_gettime64
> ...but on i386, and likely other 32-bit architectures (like 32-bit arm,
> which is seemingly already handled), glibc's clock_gettime() is wrapping
> the clock_gettime64 syscall.

I tested the full functionality on armhf quite recently, so I don't
think there should be issues with this, but I'll give it another run.

> Adding i686:clock_gettime64¹ to that line addresses this specific
> occurrence, but moves the goalpost a bit further: after a few iterations,
> I found that the "fcntl64", "socketcall" and "recvmmsg_time64" also need
> to be allowlisted.
Thanks, that will definitely save some time.

> By adjusting source code comments to add these 4 syscalls in their
> relevant spots and rebuilding passt, I managed to get "podman run --rm
> -it" to work on i386. Note however that this is a rudimentary test and
> for example only exercises the "pasta" code path; someone more familiar
> with passt/pasta should probably verify other code paths as well. It'd
> be a good idea to involve upstream.

I'll run the full test suite on i686 and check if anything is missing.

Unfortunately, I can't easily turn the existing upstream test suite into
an autopkgtest, because it's rather complicated as it involves setting
up guests with throughput tests. But we're working on a new approach to
the test framework that should eventually enable some degree of
modularity, and make running it as part of autopkgtests feasible.

> Hope this helps!
> Faidon
>
> 1: "i686" because seccomp.sh calls `uname -m` if there is no TARGET
> specified, which I think is a (cross-)portability bug of its own...

On Debian builds (and I guess most other distribution builds, really)
TARGET is always passed, we use `uname -m` as a fallback option only, so
that if you just want to type 'make', you can do that. See also
https://salsa.debian.org/reproducible-builds/reprotest/-/issues/6#note_386163

-- 
Stefano
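
As a footnote to the audit record quoted above: the manual decoding Faidon
describes can be sketched in a few lines of Python. This is only an
illustration, not part of passt or auditd. The function name
decode_audit_record() and the three lookup tables are hypothetical and
hand-written, and they cover just the values that appear in this report
(record type 1326, arch 40000003, syscall 403). A real decoder would take
them from linux/audit.h and the per-architecture syscall tables.

```python
import re

# Hand-written tables covering only the values seen in this report
# (assumption: a real tool would derive them from the kernel headers).
AUDIT_TYPES = {1326: "SECCOMP"}
AUDIT_ARCHES = {0x40000003: "i386"}
SYSCALLS = {"i386": {403: "clock_gettime64"}}


def decode_audit_record(line):
    """Extract record type, architecture and syscall name from an audit line."""
    # Collect key=value pairs; values are either quoted or run up to whitespace.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    rec_type = int(fields["type"])
    arch = int(fields["arch"], 16)   # arch is logged as hex without a 0x prefix
    nr = int(fields["syscall"])      # syscall number is logged in decimal
    arch_name = AUDIT_ARCHES.get(arch, "unknown")
    return {
        "type": AUDIT_TYPES.get(rec_type, str(rec_type)),
        "arch": arch_name,
        "syscall": SYSCALLS.get(arch_name, {}).get(nr, str(nr)),
        "comm": fields["comm"].strip('"'),
        "signal": int(fields["sig"]),
    }


if __name__ == "__main__":
    record = (
        'audit: type=1326 audit(1723988353.266:23): auid=1000 uid=1000 '
        'gid=1000 ses=1 subj=pasta pid=1394 comm="pasta" '
        'exe="/usr/bin/pasta" sig=31 arch=40000003 syscall=403 compat=0 '
        'ip=0xb7fb0579 code=0x80000000'
    )
    print(decode_audit_record(record))
```

Values missing from the tables fall back to the raw number, so the sketch
still gives something usable for syscalls it doesn't know about, such as the
other three mentioned in the thread.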

