(Resent after subscription, as non-subscribers get rejected.)
Hello, I opened the initial Debian bug report, but had not taken the time to ask at systemd-devel, and found the question had already been asked in this thread, so I am trying to provide further information.
> Do you have any MACs in effect? No SELinux or Apparmor active
As far as I can see, there is no SELinux in my test VM with a minimal Debian Buster. "aa-status" returns "apparmor module is loaded.", but I did not intentionally configure anything for it.
> Does the host use cgroupsv2 or cgroupsv2 or hybrid? The host system uses systemd v241, compiled with default-hierarchy=hybrid
> Was the container configured to use either? The container uses systemd v251 with default-hierarchy=unified
At the host:
# systemd --version
systemd 241 (241)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS \
+ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid

In the container:
# systemd --version
systemd 252 (252.2-1)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID \
+CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY \
-P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP \
+SYSVINIT default-hierarchy=unified
> What is mounted to /sys/fs/cgroup and below?
At the host:
# mount | grep /sys/fs/cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
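For readers who want to check their own host, here is a rough sketch of how a mount listing like the one above can be read as legacy, hybrid or unified. The function name classify_cgroup_layout is made up for this illustration, and the rules are a simplified heuristic based on the mount points shown here, not systemd's actual detection code.

```python
def classify_cgroup_layout(mount_lines):
    """Guess the cgroup layout from lines of `mount` output.

    Hypothetical helper: a simplified heuristic, not systemd's own logic.
    Returns 'unified', 'hybrid', 'legacy' or 'unknown'.
    """
    fstype_by_path = {}
    for line in mount_lines:
        parts = line.split()
        # Expected shape: <source> on <path> type <fstype> (<options>)
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            fstype_by_path[parts[2]] = parts[4]

    root = fstype_by_path.get("/sys/fs/cgroup")
    if root == "cgroup2":
        # cgroup2 mounted directly on /sys/fs/cgroup
        return "unified"
    if root == "tmpfs":
        # tmpfs with a cgroup2 mount at .../unified next to v1 controllers
        if fstype_by_path.get("/sys/fs/cgroup/unified") == "cgroup2":
            return "hybrid"
        if any(fstype == "cgroup" for fstype in fstype_by_path.values()):
            return "legacy"
    return "unknown"


sample = [
    "tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)",
    "cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)",
    "cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)",
]
print(classify_cgroup_layout(sample))  # prints "hybrid"
```

Fed the first three lines of the host listing, it reports "hybrid", which matches default-hierarchy=hybrid from the host's systemd --version output.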
> This is new payload on old host?
Yes, it is a test to run a quite recent Debian Bookworm/testing system on an older Debian Buster host with kernel 4.19.260-1.
> if you force container into cgroupsv1 mode as the host (by adding
> systemd.unified_cgroup_hierarchy=no to the nspawn cmdline, does that
> work?
I am not sure if I am using it right, but as far as I see "systemd.unified_cgroup_hierarchy=no" does not help. I added "debug" too, see below in [1].
> Also, please provide the relevant output from "strace -f -s 500 -y -o
> /tmp/log.strace" (put on some pastebin)
The following pastebin contains the last quarter of the log.strace file recorded by the command in [1]:
https://paste.debian.net/1262752/

I thought that if strace can observe the process in question, gdb should be able to as well. I found that starting nspawn with gdbserver, using 'set follow-fork-mode child', and running gdb from inside the container via a plain chroot seems to work well.

So it looks like the failing "syscall_0x1b7" from strace is "faccessat2" [2]. It seems "faccessat2" was only added in kernel 5.8 [3], so it might fail with kernel 4.19.

So I fear this needs a newer kernel, and/or is this more of a glibc issue?

Kind regards,
Bernhard

[1]
# strace -f -s 500 -y -o /tmp/log.strace systemd-nspawn --directory=/var/lib/machines/test-bookworm --boot systemd.unified_cgroup_hierarchy=no debug
Spawning container test-bookworm on /var/lib/machines/test-bookworm.
Press ^] three times within 1s to kill container.
systemd 252.2-1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Detected initialized system, this is not the first boot.
Kernel version 4.19.0-22-amd64, our baseline is 4.15
Welcome to Debian GNU/Linux bookworm/sid!
Hostname set to <debian>.
sd-netlink: Failed to enable NETLINK_GET_STRICT_CHK option, ignoring: Protocol not available
Failed to add address 127.0.0.1 to loopback interface: Operation not permitted
Failed to add address ::1 to loopback interface: Operation not permitted
Failed to bring loopback interface up: Operation not permitted
Setting '/proc/sys/fs/file-max' to '9223372036854775807 '
No credentials passed via fw_cfg.
Failed to open '/sys/firmware/dmi/entries/11-0/raw', ignoring: No such file or directory
Found cgroup on /sys/fs/cgroup/systemd, legacy hierarchy
Using cgroup controller name=systemd. File system hierarchy is at /sys/fs/cgroup/systemd.
Failed to create /init.scope control group: Operation not permitted
Failed to allocate manager object: Operation not permitted
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
Container test-bookworm failed with error code 255.

[2]
(gdb) stepi
0x00007ffff79c93ec      29        int ret = INLINE_SYSCALL_CALL (faccessat2, fd, file, mode, flag);
1: x/i $pc
=> 0x7ffff79c93ec <__faccessat+44>:     syscall
(gdb) bt
#0  0x00007ffff79c93ec in __faccessat (fd=fd@entry=-100, file=file@entry=0x7fffffffe3c0 "/sys/fs/cgroup/systemd", mode=mode@entry=0, flag=flag@entry=256) at ../sysdeps/unix/sysv/linux/faccessat.c:29
#1  0x00007ffff7c11380 in controller_is_v1_accessible (root=root@entry=0x0, controller=controller@entry=0x7ffff7f061ee "_systemd") at ../src/basic/cgroup-util.c:590
#2  0x00007ffff7c12432 in cg_get_path_and_check (controller=0x7ffff7f061ee "_systemd", path=0x7fffffffe4e0 "/init.scope", suffix=0x0, fs=0x7fffffffe480) at ../src/basic/cgroup-util.c:612
#3  0x00007ffff7b50eb0 in cg_create (controller=controller@entry=0x7ffff7f061ee "_systemd", path=path@entry=0x7fffffffe4e0 "/init.scope") at ../src/shared/cgroup-setup.c:292
#4  0x00007ffff7b511db in cg_create_and_attach (controller=controller@entry=0x7ffff7f061ee "_systemd", path=path@entry=0x7fffffffe4e0 "/init.scope", pid=pid@entry=0) at ../src/shared/cgroup-setup.c:324
#5  0x00007ffff7e3faa4 in manager_setup_cgroup (m=0x55555556edb0) at ../src/core/cgroup.c:3468
#6  0x00007ffff7ea463b in manager_new (scope=<optimized out>, test_run_flags=MANAGER_TEST_NORMAL, _m=_m@entry=0x7fffffffe600) at ../src/core/manager.c:939
#7  0x000055555555bf5c in main (argc=3, argv=0x7fffffffecd8) at ../src/core/main.c:2928
(gdb) print/x $eax
$1 = 0x1b7
(gdb) stepi
29        int ret = INLINE_SYSCALL_CALL (faccessat2, fd, file, mode, flag);
1: x/i $pc
=> 0x7ffff79c93ee <__faccessat+46>:     cmp    $0xfffffffffffff000,%rax
(gdb) print/x $eax
$2 = 0xffffffff
(gdb) list faccessat.c:29
24
25
26      int
27      __faccessat (int fd, const char *file, int mode, int flag)
28      {
29        int ret = INLINE_SYSCALL_CALL (faccessat2, fd, file, mode, flag);
30      #if __ASSUME_FACCESSAT2
31        return ret;
32      #else
33        if (ret == 0 || errno != ENOSYS)
(gdb) list cgroup-util.c:590
585             /* If root if specified, we check that:
586              * - possible subcgroup is created at root,
587              * - we can modify the hierarchy. */
588
589             cpath = strjoina("/sys/fs/cgroup/", dn, root, root ? "/cgroup.procs" : NULL);
590             return laccess(cpath, root ? W_OK : F_OK);
591     }
592
593     int cg_get_path_and_check(const char *controller, const char *path, const char *suffix, char **fs) {
594             int r;
(gdb) list cgroup-util.c:612
607                      * except for the named hierarchies */
608                     if (startswith(controller, "name="))
609                             return -EOPNOTSUPP;
610             } else {
611                     /* Check if the specified controller is actually accessible */
612                     r = controller_is_v1_accessible(NULL, controller);
613                     if (r < 0)
614                             return r;
615             }
616

[3]
https://bugs.archlinux.org/task/69563
https://man.archlinux.org/man/faccessat2.2.en
"faccessat2() was added to Linux in version 5.8."
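
As a cross-check of the two register values in the gdb session in [2], here is a small sketch that decodes them. It assumes x86-64, where faccessat2 has syscall number 439, and it assumes the value in $eax after the syscall instruction is the low 32 bits of the kernel's negative errno return. The helper names are made up for this illustration.

```python
import errno

# Assumption: syscall number of faccessat2 on x86-64.
FACCESSAT2_NR_X86_64 = 439


def to_signed32(value):
    """Interpret a 32-bit register value (e.g. gdb's $eax) as signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def decode_return(raw):
    """Map a raw syscall return value to an errno name, or 'ok' if >= 0."""
    signed = to_signed32(raw)
    if signed >= 0:
        return "ok"
    return errno.errorcode.get(-signed, "errno %d" % -signed)


def fallback_would_run(err_name):
    """Model of the condition at line 33 of the faccessat.c listing:
    only ENOSYS lets the code continue past that check."""
    return err_name == "ENOSYS"


print(hex(FACCESSAT2_NR_X86_64))                     # prints 0x1b7
print(decode_return(0xFFFFFFFF))                     # prints EPERM
print(fallback_would_run(decode_return(0xFFFFFFFF))) # prints False
```

Read this way, 0x1b7 is indeed faccessat2, and 0xffffffff is -1, i.e. EPERM, which matches the "Operation not permitted" messages in [1]. A kernel that simply lacks the syscall would normally return ENOSYS, which the check at line 33 of the listing treats differently. So the EPERM may come from somewhere other than the missing syscall itself, but that is an inference from the register dump and not something confirmed here.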