I ran into this on jammy/amd64: https://autopkgtest.ubuntu.com/results/autopkgtest-jammy/jammy/amd64/l/livecd-rootfs/20240121_173406_e4f9a@/log.gz
I downloaded all of the amd64 failures and searched for this failure pattern. These were the kernels that were running at the time:

  Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023
  Linux 6.2.0-21-generic #21-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 14 12:34:02 UTC 2023
  Linux 6.3.0-7-generic #7-Ubuntu SMP PREEMPT_DYNAMIC Thu Jun 8 16:02:30 UTC 2023
  Linux 6.5.0-9-generic #9-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 7 01:35:40 UTC 2023
  Linux 6.6.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 30 10:27:29 UTC 2023

Here's the count of failures per image type:

  12 017-disk-image-uefi.binary
   3 018-disk-image.binary
   3 020-kvm-image.binary
   1 023-vagrant.binary
   1 024-vagrant.binary

I can confirm that /dev/loop0p1 is created by devtmpfs. This surprised me because I'd never actually needed to know what devtmpfs was, and I saw devices being created even though I had SIGSTOP'd systemd-udevd. But watching udevadm monitor and forktrace output convinced me.

I had a theory that something was opening the first created partition before all partitions were created. loop_reread_partitions() can fail without returning an error to userspace:

  https://elixir.bootlin.com/linux/v5.15.147/source/drivers/block/loop.c#L676

That could happen if bdev_disk_changed() aborts because it finds that another partition on the device is open:

  https://elixir.bootlin.com/linux/v5.15.147/source/block/partitions/core.c#L662

But then we should see this in dmesg:

  pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n"

I added dmesg calls to check that:

  https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-dannf-test/jammy/amd64/l/livecd-rootfs/20240122_161631_62ecd@/log.gz

.. but no such message appeared, so that's not it.
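Whatever the kernel-side mechanism turns out to be, one userspace mitigation would be to retry the device-node check rather than testing exactly once after losetup returns. A minimal sketch of that idea — the helper name, arguments, and timeout are invented here, this is not actual livecd-rootfs code:

```shell
#!/bin/sh
# Hypothetical sketch (not livecd-rootfs code): instead of checking for
# /dev/loop0p1 exactly once after `losetup -P`, poll for it, so a racing
# partprobe that briefly clears and recreates the partition nodes doesn't
# fail the build.
wait_for_partition() {
    dev="$1"            # e.g. "${loopdev}p1"
    tries="${2:-50}"    # number of polls, 100ms apart => ~5s by default
    while [ "$tries" -gt 0 ]; do
        # Real code would probably use `[ -b ]` to require a block special
        # file; `-e` is used here only so the sketch runs anywhere.
        [ -e "$dev" ] && return 0
        tries=$((tries - 1))
        sleep 0.1
    done
    echo "timed out waiting for $dev" >&2
    return 1
}
```

Running `udevadm settle` after losetup would be a complementary step, since it blocks until the udev event queue drains — but that only helps if the racing rescan has already been queued by the time it's called.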
But what *is* interesting there is that it shows *2* partition scan lines:

  1248s [  990.855361] loop0: detected capacity change from 0 to 4612096
  1248s [  990.855628] loop0:  p1 p14 p15
  1248s [  990.874241] loop0:  p1 p14 p15

Previously we just saw 1:

  1189s [  932.268459] loop0: detected capacity change from 0 to 4612096
  1189s [  932.268715] loop0:  p1 p14 p15

That line only gets printed when bdev_disk_changed() is called. So do we have 2 racing callers?

One thing that seems off is that loop_configure() unsuppresses uevents for the full device before the partition scan, but loop_change_fd() waits until the partition scan is complete. Shouldn't they be following the same pattern? I wonder if that could cause the following race:

  [livecd-rootfs] losetup creates /dev/loop0
  [livecd-rootfs] kernel sends uevent for /dev/loop0
  [livecd-rootfs] /dev/loop0p* appear in devtmpfs
  [udev]          receives uevent for loop0
  [udev]          partprobe /dev/loop0
  [livecd-rootfs] losetup exit(0)
  [partprobe]     /dev/loop0p* cleared
  [livecd-rootfs] check for /dev/loop0p1 FAILS
  [partprobe]     /dev/loop0p* recreated

I tried checking for this using ftrace in a local jammy VM. I haven't been able to reproduce the failure there, but I wanted to see what happens in a normal losetup.. er... setup.

>>>>> First I used losetup to create the device:

root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# loopdev="$(losetup --show -f -P -v /home/ubuntu/disk.img)"
root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 1/1   #P:1
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         losetup-1996    [000] .....   657.573994: bdev_disk_changed <-loop_reread_partitions

>>>>> Only the expected bdev_disk_changed() call.
>>>>> Then I removed the device:

root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# losetup -v -d $loopdev
root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 3/3   #P:1
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         losetup-1996    [000] .....   657.573994: bdev_disk_changed <-loop_reread_partitions
   systemd-udevd-2524    [000] .....   680.555336: bdev_disk_changed <-blkdev_get_whole
         losetup-2523    [000] .....   680.568473: bdev_disk_changed <-__loop_clr_fd

>>>>> udev did rescan the partitions... but only during removal. And it's strange that it gets in there before losetup calls it itself.

Perhaps I should add this tracing into a debug livecd-rootfs and see if I can reproduce in our autopkgtest infra.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2045586

Title:
  livecd-rootfs uses losetup -P for theoretically reliable/synchronous
  partition setup but it's not reliable in noble

Status in linux package in Ubuntu:
  New
Status in livecd-rootfs package in Ubuntu:
  New
Status in util-linux package in Ubuntu:
  New

Bug description:
  In mantic, we migrated livecd-rootfs to use losetup -P instead of
  kpartx, with the expectation that this would give us a reliable, race-
  free way of loop-mounting partitions from a disk image during image
  build.

  In noble, we are finding that it is no longer reliable, and in fact
  fails rather often. It is most noticeable with riscv64 builds, which
  is the architecture where we most frequently ran into problems before
  with kpartx.
  The first riscv64+generic build in noble where the expected loop
  partition device is not available is
  https://launchpad.net/~ubuntu-cdimage/+livefs/ubuntu/noble/cpc/+build/531790

  The failure is however not unique to riscv64, and the autopkgtest for
  the latest version of livecd-rootfs (24.04.7) - an update that
  specifically tries to add more debugging code for this scenario - has
  also failed on ppc64el.
  https://autopkgtest.ubuntu.com/packages/l/livecd-rootfs/noble/ppc64el

  The first failure happened on November 16. While there has been an
  update to the util-linux package in noble, this did not land until
  November 23.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2045586/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp