On Fri, Mar 20, 2026 at 5:18 AM Harry Yoo <[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 04:37:45PM -0700, Nathan Chancellor wrote:
> > Hi all,
> >
> > I am not really sure whose bug this is, as it only appears when three
> > seemingly independent patch series are applied together, so I have added
> > the patch authors and their committers (along with the tracing
> > maintainers) to this thread. Feel free to expand or reduce that list as
> > necessary.
> >
> > Our continuous integration has noticed a crash when booting
> > ppc64_guest_defconfig in QEMU on the past few -next versions.
> >
> > https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/23311154492/job/67811527112
> >
> > This does not appear to be clang related, as it can be reproduced with
> > GCC 15.2.0 as well. Through multiple bisects, I was able to land on
> > applying:
> >
> > mm: improve RSS counter approximation accuracy for proc interfaces [1]
> > vdso/datastore: Allocate data pages dynamically [2]
> > kho: fix deferred init of kho scratch [3]
> >
> > and their dependent changes on top of 7.0-rc4 is enough to reproduce
> > this (at least on two of my machines with the same commands). I have
> > attached the diff from the result of the following 'git apply' commands
> > below, done in a linux-next checkout.
> >
> > $ git checkout v7.0-rc4
> > HEAD is now at f338e7738378 Linux 7.0-rc4
> >
> > # [1]
> > $ git diff 60ddf3eed4999bae440d1cf9e5868ccb3f308b64^..087dd6d2cc12c82945ab859194c32e8e977daae3 | git apply -3v
> > ...
> >
> > # [2]
> > # Fix trivial conflict in init/main.c around headers
> > $ git diff dc432ab7130bb39f5a351281a02d4bc61e85a14a^..05988dba11791ccbb458254484826b32f17f4ad2 | git apply -3v
> > ...
> >
> > # [3]
> > # Fix conflict in kernel/liveupdate/kexec_handover.c due to lack of kho_mem_retrieve(), just add pfn_is_kho_scratch()
> > $ git show 4a78467ffb537463486968232daef1e8a2f105e3 | git apply -3v
> > ...
> > > > $ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64-linux- > > mrproper ppc64_guest_defconfig vmlinux > > > > $ curl -LSs > > https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/ppc64-rootfs.cpio.zst > > | zstd -d >rootfs.cpio > > > > $ qemu-system-ppc64 \ > > -display none \ > > -nodefaults \ > > -cpu power8 \ > > -machine pseries \ > > -vga none \ > > -kernel vmlinux \ > > -initrd rootfs.cpio \ > > -m 1G \ > > -serial mon:stdio > > Thanks, such a detailed steps to reproduce! > Interestingly, the combination of my compiler (GCC 13.3.0) and > QEMU (8.2.2) don't trigger this bug. > > > [ 0.000000][ T0] Linux version 7.0.0-rc4-dirty > > (nathan@framework-amd-ryzen-maxplus-395) (powerpc64-linux-gcc (GCC) 15.2.0, > > GNU ld (GNU Binutils) 2.45) #1 SMP PREEMPT Thu Mar 19 15:45:53 MST 2026 > > ... > > [ 0.216764][ T1] vgaarb: loaded > > [ 0.217590][ T1] clocksource: Switched to clocksource timebase > > [ 0.221007][ T12] BUG: Kernel NULL pointer dereference at 0x00000010 > > [ 0.221049][ T12] Faulting instruction address: 0xc00000000044947c > > [ 0.221237][ T12] Oops: Kernel access of bad area, sig: 11 [#1] > > [ 0.221276][ T12] BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA > > pSeries > > [ 0.221359][ T12] Modules linked in: > > [ 0.221556][ T12] CPU: 0 UID: 0 PID: 12 Comm: kworker/u4:0 Not > > tainted 7.0.0-rc4-dirty #1 PREEMPTLAZY > > [ 0.221631][ T12] Hardware name: IBM pSeries (emulated by qemu) > > POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries > > [ 0.221765][ T12] Workqueue: trace_init_wq > > tracer_init_tracefs_work_func > > [ 0.222065][ T12] NIP: c00000000044947c LR: c00000000041a584 CTR: > > c00000000053aa90 > > [ 0.222084][ T12] REGS: c000000003bc7960 TRAP: 0380 Not tainted > > (7.0.0-rc4-dirty) > > [ 0.222111][ T12] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: > > 44000204 XER: 00000000 > > [ 0.222287][ T12] CFAR: c000000000449420 IRQMASK: 0 > > [ 0.222287][ T12] GPR00: c00000000041a584 
c000000003bc7c00 > > c000000001c08100 c000000002892f20 > > [ 0.222287][ T12] GPR04: c0000000019cfa68 c0000000019cfa60 > > 0000000000000001 0000000000000064 > > [ 0.222287][ T12] GPR08: 0000000000000002 0000000000000000 > > c000000003bba000 0000000000000010 > > [ 0.222287][ T12] GPR12: c00000000053aa90 c000000002c50000 > > c000000001ab25f8 c000000001626690 > > [ 0.222287][ T12] GPR16: 0000000000000000 0000000000000000 > > 0000000000000000 0000000000000000 > > [ 0.222287][ T12] GPR20: c000000001624868 c000000001ab2708 > > c0000000019cfa08 c000000001a00d18 > > [ 0.222287][ T12] GPR24: c0000000019cfa18 fffffffffffffef7 > > c000000003051205 c0000000019cfa68 > > [ 0.222287][ T12] GPR28: 0000000000000000 c0000000019cfa60 > > c000000002894e90 0000000000000000 > > [ 0.222526][ T12] NIP [c00000000044947c] __find_event_file+0x9c/0x110 > > [ 0.222572][ T12] LR [c00000000041a584] > > init_tracer_tracefs+0x274/0xcc0 > > [ 0.222643][ T12] Call Trace: > > [ 0.222690][ T12] [c000000003bc7c00] [c000000000b943b0] > > tracefs_create_file+0x1a0/0x2b0 (unreliable) > > [ 0.222766][ T12] [c000000003bc7c50] [c00000000041a584] > > init_tracer_tracefs+0x274/0xcc0 > > [ 0.222791][ T12] [c000000003bc7dc0] [c000000002046f1c] > > tracer_init_tracefs_work_func+0x50/0x320 > > [ 0.222809][ T12] [c000000003bc7e50] [c000000000276958] > > process_one_work+0x1b8/0x530 > > [ 0.222828][ T12] [c000000003bc7f10] [c00000000027778c] > > worker_thread+0x1dc/0x3d0 > > [ 0.222883][ T12] [c000000003bc7f90] [c000000000284c44] > > kthread+0x194/0x1b0 > > [ 0.222900][ T12] [c000000003bc7fe0] [c00000000000cf30] > > start_kernel_thread+0x14/0x18 > > [ 0.222961][ T12] Code: 7c691b78 7f63db78 2c090000 40820018 e89c0000 > > 49107f21 60000000 2c030000 41820048 ebff0000 7c3ff040 41820038 <e93f0010> > > 7fa3eb78 81490058 e8890018 > > [ 0.223190][ T12] ---[ end trace 0000000000000000 ]--- > > ... 
> >
> > Interestingly, turning on CONFIG_KASAN appears to hide this, maybe
> > pointing to some sort of memory corruption (or something timing
> > related)? If there is any other information I can provide, I am more
> > than happy to do so.
>
> I don't have much idea on how things end up causing
> NULL-pointer-deref... but let's point out suspicious things.
>
> > [1]: https://lore.kernel.org/[email protected]/
>
> @Mathieu: In patch 1/3 description,
> > Changes since v7:
> > - Explicitly initialize the subsystem from start_kernel() right
> >   after mm_core_init() so it is up and running before the creation of
> >   the first mm at boot.
>
> But how does this work when someone calls mm_cpumask() on init_mm early?
> Looks like it will behave incorrectly because get_rss_stat_items_size()
> returns zero?
>
> While it doesn't crash on my environment, it triggers a two warnings
> (with -smp 2 option added). IIUC the cpu bit should have been set in
> setup_arch(), but at the wrong location. After the
> percpu_counter_tree_subsystem_init() function is called, the bit doesn't
> appear to be set.
> > [ 1.392787][ T1] ------------[ cut here ]------------ > [ 1.392935][ T1] WARNING: arch/powerpc/mm/mmu_context.c:106 at > switch_mm_irqs_off+0x190/0x1c0, CPU#0: swapper/0/1 > [ 1.393187][ T1] Modules linked in: > [ 1.393458][ T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted > 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY > [ 1.393600][ T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 > (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries > [ 1.393711][ T1] NIP: c00000000014e390 LR: c00000000014e30c CTR: > 0000000000000000 > [ 1.393752][ T1] REGS: c000000003def7b0 TRAP: 0700 Not tainted > (7.0.0-rc4-next-20260319) > [ 1.393807][ T1] MSR: 8000000002021032 <SF,VEC,ME,IR,DR,RI> CR: > 2800284a XER: 00000000 > [ 1.393944][ T1] CFAR: c00000000014e328 IRQMASK: 3 > [ 1.393944][ T1] GPR00: c00000000014e36c c000000003defa50 > c000000001bb8100 c0000000028d8c80 > [ 1.393944][ T1] GPR04: c000000004ddc04a 000000000000000a > 0000000022222222 2222222222222222 > [ 1.393944][ T1] GPR08: 2222222222222222 0000000000000000 > 0000000000000001 0000000000008000 > [ 1.393944][ T1] GPR12: c000000000521e80 c000000002c70000 > c00000000000fff0 0000000000000000 > [ 1.393944][ T1] GPR16: 0000000000000000 c00000000606c600 > c000000003623ac0 0000000000000000 > [ 1.393944][ T1] GPR20: c000000004c66300 c00000000606fc00 > 0000000000000001 0000000000000001 > [ 1.393944][ T1] GPR24: c000000006069c00 c00000000272c500 > 0000000000000000 0000000000000000 > [ 1.393944][ T1] GPR28: c000000003d68200 0000000000000000 > c0000000028d8a80 c00000000272bd00 > [ 1.394355][ T1] NIP [c00000000014e390] switch_mm_irqs_off+0x190/0x1c0 > [ 1.394395][ T1] LR [c00000000014e30c] switch_mm_irqs_off+0x10c/0x1c0 > [ 1.394519][ T1] Call Trace: > [ 1.394584][ T1] [c000000003defa50] [c00000000014e36c] > switch_mm_irqs_off+0x16c/0x1c0 (unreliable) > [ 1.394676][ T1] [c000000003defab0] [c0000000006edbf0] > begin_new_exec+0x534/0xf60 > [ 1.394732][ T1] [c000000003defb20] [c000000000795538] > load_elf_binary+0x494/0x1d1c > [ 
1.394765][ T1] [c000000003defc70] [c0000000006eb910] > bprm_execve+0x380/0x720 > [ 1.394796][ T1] [c000000003defd00] [c0000000006ed5a8] > kernel_execve+0x12c/0x1bc > [ 1.394831][ T1] [c000000003defd50] [c00000000000eda8] > run_init_process+0xf8/0x160 > [ 1.394864][ T1] [c000000003defde0] [c0000000000100b4] > kernel_init+0xcc/0x268 > [ 1.394899][ T1] [c000000003defe50] [c00000000000cf14] > ret_from_kernel_user_thread+0x14/0x1c > [ 1.394946][ T1] ---- interrupt: 0 at 0x0 > [ 1.395205][ T1] Code: 7fe4fb78 7f83e378 48009171 60000000 4bffff98 > 60000000 60000000 60000000 0fe00000 4bffff00 60000000 60000000 <0fe00000> > 4bffff98 60000000 60000000 > [ 1.395420][ T1] ---[ end trace 0000000000000000 ]--- > [ 1.526024][ T67] mount (67) used greatest stack depth: 28432 bytes left > [ 1.605803][ T69] mount (69) used greatest stack depth: 27872 bytes left > [ 1.667853][ T71] mkdir (71) used greatest stack depth: 27248 bytes left > Saving 256 bits of creditable seed for next boot > [ 1.926636][ T80] ------------[ cut here ]------------ > [ 1.926719][ T80] WARNING: arch/powerpc/mm/mmu_context.c:51 at > switch_mm_irqs_off+0x180/0x1c0, CPU#0: S01seedrng/80 > [ 1.926782][ T80] Modules linked in: > [ 1.926910][ T80] CPU: 0 UID: 0 PID: 80 Comm: S01seedrng Tainted: G > W 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY > [ 1.926990][ T80] Tainted: [W]=WARN > [ 1.927025][ T80] Hardware name: IBM pSeries (emulated by qemu) POWER8 > (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries > [ 1.927091][ T80] NIP: c00000000014e380 LR: c00000000014e24c CTR: > c000000000232894 > [ 1.927131][ T80] REGS: c000000004d5f800 TRAP: 0700 Tainted: G > W (7.0.0-rc4-next-20260319) > [ 1.927179][ T80] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: > 28002828 XER: 20000000 > [ 1.927253][ T80] CFAR: c00000000014e280 IRQMASK: 1 > [ 1.927253][ T80] GPR00: c0000000002328ec c000000004d5faa0 > c000000001bb8100 0000000000000080 > [ 1.927253][ T80] GPR04: c0000000028d8280 c000000004509c00 > 0000000000000002 c00000000272c700 > 
[ 1.927253][ T80] GPR08: fffffffffffffffe c0000000028d8280 > 0000000000000000 0000000048002828 > [ 1.927253][ T80] GPR12: c000000000232894 c000000002c70000 > 0000000000000000 0000000000000002 > [ 1.927253][ T80] GPR16: 0000000000000000 000001002f0a2958 > 000001002f0a2950 ffffffffffffffff > [ 1.927253][ T80] GPR20: 0000000000000000 0000000000000000 > c000000002ab1400 c00000000272c700 > [ 1.927253][ T80] GPR24: 0000000000000000 c0000000028d8a80 > 0000000000000000 0000000000000000 > [ 1.927253][ T80] GPR28: c000000004509c00 0000000000000000 > c00000000272bd00 c0000000028d8280 > [ 1.927629][ T80] NIP [c00000000014e380] switch_mm_irqs_off+0x180/0x1c0 > [ 1.927678][ T80] LR [c00000000014e24c] switch_mm_irqs_off+0x4c/0x1c0 > [ 1.927715][ T80] Call Trace: > [ 1.927737][ T80] [c000000004d5faa0] [c000000004d5faf0] > 0xc000000004d5faf0 (unreliable) > [ 1.927804][ T80] [c000000004d5fb00] [c0000000002328ec] > do_shoot_lazy_tlb+0x58/0x84 > [ 1.927853][ T80] [c000000004d5fb30] [c000000000388304] > smp_call_function_many_cond+0x6a0/0x8d8 > [ 1.927902][ T80] [c000000004d5fc20] [c000000000388624] > on_each_cpu_cond_mask+0x40/0x7c > [ 1.927943][ T80] [c000000004d5fc50] [c000000000232ad4] > __mmdrop+0x88/0x2ec > [ 1.927986][ T80] [c000000004d5fce0] [c000000000242104] > do_exit+0x350/0xde4 > [ 1.928028][ T80] [c000000004d5fdb0] [c000000000242de0] > do_group_exit+0x48/0xbc > [ 1.928072][ T80] [c000000004d5fdf0] [c000000000242e74] > pid_child_should_wake+0x0/0x84 > [ 1.928128][ T80] [c000000004d5fe10] [c000000000030218] > system_call_exception+0x148/0x3c0 > [ 1.928176][ T80] [c000000004d5fe50] [c00000000000c6d4] > system_call_common+0xf4/0x258 > [ 1.928217][ T80] ---- interrupt: c00 at 0x7fff8ade507c > [ 1.928253][ T80] NIP: 00007fff8ade507c LR: 00007fff8ade5034 CTR: > 0000000000000000 > [ 1.928291][ T80] REGS: c000000004d5fe80 TRAP: 0c00 Tainted: G > W (7.0.0-rc4-next-20260319) > [ 1.928333][ T80] MSR: 800000000280f032 > <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI> CR: 24002824 XER: 00000000 > [ 
1.928413][ T80] IRQMASK: 0 > [ 1.928413][ T80] GPR00: 00000000000000ea 00007fffe75beb50 > 00007fff8aed7300 0000000000000000 > [ 1.928413][ T80] GPR04: 0000000000000000 00007fffe75beda0 > 00007fffe75bedb0 0000000000000000 > [ 1.928413][ T80] GPR08: 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 > [ 1.928413][ T80] GPR12: 0000000000000000 00007fff8afaae00 > 00007fffca692568 0000000133cf0440 > [ 1.928413][ T80] GPR16: 0000000000000000 000001002f0a2958 > 000001002f0a2950 ffffffffffffffff > [ 1.928413][ T80] GPR20: 0000000000000000 0000000000000000 > 00007fffe75bf838 00007fff8afa0000 > [ 1.928413][ T80] GPR24: 0000000126911328 0000000000000001 > 00007fff8af9dc00 00007fffe75bf818 > [ 1.928413][ T80] GPR28: 0000000000000003 fffffffffffff000 > 0000000000000000 00007fff8afa3e10 > [ 1.928765][ T80] NIP [00007fff8ade507c] 0x7fff8ade507c > [ 1.928795][ T80] LR [00007fff8ade5034] 0x7fff8ade5034 > [ 1.928835][ T80] ---- interrupt: c00 > [ 1.928924][ T80] Code: 7c0803a6 4e800020 60000000 60000000 7fe4fb78 > 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 <0fe00000> > 4bffff00 60000000 60000000 > [ 1.929054][ T80] ---[ end trace 0000000000000000 ]--- > > > [2]: > > https://lore.kernel.org/[email protected]/ > > > [3]: https://lore.kernel.org/[email protected]/ > > @Michal: Something my AI buddy pointed out... (that I think is valid): > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > index df34797691bd..7363b5b0d22a 100644 > > --- a/mm/mm_init.c > > +++ b/mm/mm_init.c > > @@ -2078,9 +2082,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, > > unsigned long end_pfn, > > unsigned long mo_pfn = ALIGN(spfn + 1, > > MAX_ORDER_NR_PAGES); > > unsigned long chunk_end = min(mo_pfn, epfn); > > > > - nr_pages += deferred_init_pages(zone, spfn, > > chunk_end); > > Previously, deferred_init_pages() returned nr of pages to add, which is > (end_pfn (= chunk_end) - spfn). 
>
> > -			deferred_free_pages(spfn, chunk_end - spfn);
> > +			// KHO scratch is MAX_ORDER_NR_PAGES aligned.
> > +			if (!pfn_is_kho_scratch(spfn))
> > +				deferred_init_pages(zone, spfn, chunk_end);
>
> But since the function is not always called with the change,
> the calculation is moved to...
>
> > +			deferred_free_pages(spfn, chunk_end - spfn);
> >  			spfn = chunk_end;
> >
> >  			if (can_resched)
> > @@ -2088,6 +2094,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> >  			else
> >  				touch_nmi_watchdog();
> >  		}
> > +		nr_pages += epfn - spfn;
>
> Here.
>
> But this is incorrect, because here we have:
> > static unsigned long __init
> > deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> >                            struct zone *zone, bool can_resched)
> > {
> >         int nid = zone_to_nid(zone);
> >         unsigned long nr_pages = 0;
> >         phys_addr_t start, end;
> >         u64 i = 0;
> >
> >         for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
> >                 unsigned long spfn = PFN_UP(start);
> >                 unsigned long epfn = PFN_DOWN(end);
> >
> >                 if (spfn >= end_pfn)
> >                         break;
> >
> >                 spfn = max(spfn, start_pfn);
> >                 epfn = min(epfn, end_pfn);
> >
> >                 while (spfn < epfn) {
>
> The loop condition is (spfn < epfn), and by the time the loop terminates...
>
> >                         unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
> >                         unsigned long chunk_end = min(mo_pfn, epfn);
> >
> >                         // KHO scratch is MAX_ORDER_NR_PAGES aligned.
> >                         if (!pfn_is_kho_scratch(spfn))
> >                                 deferred_init_pages(zone, spfn, chunk_end);
> >
> >                         deferred_free_pages(spfn, chunk_end - spfn);
> >                         spfn = chunk_end;
> >
> >                         if (can_resched)
> >                                 cond_resched();
> >                         else
> >                                 touch_nmi_watchdog();
> >                 }
> >                 nr_pages += epfn - spfn;
>
> epfn - spfn <= 0.
>
> So the number of pages returned by deferred_init_memmap_chunk() becomes
> incorrect.
>
> The equivalent translation of what's there before would be doing
> `nr_pages += chunk_end - spfn;` within the loop.

Good point, thank you. This patch has already been removed from mm-new.

> --
> Cheers,
> Harry / Hyeonggon
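
For illustration, the accounting difference discussed in the quoted review can be reproduced with a small userspace sketch. This is not kernel code: CHUNK_PAGES is an arbitrary stand-in for MAX_ORDER_NR_PAGES, count_after_loop() and count_in_loop() are hypothetical names, and the page init/free calls are left out so that only the nr_pages bookkeeping remains.

```c
#include <assert.h>

/* Assumption: arbitrary stand-in for MAX_ORDER_NR_PAGES. */
#define CHUNK_PAGES 1024UL

/* Round x up to the next multiple of a. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return ((x + a - 1) / a) * a;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Accounting as in the quoted hunk: add (epfn - spfn) after the loop. */
unsigned long count_after_loop(unsigned long spfn, unsigned long epfn)
{
	unsigned long nr_pages = 0;

	assert(spfn <= epfn);
	while (spfn < epfn) {
		unsigned long mo_pfn = align_up(spfn + 1, CHUNK_PAGES);
		unsigned long chunk_end = min_ul(mo_pfn, epfn);

		/* init/free of [spfn, chunk_end) would happen here */
		spfn = chunk_end;
	}
	/* The loop only exits once spfn has reached epfn, so this adds 0. */
	nr_pages += epfn - spfn;
	return nr_pages;
}

/* Accounting as suggested in the review: count each chunk in the loop. */
unsigned long count_in_loop(unsigned long spfn, unsigned long epfn)
{
	unsigned long nr_pages = 0;

	assert(spfn <= epfn);
	while (spfn < epfn) {
		unsigned long mo_pfn = align_up(spfn + 1, CHUNK_PAGES);
		unsigned long chunk_end = min_ul(mo_pfn, epfn);

		/* Must be added before spfn is advanced to chunk_end. */
		nr_pages += chunk_end - spfn;
		spfn = chunk_end;
	}
	return nr_pages;
}
```

With CHUNK_PAGES set to 1024, count_after_loop(100, 5000) returns 0 while count_in_loop(100, 5000) returns 4900, which matches the observation that the post-loop variant under-reports the number of pages handled.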
