On Fri, Mar 20, 2026 at 5:18 AM Harry Yoo <[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 04:37:45PM -0700, Nathan Chancellor wrote:
> > Hi all,
> >
> > I am not really sure whose bug this is, as it only appears when three
> > seemingly independent patch series are applied together, so I have added
> > the patch authors and their committers (along with the tracing
> > maintainers) to this thread. Feel free to expand or reduce that list as
> > necessary.
> >
> > Our continuous integration has noticed a crash when booting
> > ppc64_guest_defconfig in QEMU on the past few -next versions.
> >
> >   https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/23311154492/job/67811527112
> >
> > This does not appear to be clang related, as it can be reproduced with
> > GCC 15.2.0 as well. Through multiple bisects, I was able to land on
> > applying:
> >
> >   mm: improve RSS counter approximation accuracy for proc interfaces [1]
> >   vdso/datastore: Allocate data pages dynamically [2]
> >   kho: fix deferred init of kho scratch [3]
> >
> > and their dependent changes on top of 7.0-rc4 is enough to reproduce
> > this (at least on two of my machines with the same commands). I have
> > attached the diff from the result of the following 'git apply' commands
> > below, done in a linux-next checkout.
> >
> >   $ git checkout v7.0-rc4
> >   HEAD is now at f338e7738378 Linux 7.0-rc4
> >
> >   # [1]
> >   $ git diff 60ddf3eed4999bae440d1cf9e5868ccb3f308b64^..087dd6d2cc12c82945ab859194c32e8e977daae3 | git apply -3v
> >   ...
> >
> >   # [2]
> >   # Fix trivial conflict in init/main.c around headers
> >   $ git diff dc432ab7130bb39f5a351281a02d4bc61e85a14a^..05988dba11791ccbb458254484826b32f17f4ad2 | git apply -3v
> >   ...
> >
> >   # [3]
> >   # Fix conflict in kernel/liveupdate/kexec_handover.c due to lack of kho_mem_retrieve(), just add pfn_is_kho_scratch()
> >   $ git show 4a78467ffb537463486968232daef1e8a2f105e3 | git apply -3v
> >   ...
> >
> >   $ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64-linux- mrproper ppc64_guest_defconfig vmlinux
> >
> >   $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/ppc64-rootfs.cpio.zst | zstd -d >rootfs.cpio
> >
> >   $ qemu-system-ppc64 \
> >       -display none \
> >       -nodefaults \
> >       -cpu power8 \
> >       -machine pseries \
> >       -vga none \
> >       -kernel vmlinux \
> >       -initrd rootfs.cpio \
> >       -m 1G \
> >       -serial mon:stdio
>
> Thanks for such detailed steps to reproduce!
> Interestingly, the combination of my compiler (GCC 13.3.0) and
> QEMU (8.2.2) doesn't trigger this bug.
>
> >   [    0.000000][    T0] Linux version 7.0.0-rc4-dirty 
> > (nathan@framework-amd-ryzen-maxplus-395) (powerpc64-linux-gcc (GCC) 15.2.0, 
> > GNU ld (GNU Binutils) 2.45) #1 SMP PREEMPT Thu Mar 19 15:45:53 MST 2026
> >   ...
> >   [    0.216764][    T1] vgaarb: loaded
> >   [    0.217590][    T1] clocksource: Switched to clocksource timebase
> >   [    0.221007][   T12] BUG: Kernel NULL pointer dereference at 0x00000010
> >   [    0.221049][   T12] Faulting instruction address: 0xc00000000044947c
> >   [    0.221237][   T12] Oops: Kernel access of bad area, sig: 11 [#1]
> >   [    0.221276][   T12] BE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA 
> > pSeries
> >   [    0.221359][   T12] Modules linked in:
> >   [    0.221556][   T12] CPU: 0 UID: 0 PID: 12 Comm: kworker/u4:0 Not 
> > tainted 7.0.0-rc4-dirty #1 PREEMPTLAZY
> >   [    0.221631][   T12] Hardware name: IBM pSeries (emulated by qemu) 
> > POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
> >   [    0.221765][   T12] Workqueue: trace_init_wq 
> > tracer_init_tracefs_work_func
> >   [    0.222065][   T12] NIP:  c00000000044947c LR: c00000000041a584 CTR: 
> > c00000000053aa90
> >   [    0.222084][   T12] REGS: c000000003bc7960 TRAP: 0380   Not tainted  
> > (7.0.0-rc4-dirty)
> >   [    0.222111][   T12] MSR:  8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 
> > 44000204  XER: 00000000
> >   [    0.222287][   T12] CFAR: c000000000449420 IRQMASK: 0
> >   [    0.222287][   T12] GPR00: c00000000041a584 c000000003bc7c00 
> > c000000001c08100 c000000002892f20
> >   [    0.222287][   T12] GPR04: c0000000019cfa68 c0000000019cfa60 
> > 0000000000000001 0000000000000064
> >   [    0.222287][   T12] GPR08: 0000000000000002 0000000000000000 
> > c000000003bba000 0000000000000010
> >   [    0.222287][   T12] GPR12: c00000000053aa90 c000000002c50000 
> > c000000001ab25f8 c000000001626690
> >   [    0.222287][   T12] GPR16: 0000000000000000 0000000000000000 
> > 0000000000000000 0000000000000000
> >   [    0.222287][   T12] GPR20: c000000001624868 c000000001ab2708 
> > c0000000019cfa08 c000000001a00d18
> >   [    0.222287][   T12] GPR24: c0000000019cfa18 fffffffffffffef7 
> > c000000003051205 c0000000019cfa68
> >   [    0.222287][   T12] GPR28: 0000000000000000 c0000000019cfa60 
> > c000000002894e90 0000000000000000
> >   [    0.222526][   T12] NIP [c00000000044947c] __find_event_file+0x9c/0x110
> >   [    0.222572][   T12] LR [c00000000041a584] 
> > init_tracer_tracefs+0x274/0xcc0
> >   [    0.222643][   T12] Call Trace:
> >   [    0.222690][   T12] [c000000003bc7c00] [c000000000b943b0] 
> > tracefs_create_file+0x1a0/0x2b0 (unreliable)
> >   [    0.222766][   T12] [c000000003bc7c50] [c00000000041a584] 
> > init_tracer_tracefs+0x274/0xcc0
> >   [    0.222791][   T12] [c000000003bc7dc0] [c000000002046f1c] 
> > tracer_init_tracefs_work_func+0x50/0x320
> >   [    0.222809][   T12] [c000000003bc7e50] [c000000000276958] 
> > process_one_work+0x1b8/0x530
> >   [    0.222828][   T12] [c000000003bc7f10] [c00000000027778c] 
> > worker_thread+0x1dc/0x3d0
> >   [    0.222883][   T12] [c000000003bc7f90] [c000000000284c44] 
> > kthread+0x194/0x1b0
> >   [    0.222900][   T12] [c000000003bc7fe0] [c00000000000cf30] 
> > start_kernel_thread+0x14/0x18
> >   [    0.222961][   T12] Code: 7c691b78 7f63db78 2c090000 40820018 e89c0000 
> > 49107f21 60000000 2c030000 41820048 ebff0000 7c3ff040 41820038 <e93f0010> 
> > 7fa3eb78 81490058 e8890018
> >   [    0.223190][   T12] ---[ end trace 0000000000000000 ]---
> >   ...
> >
> > Interestingly, turning on CONFIG_KASAN appears to hide this, maybe
> > pointing to some sort of memory corruption (or something timing
> > related)? If there is any other information I can provide, I am more
> > than happy to do so.
>
> I don't have a clear idea of how these end up causing a
> NULL pointer dereference... but let's point out the suspicious parts.
>
> > [1]: https://lore.kernel.org/[email protected]/
>
> @Mathieu: In the patch 1/3 description:
> > Changes since v7:
> > - Explicitly initialize the subsystem from start_kernel() right
> >   after mm_core_init() so it is up and running before the creation of
> >   the first mm at boot.
>
> But how does this work when someone calls mm_cpumask() on init_mm early?
> Looks like it will behave incorrectly because get_rss_stat_items_size()
> returns zero?
>
> While it doesn't crash in my environment, it triggers two warnings
> (with the -smp 2 option added). IIUC the cpu bit should have been set in
> setup_arch(), but at the wrong location. After the
> percpu_counter_tree_subsystem_init() function is called, the bit doesn't
> appear to be set.
>
> [    1.392787][    T1] ------------[ cut here ]------------
> [    1.392935][    T1] WARNING: arch/powerpc/mm/mmu_context.c:106 at 
> switch_mm_irqs_off+0x190/0x1c0, CPU#0: swapper/0/1
> [    1.393187][    T1] Modules linked in:
> [    1.393458][    T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 
> 7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
> [    1.393600][    T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 
> (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
> [    1.393711][    T1] NIP:  c00000000014e390 LR: c00000000014e30c CTR: 
> 0000000000000000
> [    1.393752][    T1] REGS: c000000003def7b0 TRAP: 0700   Not tainted  
> (7.0.0-rc4-next-20260319)
> [    1.393807][    T1] MSR:  8000000002021032 <SF,VEC,ME,IR,DR,RI>  CR: 
> 2800284a  XER: 00000000
> [    1.393944][    T1] CFAR: c00000000014e328 IRQMASK: 3
> [    1.393944][    T1] GPR00: c00000000014e36c c000000003defa50 
> c000000001bb8100 c0000000028d8c80
> [    1.393944][    T1] GPR04: c000000004ddc04a 000000000000000a 
> 0000000022222222 2222222222222222
> [    1.393944][    T1] GPR08: 2222222222222222 0000000000000000 
> 0000000000000001 0000000000008000
> [    1.393944][    T1] GPR12: c000000000521e80 c000000002c70000 
> c00000000000fff0 0000000000000000
> [    1.393944][    T1] GPR16: 0000000000000000 c00000000606c600 
> c000000003623ac0 0000000000000000
> [    1.393944][    T1] GPR20: c000000004c66300 c00000000606fc00 
> 0000000000000001 0000000000000001
> [    1.393944][    T1] GPR24: c000000006069c00 c00000000272c500 
> 0000000000000000 0000000000000000
> [    1.393944][    T1] GPR28: c000000003d68200 0000000000000000 
> c0000000028d8a80 c00000000272bd00
> [    1.394355][    T1] NIP [c00000000014e390] switch_mm_irqs_off+0x190/0x1c0
> [    1.394395][    T1] LR [c00000000014e30c] switch_mm_irqs_off+0x10c/0x1c0
> [    1.394519][    T1] Call Trace:
> [    1.394584][    T1] [c000000003defa50] [c00000000014e36c] 
> switch_mm_irqs_off+0x16c/0x1c0 (unreliable)
> [    1.394676][    T1] [c000000003defab0] [c0000000006edbf0] 
> begin_new_exec+0x534/0xf60
> [    1.394732][    T1] [c000000003defb20] [c000000000795538] 
> load_elf_binary+0x494/0x1d1c
> [    1.394765][    T1] [c000000003defc70] [c0000000006eb910] 
> bprm_execve+0x380/0x720
> [    1.394796][    T1] [c000000003defd00] [c0000000006ed5a8] 
> kernel_execve+0x12c/0x1bc
> [    1.394831][    T1] [c000000003defd50] [c00000000000eda8] 
> run_init_process+0xf8/0x160
> [    1.394864][    T1] [c000000003defde0] [c0000000000100b4] 
> kernel_init+0xcc/0x268
> [    1.394899][    T1] [c000000003defe50] [c00000000000cf14] 
> ret_from_kernel_user_thread+0x14/0x1c
> [    1.394946][    T1] ---- interrupt: 0 at 0x0
> [    1.395205][    T1] Code: 7fe4fb78 7f83e378 48009171 60000000 4bffff98 
> 60000000 60000000 60000000 0fe00000 4bffff00 60000000 60000000 <0fe00000> 
> 4bffff98 60000000 60000000
> [    1.395420][    T1] ---[ end trace 0000000000000000 ]---
> [    1.526024][   T67] mount (67) used greatest stack depth: 28432 bytes left
> [    1.605803][   T69] mount (69) used greatest stack depth: 27872 bytes left
> [    1.667853][   T71] mkdir (71) used greatest stack depth: 27248 bytes left
> Saving 256 bits of creditable seed for next boot
> [    1.926636][   T80] ------------[ cut here ]------------
> [    1.926719][   T80] WARNING: arch/powerpc/mm/mmu_context.c:51 at 
> switch_mm_irqs_off+0x180/0x1c0, CPU#0: S01seedrng/80
> [    1.926782][   T80] Modules linked in:
> [    1.926910][   T80] CPU: 0 UID: 0 PID: 80 Comm: S01seedrng Tainted: G      
>   W           7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
> [    1.926990][   T80] Tainted: [W]=WARN
> [    1.927025][   T80] Hardware name: IBM pSeries (emulated by qemu) POWER8 
> (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
> [    1.927091][   T80] NIP:  c00000000014e380 LR: c00000000014e24c CTR: 
> c000000000232894
> [    1.927131][   T80] REGS: c000000004d5f800 TRAP: 0700   Tainted: G        
> W            (7.0.0-rc4-next-20260319)
> [    1.927179][   T80] MSR:  8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 
> 28002828  XER: 20000000
> [    1.927253][   T80] CFAR: c00000000014e280 IRQMASK: 1
> [    1.927253][   T80] GPR00: c0000000002328ec c000000004d5faa0 
> c000000001bb8100 0000000000000080
> [    1.927253][   T80] GPR04: c0000000028d8280 c000000004509c00 
> 0000000000000002 c00000000272c700
> [    1.927253][   T80] GPR08: fffffffffffffffe c0000000028d8280 
> 0000000000000000 0000000048002828
> [    1.927253][   T80] GPR12: c000000000232894 c000000002c70000 
> 0000000000000000 0000000000000002
> [    1.927253][   T80] GPR16: 0000000000000000 000001002f0a2958 
> 000001002f0a2950 ffffffffffffffff
> [    1.927253][   T80] GPR20: 0000000000000000 0000000000000000 
> c000000002ab1400 c00000000272c700
> [    1.927253][   T80] GPR24: 0000000000000000 c0000000028d8a80 
> 0000000000000000 0000000000000000
> [    1.927253][   T80] GPR28: c000000004509c00 0000000000000000 
> c00000000272bd00 c0000000028d8280
> [    1.927629][   T80] NIP [c00000000014e380] switch_mm_irqs_off+0x180/0x1c0
> [    1.927678][   T80] LR [c00000000014e24c] switch_mm_irqs_off+0x4c/0x1c0
> [    1.927715][   T80] Call Trace:
> [    1.927737][   T80] [c000000004d5faa0] [c000000004d5faf0] 
> 0xc000000004d5faf0 (unreliable)
> [    1.927804][   T80] [c000000004d5fb00] [c0000000002328ec] 
> do_shoot_lazy_tlb+0x58/0x84
> [    1.927853][   T80] [c000000004d5fb30] [c000000000388304] 
> smp_call_function_many_cond+0x6a0/0x8d8
> [    1.927902][   T80] [c000000004d5fc20] [c000000000388624] 
> on_each_cpu_cond_mask+0x40/0x7c
> [    1.927943][   T80] [c000000004d5fc50] [c000000000232ad4] 
> __mmdrop+0x88/0x2ec
> [    1.927986][   T80] [c000000004d5fce0] [c000000000242104] 
> do_exit+0x350/0xde4
> [    1.928028][   T80] [c000000004d5fdb0] [c000000000242de0] 
> do_group_exit+0x48/0xbc
> [    1.928072][   T80] [c000000004d5fdf0] [c000000000242e74] 
> pid_child_should_wake+0x0/0x84
> [    1.928128][   T80] [c000000004d5fe10] [c000000000030218] 
> system_call_exception+0x148/0x3c0
> [    1.928176][   T80] [c000000004d5fe50] [c00000000000c6d4] 
> system_call_common+0xf4/0x258
> [    1.928217][   T80] ---- interrupt: c00 at 0x7fff8ade507c
> [    1.928253][   T80] NIP:  00007fff8ade507c LR: 00007fff8ade5034 CTR: 
> 0000000000000000
> [    1.928291][   T80] REGS: c000000004d5fe80 TRAP: 0c00   Tainted: G        
> W            (7.0.0-rc4-next-20260319)
> [    1.928333][   T80] MSR:  800000000280f032 
> <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI>  CR: 24002824  XER: 00000000
> [    1.928413][   T80] IRQMASK: 0
> [    1.928413][   T80] GPR00: 00000000000000ea 00007fffe75beb50 
> 00007fff8aed7300 0000000000000000
> [    1.928413][   T80] GPR04: 0000000000000000 00007fffe75beda0 
> 00007fffe75bedb0 0000000000000000
> [    1.928413][   T80] GPR08: 0000000000000000 0000000000000000 
> 0000000000000000 0000000000000000
> [    1.928413][   T80] GPR12: 0000000000000000 00007fff8afaae00 
> 00007fffca692568 0000000133cf0440
> [    1.928413][   T80] GPR16: 0000000000000000 000001002f0a2958 
> 000001002f0a2950 ffffffffffffffff
> [    1.928413][   T80] GPR20: 0000000000000000 0000000000000000 
> 00007fffe75bf838 00007fff8afa0000
> [    1.928413][   T80] GPR24: 0000000126911328 0000000000000001 
> 00007fff8af9dc00 00007fffe75bf818
> [    1.928413][   T80] GPR28: 0000000000000003 fffffffffffff000 
> 0000000000000000 00007fff8afa3e10
> [    1.928765][   T80] NIP [00007fff8ade507c] 0x7fff8ade507c
> [    1.928795][   T80] LR [00007fff8ade5034] 0x7fff8ade5034
> [    1.928835][   T80] ---- interrupt: c00
> [    1.928924][   T80] Code: 7c0803a6 4e800020 60000000 60000000 7fe4fb78 
> 7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 <0fe00000> 
> 4bffff00 60000000 60000000
> [    1.929054][   T80] ---[ end trace 0000000000000000 ]---
>
> > [2]: https://lore.kernel.org/[email protected]/
>
> > [3]: https://lore.kernel.org/[email protected]/
>
> @Michal: Something my AI buddy pointed out... (that I think is valid):
>
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index df34797691bd..7363b5b0d22a 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -2078,9 +2082,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> >                       unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
> >                       unsigned long chunk_end = min(mo_pfn, epfn);
> >
> > -                     nr_pages += deferred_init_pages(zone, spfn, chunk_end);
>
> Previously, deferred_init_pages() returned the number of pages to add,
> which is (end_pfn (= chunk_end) - spfn).
>
> > -                     deferred_free_pages(spfn, chunk_end - spfn);
> > +                     // KHO scratch is MAX_ORDER_NR_PAGES aligned.
> > +                     if (!pfn_is_kho_scratch(spfn))
> > +                             deferred_init_pages(zone, spfn, chunk_end);
>
> But since the function is not always called with the change,
> the calculation is moved to...
>
> > +                     deferred_free_pages(spfn, chunk_end - spfn);
> >                       spfn = chunk_end;
> >
> >                       if (can_resched)
> > @@ -2088,6 +2094,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> >                       else
> >                               touch_nmi_watchdog();
> >               }
> > +             nr_pages += epfn - spfn;
>
> Here.
>
> But this is incorrect, because here we have:
> > static unsigned long __init
> > deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> >                            struct zone *zone, bool can_resched)
> > {
> >         int nid = zone_to_nid(zone);
> >         unsigned long nr_pages = 0;
> >         phys_addr_t start, end;
> >         u64 i = 0;
> >
> >         for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
> >                 unsigned long spfn = PFN_UP(start);
> >                 unsigned long epfn = PFN_DOWN(end);
> >
> >                 if (spfn >= end_pfn)
> >                         break;
> >
> >                 spfn = max(spfn, start_pfn);
> >                 epfn = min(epfn, end_pfn);
> >
> >                 while (spfn < epfn) {
>
> The loop condition is (spfn < epfn), and by the time the loop terminates...
>
> >                         unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
> >                         unsigned long chunk_end = min(mo_pfn, epfn);
> >
> >                         // KHO scratch is MAX_ORDER_NR_PAGES aligned.
> >                         if (!pfn_is_kho_scratch(spfn))
> >                                 deferred_init_pages(zone, spfn, chunk_end);
> >
> >                         deferred_free_pages(spfn, chunk_end - spfn);
> >                         spfn = chunk_end;
> >
> >                         if (can_resched)
> >                                 cond_resched();
> >                         else
> >                                 touch_nmi_watchdog();
> >                 }
> >                 nr_pages += epfn - spfn;
>
> epfn - spfn <= 0.
>
> So the number of pages returned by deferred_init_memmap_chunk() becomes
> incorrect.
>
> The equivalent translation of what was there before would be to do
> `nr_pages += chunk_end - spfn;` within the loop, before spfn is
> advanced to chunk_end.

Good point, thank you. This patch has already been removed from mm-new.
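
For anyone following along, the accounting difference can be sketched in
a small userspace model. This is only an illustration under stated
assumptions, not the kernel code: the MAX_ORDER_NR_PAGES value, the
ALIGN_UP() and min_ul() helpers, and the two function names are all
hypothetical, and the page init/free calls are left out entirely.

```c
/* Assumed value for illustration; the real one depends on the kernel config. */
#define MAX_ORDER_NR_PAGES 1024UL

/* Hypothetical stand-in for the kernel's ALIGN(): round x up to a multiple of a. */
#define ALIGN_UP(x, a) ((((x) + (a) - 1) / (a)) * (a))

/* Hypothetical stand-in for the kernel's min() on unsigned long. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Models the patched loop: the count is added after the loop has finished. */
static unsigned long count_after_loop(unsigned long spfn, unsigned long epfn)
{
	unsigned long nr_pages = 0;

	while (spfn < epfn) {
		unsigned long mo_pfn = ALIGN_UP(spfn + 1, MAX_ORDER_NR_PAGES);
		unsigned long chunk_end = min_ul(mo_pfn, epfn);

		spfn = chunk_end;
	}
	/* spfn has caught up with epfn here, so this adds nothing. */
	nr_pages += epfn - spfn;

	return nr_pages;
}

/* Models the suggested fix: count each chunk before spfn is advanced. */
static unsigned long count_in_loop(unsigned long spfn, unsigned long epfn)
{
	unsigned long nr_pages = 0;

	while (spfn < epfn) {
		unsigned long mo_pfn = ALIGN_UP(spfn + 1, MAX_ORDER_NR_PAGES);
		unsigned long chunk_end = min_ul(mo_pfn, epfn);

		nr_pages += chunk_end - spfn;
		spfn = chunk_end;
	}

	return nr_pages;
}
```

With a range of spfn = 100 and epfn = 5000, count_after_loop() returns 0
while count_in_loop() returns 4900, i.e. epfn - spfn. If the loop is
never entered because spfn is already above epfn, the post-loop
subtraction in the first variant would additionally wrap around, since
the operands are unsigned.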

> --
> Cheers,
> Harry / Hyeonggon
