On Thu, Mar 19, 2026 at 04:37:45PM -0700, Nathan Chancellor wrote:
> Hi all,
> 
> I am not really sure whose bug this is, as it only appears when three
> seemingly independent patch series are applied together, so I have added
> the patch authors and their committers (along with the tracing
> maintainers) to this thread. Feel free to expand or reduce that list as
> necessary.
> 
> Our continuous integration has noticed a crash when booting
> ppc64_guest_defconfig in QEMU on the past few -next versions.
> 
>   
> https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/23311154492/job/67811527112
> 
> This does not appear to be clang related, as it can be reproduced with
> GCC 15.2.0 as well. Through multiple bisects, I was able to land on
> applying:
> 
>   mm: improve RSS counter approximation accuracy for proc interfaces [1]
>   vdso/datastore: Allocate data pages dynamically [2]
>   kho: fix deferred init of kho scratch [3]
> 
> and their dependent changes on top of 7.0-rc4 is enough to reproduce
> this (at least on two of my machines with the same commands). I have
> attached the diff from the result of the following 'git apply' commands
> below, done in a linux-next checkout.
> 
>   $ git checkout v7.0-rc4
>   HEAD is now at f338e7738378 Linux 7.0-rc4
> 
>   # [1]
>   $ git diff 
> 60ddf3eed4999bae440d1cf9e5868ccb3f308b64^..087dd6d2cc12c82945ab859194c32e8e977daae3
>  | git apply -3v
>   ...
> 
>   # [2]
>   # Fix trivial conflict in init/main.c around headers
>   $ git diff 
> dc432ab7130bb39f5a351281a02d4bc61e85a14a^..05988dba11791ccbb458254484826b32f17f4ad2
>  | git apply -3v
>   ...
> 
>   # [3]
>   # Fix conflict in kernel/liveupdate/kexec_handover.c due to lack of 
> kho_mem_retrieve(), just add pfn_is_kho_scratch()
>   $ git show 4a78467ffb537463486968232daef1e8a2f105e3 | git apply -3v
>   ...
> 
>   $ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc64-linux- mrproper 
> ppc64_guest_defconfig vmlinux
> 
>   $ curl -LSs 
> https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/ppc64-rootfs.cpio.zst
>  | zstd -d >rootfs.cpio
> 
>   $ qemu-system-ppc64 \
>       -display none \
>       -nodefaults \
>       -cpu power8 \
>       -machine pseries \
>       -vga none \
>       -kernel vmlinux \
>       -initrd rootfs.cpio \
>       -m 1G \
>       -serial mon:stdio

Thanks, such a detailed steps to reproduce!
Interestingly, the combination of my compiler (GCC 13.3.0) and
QEMU (8.2.2) don't trigger this bug.

>   [    0.000000][    T0] Linux version 7.0.0-rc4-dirty 
> (nathan@framework-amd-ryzen-maxplus-395) (powerpc64-linux-gcc (GCC) 15.2.0, 
> GNU ld (GNU Binutils) 2.45) #1 SMP PREEMPT Thu Mar 19 15:45:53 MST 2026
>   ...
>   [    0.216764][    T1] vgaarb: loaded
>   [    0.217590][    T1] clocksource: Switched to clocksource timebase
>   [    0.221007][   T12] BUG: Kernel NULL pointer dereference at 0x00000010
>   [    0.221049][   T12] Faulting instruction address: 0xc00000000044947c
>   [    0.221237][   T12] Oops: Kernel access of bad area, sig: 11 [#1]
>   [    0.221276][   T12] BE PAGE_SIZE=64K MMU=Hash  SMP NR_CPUS=2048 NUMA 
> pSeries
>   [    0.221359][   T12] Modules linked in:
>   [    0.221556][   T12] CPU: 0 UID: 0 PID: 12 Comm: kworker/u4:0 Not tainted 
> 7.0.0-rc4-dirty #1 PREEMPTLAZY
>   [    0.221631][   T12] Hardware name: IBM pSeries (emulated by qemu) POWER8 
> (architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
>   [    0.221765][   T12] Workqueue: trace_init_wq 
> tracer_init_tracefs_work_func
>   [    0.222065][   T12] NIP:  c00000000044947c LR: c00000000041a584 CTR: 
> c00000000053aa90
>   [    0.222084][   T12] REGS: c000000003bc7960 TRAP: 0380   Not tainted  
> (7.0.0-rc4-dirty)
>   [    0.222111][   T12] MSR:  8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 
> 44000204  XER: 00000000
>   [    0.222287][   T12] CFAR: c000000000449420 IRQMASK: 0
>   [    0.222287][   T12] GPR00: c00000000041a584 c000000003bc7c00 
> c000000001c08100 c000000002892f20
>   [    0.222287][   T12] GPR04: c0000000019cfa68 c0000000019cfa60 
> 0000000000000001 0000000000000064
>   [    0.222287][   T12] GPR08: 0000000000000002 0000000000000000 
> c000000003bba000 0000000000000010
>   [    0.222287][   T12] GPR12: c00000000053aa90 c000000002c50000 
> c000000001ab25f8 c000000001626690
>   [    0.222287][   T12] GPR16: 0000000000000000 0000000000000000 
> 0000000000000000 0000000000000000
>   [    0.222287][   T12] GPR20: c000000001624868 c000000001ab2708 
> c0000000019cfa08 c000000001a00d18
>   [    0.222287][   T12] GPR24: c0000000019cfa18 fffffffffffffef7 
> c000000003051205 c0000000019cfa68
>   [    0.222287][   T12] GPR28: 0000000000000000 c0000000019cfa60 
> c000000002894e90 0000000000000000
>   [    0.222526][   T12] NIP [c00000000044947c] __find_event_file+0x9c/0x110
>   [    0.222572][   T12] LR [c00000000041a584] init_tracer_tracefs+0x274/0xcc0
>   [    0.222643][   T12] Call Trace:
>   [    0.222690][   T12] [c000000003bc7c00] [c000000000b943b0] 
> tracefs_create_file+0x1a0/0x2b0 (unreliable)
>   [    0.222766][   T12] [c000000003bc7c50] [c00000000041a584] 
> init_tracer_tracefs+0x274/0xcc0
>   [    0.222791][   T12] [c000000003bc7dc0] [c000000002046f1c] 
> tracer_init_tracefs_work_func+0x50/0x320
>   [    0.222809][   T12] [c000000003bc7e50] [c000000000276958] 
> process_one_work+0x1b8/0x530
>   [    0.222828][   T12] [c000000003bc7f10] [c00000000027778c] 
> worker_thread+0x1dc/0x3d0
>   [    0.222883][   T12] [c000000003bc7f90] [c000000000284c44] 
> kthread+0x194/0x1b0
>   [    0.222900][   T12] [c000000003bc7fe0] [c00000000000cf30] 
> start_kernel_thread+0x14/0x18
>   [    0.222961][   T12] Code: 7c691b78 7f63db78 2c090000 40820018 e89c0000 
> 49107f21 60000000 2c030000 41820048 ebff0000 7c3ff040 41820038 <e93f0010> 
> 7fa3eb78 81490058 e8890018
>   [    0.223190][   T12] ---[ end trace 0000000000000000 ]---
>   ...
>
> Interestingly, turning on CONFIG_KASAN appears to hide this, maybe
> pointing to some sort of memory corruption (or something timing
> related)? If there is any other information I can provide, I am more
> than happy to do so.

I don't have much idea on how things end up causing
NULL-pointer-deref... but let's point out suspicious things.

> [1]: 
> https://lore.kernel.org/[email protected]/

@Mathieu: In patch 1/3 description,
> Changes since v7:
> - Explicitly initialize the subsystem from start_kernel() right
>   after mm_core_init() so it is up and running before the creation of
>   the first mm at boot.

But how does this work when someone calls mm_cpumask() on init_mm early?
Looks like it will behave incorrectly because get_rss_stat_items_size()
returns zero?

While it doesn't crash on my environment, it triggers a two warnings
(with -smp 2 option added). IIUC the cpu bit should have been set in
setup_arch(), but at the wrong location. After the
percpu_counter_tree_subsystem_init() function is called, the bit doesn't
appear to be set.

[    1.392787][    T1] ------------[ cut here ]------------
[    1.392935][    T1] WARNING: arch/powerpc/mm/mmu_context.c:106 at 
switch_mm_irqs_off+0x190/0x1c0, CPU#0: swapper/0/1
[    1.393187][    T1] Modules linked in:
[    1.393458][    T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 
7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[    1.393600][    T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 
(architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    1.393711][    T1] NIP:  c00000000014e390 LR: c00000000014e30c CTR: 
0000000000000000
[    1.393752][    T1] REGS: c000000003def7b0 TRAP: 0700   Not tainted  
(7.0.0-rc4-next-20260319)
[    1.393807][    T1] MSR:  8000000002021032 <SF,VEC,ME,IR,DR,RI>  CR: 
2800284a  XER: 00000000
[    1.393944][    T1] CFAR: c00000000014e328 IRQMASK: 3
[    1.393944][    T1] GPR00: c00000000014e36c c000000003defa50 
c000000001bb8100 c0000000028d8c80
[    1.393944][    T1] GPR04: c000000004ddc04a 000000000000000a 
0000000022222222 2222222222222222
[    1.393944][    T1] GPR08: 2222222222222222 0000000000000000 
0000000000000001 0000000000008000
[    1.393944][    T1] GPR12: c000000000521e80 c000000002c70000 
c00000000000fff0 0000000000000000
[    1.393944][    T1] GPR16: 0000000000000000 c00000000606c600 
c000000003623ac0 0000000000000000
[    1.393944][    T1] GPR20: c000000004c66300 c00000000606fc00 
0000000000000001 0000000000000001
[    1.393944][    T1] GPR24: c000000006069c00 c00000000272c500 
0000000000000000 0000000000000000
[    1.393944][    T1] GPR28: c000000003d68200 0000000000000000 
c0000000028d8a80 c00000000272bd00
[    1.394355][    T1] NIP [c00000000014e390] switch_mm_irqs_off+0x190/0x1c0
[    1.394395][    T1] LR [c00000000014e30c] switch_mm_irqs_off+0x10c/0x1c0
[    1.394519][    T1] Call Trace:
[    1.394584][    T1] [c000000003defa50] [c00000000014e36c] 
switch_mm_irqs_off+0x16c/0x1c0 (unreliable)
[    1.394676][    T1] [c000000003defab0] [c0000000006edbf0] 
begin_new_exec+0x534/0xf60
[    1.394732][    T1] [c000000003defb20] [c000000000795538] 
load_elf_binary+0x494/0x1d1c
[    1.394765][    T1] [c000000003defc70] [c0000000006eb910] 
bprm_execve+0x380/0x720
[    1.394796][    T1] [c000000003defd00] [c0000000006ed5a8] 
kernel_execve+0x12c/0x1bc
[    1.394831][    T1] [c000000003defd50] [c00000000000eda8] 
run_init_process+0xf8/0x160
[    1.394864][    T1] [c000000003defde0] [c0000000000100b4] 
kernel_init+0xcc/0x268
[    1.394899][    T1] [c000000003defe50] [c00000000000cf14] 
ret_from_kernel_user_thread+0x14/0x1c
[    1.394946][    T1] ---- interrupt: 0 at 0x0
[    1.395205][    T1] Code: 7fe4fb78 7f83e378 48009171 60000000 4bffff98 
60000000 60000000 60000000 0fe00000 4bffff00 60000000 60000000 <0fe00000> 
4bffff98 60000000 60000000
[    1.395420][    T1] ---[ end trace 0000000000000000 ]---
[    1.526024][   T67] mount (67) used greatest stack depth: 28432 bytes left
[    1.605803][   T69] mount (69) used greatest stack depth: 27872 bytes left
[    1.667853][   T71] mkdir (71) used greatest stack depth: 27248 bytes left
Saving 256 bits of creditable seed for next boot
[    1.926636][   T80] ------------[ cut here ]------------
[    1.926719][   T80] WARNING: arch/powerpc/mm/mmu_context.c:51 at 
switch_mm_irqs_off+0x180/0x1c0, CPU#0: S01seedrng/80
[    1.926782][   T80] Modules linked in:
[    1.926910][   T80] CPU: 0 UID: 0 PID: 80 Comm: S01seedrng Tainted: G        
W           7.0.0-rc4-next-20260319 #1 PREEMPTLAZY
[    1.926990][   T80] Tainted: [W]=WARN
[    1.927025][   T80] Hardware name: IBM pSeries (emulated by qemu) POWER8 
(architected) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    1.927091][   T80] NIP:  c00000000014e380 LR: c00000000014e24c CTR: 
c000000000232894
[    1.927131][   T80] REGS: c000000004d5f800 TRAP: 0700   Tainted: G        W  
          (7.0.0-rc4-next-20260319)
[    1.927179][   T80] MSR:  8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 28002828 
 XER: 20000000
[    1.927253][   T80] CFAR: c00000000014e280 IRQMASK: 1
[    1.927253][   T80] GPR00: c0000000002328ec c000000004d5faa0 
c000000001bb8100 0000000000000080
[    1.927253][   T80] GPR04: c0000000028d8280 c000000004509c00 
0000000000000002 c00000000272c700
[    1.927253][   T80] GPR08: fffffffffffffffe c0000000028d8280 
0000000000000000 0000000048002828
[    1.927253][   T80] GPR12: c000000000232894 c000000002c70000 
0000000000000000 0000000000000002
[    1.927253][   T80] GPR16: 0000000000000000 000001002f0a2958 
000001002f0a2950 ffffffffffffffff
[    1.927253][   T80] GPR20: 0000000000000000 0000000000000000 
c000000002ab1400 c00000000272c700
[    1.927253][   T80] GPR24: 0000000000000000 c0000000028d8a80 
0000000000000000 0000000000000000
[    1.927253][   T80] GPR28: c000000004509c00 0000000000000000 
c00000000272bd00 c0000000028d8280
[    1.927629][   T80] NIP [c00000000014e380] switch_mm_irqs_off+0x180/0x1c0
[    1.927678][   T80] LR [c00000000014e24c] switch_mm_irqs_off+0x4c/0x1c0
[    1.927715][   T80] Call Trace:
[    1.927737][   T80] [c000000004d5faa0] [c000000004d5faf0] 0xc000000004d5faf0 
(unreliable)
[    1.927804][   T80] [c000000004d5fb00] [c0000000002328ec] 
do_shoot_lazy_tlb+0x58/0x84
[    1.927853][   T80] [c000000004d5fb30] [c000000000388304] 
smp_call_function_many_cond+0x6a0/0x8d8
[    1.927902][   T80] [c000000004d5fc20] [c000000000388624] 
on_each_cpu_cond_mask+0x40/0x7c
[    1.927943][   T80] [c000000004d5fc50] [c000000000232ad4] __mmdrop+0x88/0x2ec
[    1.927986][   T80] [c000000004d5fce0] [c000000000242104] do_exit+0x350/0xde4
[    1.928028][   T80] [c000000004d5fdb0] [c000000000242de0] 
do_group_exit+0x48/0xbc
[    1.928072][   T80] [c000000004d5fdf0] [c000000000242e74] 
pid_child_should_wake+0x0/0x84
[    1.928128][   T80] [c000000004d5fe10] [c000000000030218] 
system_call_exception+0x148/0x3c0
[    1.928176][   T80] [c000000004d5fe50] [c00000000000c6d4] 
system_call_common+0xf4/0x258
[    1.928217][   T80] ---- interrupt: c00 at 0x7fff8ade507c
[    1.928253][   T80] NIP:  00007fff8ade507c LR: 00007fff8ade5034 CTR: 
0000000000000000
[    1.928291][   T80] REGS: c000000004d5fe80 TRAP: 0c00   Tainted: G        W  
          (7.0.0-rc4-next-20260319)
[    1.928333][   T80] MSR:  800000000280f032 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI> 
 CR: 24002824  XER: 00000000
[    1.928413][   T80] IRQMASK: 0
[    1.928413][   T80] GPR00: 00000000000000ea 00007fffe75beb50 
00007fff8aed7300 0000000000000000
[    1.928413][   T80] GPR04: 0000000000000000 00007fffe75beda0 
00007fffe75bedb0 0000000000000000
[    1.928413][   T80] GPR08: 0000000000000000 0000000000000000 
0000000000000000 0000000000000000
[    1.928413][   T80] GPR12: 0000000000000000 00007fff8afaae00 
00007fffca692568 0000000133cf0440
[    1.928413][   T80] GPR16: 0000000000000000 000001002f0a2958 
000001002f0a2950 ffffffffffffffff
[    1.928413][   T80] GPR20: 0000000000000000 0000000000000000 
00007fffe75bf838 00007fff8afa0000
[    1.928413][   T80] GPR24: 0000000126911328 0000000000000001 
00007fff8af9dc00 00007fffe75bf818
[    1.928413][   T80] GPR28: 0000000000000003 fffffffffffff000 
0000000000000000 00007fff8afa3e10
[    1.928765][   T80] NIP [00007fff8ade507c] 0x7fff8ade507c
[    1.928795][   T80] LR [00007fff8ade5034] 0x7fff8ade5034
[    1.928835][   T80] ---- interrupt: c00
[    1.928924][   T80] Code: 7c0803a6 4e800020 60000000 60000000 7fe4fb78 
7f83e378 48009171 60000000 4bffff98 60000000 60000000 60000000 <0fe00000> 
4bffff00 60000000 60000000
[    1.929054][   T80] ---[ end trace 0000000000000000 ]---

> [2]: 
> https://lore.kernel.org/[email protected]/

> [3]: https://lore.kernel.org/[email protected]/

@Michal: Something my AI buddy pointed out... (that I think is valid):

> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index df34797691bd..7363b5b0d22a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2078,9 +2082,11 @@ deferred_init_memmap_chunk(unsigned long start_pfn, 
> unsigned long end_pfn,
>                       unsigned long mo_pfn = ALIGN(spfn + 1, 
> MAX_ORDER_NR_PAGES);
>                       unsigned long chunk_end = min(mo_pfn, epfn);
>  
> -                     nr_pages += deferred_init_pages(zone, spfn, chunk_end);

Previously, deferred_init_pages() returned nr of pages to add, which is
(end_pfn (= chunk_end) - spfn).

> -                     deferred_free_pages(spfn, chunk_end - spfn);
> +                     // KHO scratch is MAX_ORDER_NR_PAGES aligned.
> +                     if (!pfn_is_kho_scratch(spfn))
> +                             deferred_init_pages(zone, spfn, chunk_end);

But since the function is not always called with the change,
the calculation is moved to...

> +                     deferred_free_pages(spfn, chunk_end - spfn);
>                       spfn = chunk_end;
>  
>                       if (can_resched)
> @@ -2088,6 +2094,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, 
> unsigned long end_pfn,
>                       else
>                               touch_nmi_watchdog();
>               }
> +             nr_pages += epfn - spfn;

Here.

But this is incorrect, because here we have:
> static unsigned long __init
> deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>                            struct zone *zone, bool can_resched)
> {
>         int nid = zone_to_nid(zone);
>         unsigned long nr_pages = 0;
>         phys_addr_t start, end;
>         u64 i = 0;
> 
>         for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
>                 unsigned long spfn = PFN_UP(start);
>                 unsigned long epfn = PFN_DOWN(end);
> 
>                 if (spfn >= end_pfn)
>                         break;
> 
>                 spfn = max(spfn, start_pfn);
>                 epfn = min(epfn, end_pfn);
> 
>                 while (spfn < epfn) {

The loop condition is (spfn < epfn), and by the time the loop terminates...

>                         unsigned long mo_pfn = ALIGN(spfn + 1, 
> MAX_ORDER_NR_PAGES);
>                         unsigned long chunk_end = min(mo_pfn, epfn);
> 
>                         // KHO scratch is MAX_ORDER_NR_PAGES aligned.
>                         if (!pfn_is_kho_scratch(spfn))
>                                 deferred_init_pages(zone, spfn, chunk_end);
> 
>                         deferred_free_pages(spfn, chunk_end - spfn);
>                         spfn = chunk_end;
> 
>                         if (can_resched)
>                                 cond_resched();
>                         else
>                                 touch_nmi_watchdog();
>                 }
>                 nr_pages += epfn - spfn;

epfn - spfn <= 0.

So the number of pages returned by deferred_init_memmap_chunk() becomes
incorrect.

The equivalent translation of what's there before would be doing
`nr_pages += chunk_end - spfn;` within the loop.

-- 
Cheers,
Harry / Hyeonggon

Reply via email to