Hi Vlastimil,

On Fri, Feb 27, 2026 at 04:14:42PM +0100, Vlastimil Babka wrote:
> On 1/11/26 09:20, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <[email protected]>
> > 
> > To initialize node, zone and memory map data structures every architecture
> > calls free_area_init() during setup_arch() and passes it an array of zone
> > limits.
> > 
> > Besides code duplication it creates "interesting" ordering cases between
> > allocation and initialization of hugetlb and the memory map. Some
> > architectures allocate hugetlb pages very early in setup_arch() in certain
> > cases, some only create hugetlb CMA areas in setup_arch() and sometimes
> > hugetlb allocations happen in mm_core_init().
> > 
> > With arch_zone_limits_init() helper available now on all architectures it
> > is no longer necessary to call free_area_init() from architecture setup
> > code. Rather core MM initialization can call arch_zone_limits_init() in a
> > single place.
> > 
> > This allows unifying the ordering of hugetlb vs memory map allocation and
> > initialization.
> > 
> > Remove the call to free_area_init() from architecture specific code and
> > place it in a new mm_core_init_early() function that is called immediately
> > after setup_arch().
> > 
> > After this refactoring it is possible to consolidate hugetlb allocations
> > and eliminate differences in ordering of hugetlb and memory map
> > initialization among different architectures.
> > 
> > As the first step of this consolidation move hugetlb_bootmem_alloc() to
> > mm_core_init_early().
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
> 
> I've bisected a problem with virtme-ng testing a NUMA memoryless
> node setup (on x86_64) to this patch (commit d49004c5f0c1).
> 
> It's executed like this, where node 0 has memory and node 1 only cpus:
> 
> vng -vr . -p 8 -m 4G --numa 4G,cpus=0-3 --numa 0,cpus=4-7
> 
> This fails to boot due to:
> 
> [    0.095894] BUG: unable to handle page fault for address: 0000000000004620
> [    0.097196] #PF: supervisor read access in kernel mode
> [    0.098180] #PF: error_code(0x0000) - not-present page
> [    0.099155] PGD 0 P4D 0 
> [    0.099641] Oops: Oops: 0000 [#1] SMP NOPTI
> [    0.100437] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.19.0-rc6-00152-gf206359553c9 #53 PREEMPT
> [    0.102201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
> [    0.104313] RIP: 0010:mm_core_init_early+0x263/0x900
> [    0.105271] Code: 93 ff 72 09 8b 7c 24 30 e8 da 82 00 00 48 63 44 24 30 45 31 db 4c 8b 24 c5 a0 7b 1d 9a 48 89 c3 4c 89 5c 24 50 4c 89 5c 24 58 <41> 83 bc 24 20 46 00 00 00 75 0b 41 83 bc 24 14 47 00 00 00 74 04
> [    0.108863] RSP: 0000:ffffffff99403e38 EFLAGS: 00010046
> [    0.109861] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
> [    0.111223] RDX: 0000000000000040 RSI: 0000000000100000 RDI: ffff89597fffae00
> [    0.112577] RBP: 0000000000000005 R08: 0000000000000000 R09: ffff89597fffa200
> [    0.113924] R10: 80000000ffffe000 R11: 0000000000000000 R12: 0000000000000000
> [    0.115294] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [    0.116656] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> [    0.118193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.119283] CR2: 0000000000004620 CR3: 0000000060048000 CR4: 00000000000000b0
> [    0.120645] Call Trace:
> [    0.121122]  <TASK>
> [    0.121521]  start_kernel+0x5d/0x780
> [    0.122206]  x86_64_start_reservations+0x24/0x30
> [    0.123079]  x86_64_start_kernel+0xd1/0xe0
> [    0.123860]  common_startup_64+0x12c/0x138
> [    0.124641]  </TASK>
> [    0.125061] Modules linked in:
> [    0.125646] CR2: 0000000000004620
> [    0.126279] ---[ end trace 0000000000000000 ]---
> [    0.127162] RIP: 0010:mm_core_init_early+0x263/0x900
> [    0.128106] Code: 93 ff 72 09 8b 7c 24 30 e8 da 82 00 00 48 63 44 24 30 45 31 db 4c 8b 24 c5 a0 7b 1d 9a 48 89 c3 4c 89 5c 24 50 4c 89 5c 24 58 <41> 83 bc 24 20 46 00 00 00 75 0b 41 83 bc 24 14 47 00 00 00 74 04
> [    0.131676] RSP: 0000:ffffffff99403e38 EFLAGS: 00010046
> [    0.132684] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
> [    0.134033] RDX: 0000000000000040 RSI: 0000000000100000 RDI: ffff89597fffae00
> [    0.135412] RBP: 0000000000000005 R08: 0000000000000000 R09: ffff89597fffa200
> [    0.136763] R10: 80000000ffffe000 R11: 0000000000000000 R12: 0000000000000000
> [    0.138112] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [    0.139487] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> [    0.141014] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.142094] CR2: 0000000000004620 CR3: 0000000060048000 CR4: 00000000000000b0
> [    0.143448] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.144833] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> 
> > ./scripts/faddr2line vmlinux mm_core_init_early+0x263/0x900
> mm_core_init_early+0x263/0x900:
> free_area_init_node at mm/mm_init.c:1721
> (inlined by) free_area_init at mm/mm_init.c:1902
> (inlined by) mm_core_init_early at mm/mm_init.c:2681
> 
> It crashes at WARN_ON(pgdat->nr_zones || pgdat->kswapd_highest_zoneidx);
> because pgdat is NULL.
> 
> With some debug printk's I've figured out that in free_area_init()
> we have:
> 
>                 if (!node_online(nid))
>                         alloc_offline_node_data(nid);
>              
>                 pgdat = NODE_DATA(nid);
>                 free_area_init_node(nid);
> 
> 
> But node_online() is true so this allocation doesn't happen, and
> pgdat remains NULL.
> 
> And node_online() becomes true in init_cpu_to_node():
> 
>                 if (!node_online(node))
>                         node_set_online(node);
> 
> But without having a pgdat allocated.
> 
> I was able to work around this by changing the code in free_area_init() to
> 
>                 if (!node_online(nid) || !NODE_DATA(nid))
>                         alloc_offline_node_data(nid);

if (!NODE_DATA(nid)) is enough ...
 
> But I don't have the bigger picture, and also didn't check yet what exactly
> about this patch results in the failure. Probably ordering of various related 
> actions. Thoughts?

... and there's already a fix in mm-hotfixes-stable:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-hotfixes-unstable&id=a4ab97e34bb687a2ca63fc70b47e8762e689797f
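
To illustrate why checking NODE_DATA(nid) alone covers this case, here is a
toy userspace model, not kernel code. Every name in it (pgdat_model,
node_online_model, node_data_model, alloc_node_data_model, init_nodes_model,
setup_memoryless_model) is a made-up stand-in, and it assumes that any node
without a pgdat needs one allocated regardless of its online state. It sets
up the reported layout: node 0 online with a pgdat, node 1 online without
one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 2

/* Hypothetical stand-in for pg_data_t; only here to have something to allocate. */
struct pgdat_model {
	int nr_zones;
};

static bool node_online_model[MAX_NODES];
static struct pgdat_model *node_data_model[MAX_NODES];

/* Stand-in for alloc_offline_node_data(): give the node a zeroed pgdat. */
static void alloc_node_data_model(int nid)
{
	node_data_model[nid] = calloc(1, sizeof(struct pgdat_model));
}

enum check_model { CHECK_ONLINE, CHECK_NODE_DATA };

/*
 * Model of the loop in free_area_init(): allocate node data when the
 * chosen check says so, then count nodes that still have no pgdat.
 * In the real code such a node is where the NULL dereference happens.
 */
static int init_nodes_model(enum check_model check)
{
	int missing = 0;

	for (int nid = 0; nid < MAX_NODES; nid++) {
		bool need = (check == CHECK_ONLINE) ?
			    !node_online_model[nid] :
			    !node_data_model[nid];

		if (need)
			alloc_node_data_model(nid);
		if (!node_data_model[nid])
			missing++;
	}
	return missing;
}

/* Node 0: memory, online, has a pgdat. Node 1: CPUs only, online, no pgdat. */
static void setup_memoryless_model(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		free(node_data_model[nid]);
		node_data_model[nid] = NULL;
		node_online_model[nid] = false;
	}
	node_online_model[0] = true;
	alloc_node_data_model(0);
	node_online_model[1] = true;	/* what init_cpu_to_node() did here */
}
```

With CHECK_ONLINE, node 1 is skipped because it is already marked online and
ends up without a pgdat. With CHECK_NODE_DATA it gets one, and the online
state never needs to be consulted.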

-- 
Sincerely yours,
Mike.
