On 5/19/26 17:10, Juhyung Park wrote:
> free_pagetable() is called via free_hugepage_table() with
> get_order(PMD_SIZE) = 9 to free the 2 MB vmemmap PMD leaves that back
> struct page arrays on x86_64. After commit bf9e4e30f353 ("x86/mm: use
> pagetable_free()"), it goes through pagetable_free() instead of
> __free_pages(), and pagetable_free() ultimately calls
> __free_pages(page, compound_order()) which ignores the explicit order
> argument and infers it from the page's compound metadata.
>
> The vmemmap PMD chunks are allocated by vmemmap_alloc_block() using
> alloc_pages_node() without __GFP_COMP, so PG_head is not set and
> compound_order() returns 0. Only the first of 512 pages of each PMD
> chunk is returned to the buddy allocator on hot-remove; the remaining
> 511 pages stay allocated and become unreachable. Generalized: roughly
> 16 MB leaked per GB of hot-removed memory per cycle.
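
For reference, the "roughly 16 MB per GB" figure above can be reproduced with a small userspace sketch. It assumes 4 KiB base pages, 2 MiB PMD leaves, and a 64-byte struct page (typical for x86_64 but config dependent); the SKETCH_* macros and helper names are hypothetical and not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096ULL       /* 4 KiB base page */
#define SKETCH_PMD_SIZE    (2ULL << 20)  /* 2 MiB vmemmap PMD leaf */
#define SKETCH_STRUCT_PAGE 64ULL         /* assumed sizeof(struct page) */

/* Number of base pages covered by a free (or allocation) of the given order. */
static uint64_t pages_in_order(unsigned int order)
{
	assert(order < 32);
	return 1ULL << order;
}

/*
 * Bytes left allocated per hot-removed GiB when every vmemmap chunk that was
 * allocated with alloc_order is freed with freed_order instead.
 */
static uint64_t leaked_bytes_per_gib(unsigned int alloc_order,
				     unsigned int freed_order)
{
	/* struct page array needed to describe 1 GiB: 262144 * 64 B = 16 MiB */
	uint64_t vmemmap_bytes =
		((1ULL << 30) / SKETCH_PAGE_SIZE) * SKETCH_STRUCT_PAGE;
	/* That array is backed by 8 PMD-sized chunks. */
	uint64_t chunks = vmemmap_bytes / SKETCH_PMD_SIZE;
	uint64_t leaked_pages =
		pages_in_order(alloc_order) - pages_in_order(freed_order);

	return chunks * leaked_pages * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, leaked_bytes_per_gib(9, 0) gives 8 * 511 * 4096 = 16744448 bytes, just under 16 MiB, while leaked_bytes_per_gib(9, 9) gives 0, which is the case when the explicit order is honored.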
>
> The leak affects every memory hot-remove path on x86_64 when
> memmap_on_memory=N (the default), including dax_kmem, virtio-mem,
> balloon drivers, ACPI memory hotplug, and direct sysfs offline+remove.
> memmap_on_memory=Y avoids it because free_hugepage_table() then takes
> the altmap branch and does not call free_pagetable().
>
> Reproduced with CXL memory toggled through DAX in a loop:
>
> daxctl reconfigure-device --mode=system-ram dax0.0 --force
> daxctl reconfigure-device --mode=devdax dax0.0 --force
>
> Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
> Cc: [email protected]
> Cc: Lu Baolu <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Mike Rapoport (Microsoft) <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Assisted-by: Claude:claude-opus-4-7
> Signed-off-by: Juhyung Park <[email protected]>
> ---
> arch/x86/mm/init_64.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index df2261fa4f98..a2301bddb647 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1024,7 +1024,12 @@ static void __meminit free_pagetable(struct page *page, int order)
> free_reserved_pages(page, nr_pages);
> #endif
> } else {
> - pagetable_free(page_ptdesc(page));
> + /*
> + * Use __free_pages() to honor @order: vmemmap PMD leaves
> + * freed here are not compound pages, so pagetable_free()
> + * would leak 511 of 512 pages per 2 MB chunk.
> + */
> + __free_pages(page, order);
> }
> }
>
I sent a proper fix for this already:
https://lore.kernel.org/all/[email protected]/
--
Cheers,
David