Neat. Any sign of it getting merged? Thanks.
On Wed, May 20, 2026 at 2:24 PM David Hildenbrand (Arm) <[email protected]> wrote:
>
> On 5/19/26 17:10, Juhyung Park wrote:
> > free_pagetable() is called via free_hugepage_table() with
> > get_order(PMD_SIZE) = 9 to free the 2 MB vmemmap PMD leaves that back
> > struct page arrays on x86_64. After commit bf9e4e30f353 ("x86/mm: use
> > pagetable_free()"), it goes through pagetable_free() instead of
> > __free_pages(), and pagetable_free() ultimately calls
> > __free_pages(page, compound_order()) which ignores the explicit order
> > argument and infers it from the page's compound metadata.
> >
> > The vmemmap PMD chunks are allocated by vmemmap_alloc_block() using
> > alloc_pages_node() without __GFP_COMP, so PG_head is not set and
> > compound_order() returns 0. Only the first of 512 pages of each PMD
> > chunk is returned to the buddy allocator on hot-remove; the remaining
> > 511 pages stay allocated and become unreachable. Generalized: roughly
> > 16 MB leaked per GB of hot-removed memory per cycle.
> >
> > The leak affects every memory hot-remove path on x86_64 when
> > memmap_on_memory=N (the default), including dax_kmem, virtio-mem,
> > balloon drivers, ACPI memory hotplug, and direct sysfs offline+remove.
> > memmap_on_memory=Y avoids it because free_hugepage_table() then takes
> > the altmap branch and does not call free_pagetable().
> >
> > Reproduced with CXL memory toggled through DAX in a loop:
> >
> >   daxctl reconfigure-device --mode=system-ram dax0.0 --force
> >   daxctl reconfigure-device --mode=devdax dax0.0 --force
> >
> > Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
> > Cc: [email protected]
> > Cc: Lu Baolu <[email protected]>
> > Cc: Jason Gunthorpe <[email protected]>
> > Cc: David Hildenbrand <[email protected]>
> > Cc: Mike Rapoport (Microsoft) <[email protected]>
> > Cc: Oscar Salvador <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Dan Williams <[email protected]>
> > Cc: Dave Jiang <[email protected]>
> > Cc: Vishal Verma <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Assisted-by: Claude:claude-opus-4-7
> > Signed-off-by: Juhyung Park <[email protected]>
> > ---
> >  arch/x86/mm/init_64.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index df2261fa4f98..a2301bddb647 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1024,7 +1024,12 @@ static void __meminit free_pagetable(struct page
> > *page, int order)
> >  		free_reserved_pages(page, nr_pages);
> >  #endif
> >  	} else {
> > -		pagetable_free(page_ptdesc(page));
> > +		/*
> > +		 * Use __free_pages() to honor @order: vmemmap PMD leaves
> > +		 * freed here are not compound pages, so pagetable_free()
> > +		 * would lose leak 511 of 512 pages per 2 MB chunk.
> > +		 */
> > +		__free_pages(page, order);
> >  	}
> >  }
> >
>
> I sent a proper fix for this already:
>
> https://lore.kernel.org/all/[email protected]/
>
> --
> Cheers,
>
> David
