On Wed, Apr 29, 2026 at 12:49:14PM +0200, David Hildenbrand (Arm) wrote:
> In commit bf9e4e30f353 ("x86/mm: use pagetable_free()"), we switched
> from freeing non-boot page tables through __free_pages() to
> pagetable_free().
>
> However, the function is also called to free vmemmap pages.
>
> Given that vmemmap pages are not page tables, already the page_ptdesc(page)
> is wrong. But worse, pagetable_free() calls
>
> __free_pages(page, compound_order(page));
>
> As vmemmap pages are not compound pages (see vmemmap_alloc_block()) --
> except for HVO, which doesn't apply here -- we will only free the first
> page when freeing a PMD-sized vmemmap page, leaking the other ones.

Hi David,

Sneaking in here to share with nvdimm/dax folks as this affects their
nfit_test environment usage.

+ [email protected]

NVDIMM, DAX folks,
This fixes a memory leak present since v6.19 that surfaces during DAX
and NVDIMM unit testing, as well as ad-hoc nfit_test usage. If you are
seeing the system gradually run out of memory across repeated test runs
or namespace reconfiguration cycles, this is likely the cause.
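
For a sense of scale, the arithmetic in David's description can be sketched
in userspace C. This is only an illustration, assuming x86-64 with 4 KiB base
pages and 2 MiB PMD mappings; the helper names are made up for this sketch and
are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry (not stated in the patch): 4 KiB pages, 2 MiB PMDs. */
#define PAGE_SHIFT 12u
#define PMD_SHIFT  21u
#define PMD_ORDER  (PMD_SHIFT - PAGE_SHIFT)

static_assert(PMD_ORDER == 9, "a 2 MiB PMD spans 2^9 base pages");

/* Number of base pages released by freeing a block at the given order. */
static uint64_t pages_freed(unsigned int order)
{
	return UINT64_C(1) << order;
}

/*
 * Hypothetical helper: pages left behind when a block allocated at
 * alloc_order is freed at free_order. A non-compound page reports
 * compound order 0, which is the case described in the quoted text.
 */
static uint64_t pages_leaked(unsigned int alloc_order, unsigned int free_order)
{
	return pages_freed(alloc_order) - pages_freed(free_order);
}

/* Bytes leaked for each PMD-sized vmemmap block freed as order 0. */
static uint64_t bytes_leaked_per_pmd(void)
{
	return pages_leaked(PMD_ORDER, 0) << PAGE_SHIFT;
}
```

Under those assumptions, each PMD-sized vmemmap block that goes through the
broken path gives back one page and leaks the other 511, so just under 2 MiB
per block.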
In my setup, a VM with 5.4 GiB MemAvailable and a 4 GiB nfit_test
namespace lost about 1.1 GiB of MemAvailable per DAX or NVDIMM test suite
run. The VM OOMs partway through the 4th consecutive run of either. The
number of survivable runs scales roughly with available VM memory.
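
If you want to check your own setup, one way is to sample MemAvailable from
/proc/meminfo before and after each run and compare. Below is a minimal
sketch of that, not the method I used; the function names are hypothetical
and the 8 KiB buffer is an assumption about the size of /proc/meminfo.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse MemAvailable (in KiB) out of meminfo-formatted text; -1 if absent. */
static long long parse_mem_available_kib(const char *meminfo)
{
	const char *key = "MemAvailable:";
	const char *line = meminfo;

	while (line && *line) {
		if (strncmp(line, key, strlen(key)) == 0) {
			long long kib;

			/* %lld skips the padding spaces before the number. */
			if (sscanf(line + strlen(key), "%lld", &kib) == 1)
				return kib;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read /proc/meminfo and return MemAvailable in KiB, or -1 on error. */
static long long read_mem_available_kib(void)
{
	char buf[8192]; /* assumed large enough to reach the MemAvailable line */
	FILE *f = fopen("/proc/meminfo", "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	assert(n < sizeof(buf));
	buf[n] = '\0';
	return parse_mem_available_kib(buf);
}
```

Calling read_mem_available_kib() before and after a test suite run and
subtracting gives the per-run drop. A drop that repeats on every run and never
recovers after the namespaces are torn down would be consistent with this leak.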
Symptoms typically begin with "page allocation failure: order 0" messages
from unrelated processes. If a test run is active when memory is
sufficiently depleted, it eventually terminates with an OOM.
I've tested both this posted fix and a revert of the Fixes commit, and both
resolve the leak in my setup. If neither is an option, periodically rebooting
the test environment may be needed for longer test sessions.
-- Alison
>
> Fix it by properly decoupling pagetable and vmemmap freeing.
> free_pagetable() no longer has to mess with SECTION_INFO, as only the
> vmemmap is marked like that in register_page_bootmem_memmap().
>
> The indentation in remove_pmd_table() is messed up, let's fix that
> while touching it.
>
> Note that we'll try to get rid of that bootmem info handling soon. For
> now, we'll handle it similar to free_pagetable(), just avoiding the
> ifdef.
>
> Tested-by: Lance Yang <[email protected]>
> Acked-by: Mike Rapoport (Microsoft) <[email protected]>
> Fixes: bf9e4e30f353 ("x86/mm: use pagetable_free()")
> Cc: [email protected]
> Signed-off-by: David Hildenbrand (Arm) <[email protected]>
> ---
> Reproduced and tested with a simple VM with a virtio-mem device,
> repeatedly adding and removing memory.
>
> Found by code inspection while working on bootmem_info removal.
> ---
> Changes in v2:
> - Don't mess with the altmap with PTEs and add a comment why.
> - Simplify "unsigned long nr_pages" handling.
> - Link to v1:
> https://lore.kernel.org/r/[email protected]
> ---
> arch/x86/mm/init_64.c | 40 ++++++++++++++++++++++++++--------------
> 1 file changed, 26 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index df2261fa4f98..7e20b22d658b 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1014,7 +1014,7 @@ static void __meminit free_pagetable(struct page *page, int order)
> #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
> enum bootmem_type type = bootmem_type(page);
>
> - if (type == SECTION_INFO || type == MIX_SECTION_INFO) {
> + if (type == MIX_SECTION_INFO) {
> while (nr_pages--)
> put_page_bootmem(page++);
> } else {
> @@ -1028,13 +1028,24 @@ static void __meminit free_pagetable(struct page *page, int order)
> }
> }
>
> -static void __meminit free_hugepage_table(struct page *page,
> +static void __meminit free_vmemmap_pages(struct page *page, unsigned int order,
> struct vmem_altmap *altmap)
> {
> - if (altmap)
> - vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
> - else
> - free_pagetable(page, get_order(PMD_SIZE));
> + unsigned long nr_pages = 1u << order;
> +
> + if (altmap) {
> + vmem_altmap_free(altmap, nr_pages);
> + } else if (PageReserved(page)) {
> + if (IS_ENABLED(CONFIG_HAVE_BOOTMEM_INFO_NODE) &&
> + bootmem_type(page) == SECTION_INFO) {
> + while (nr_pages--)
> + put_page_bootmem(page++);
> + } else {
> + free_reserved_pages(page, nr_pages);
> + }
> + } else {
> + __free_pages(page, order);
> + }
> }
>
> static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> @@ -1118,7 +1129,8 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> return;
>
> if (!direct)
> - free_pagetable(pte_page(*pte), 0);
> + /* We never populate base pages from the altmap. */
> + free_vmemmap_pages(pte_page(*pte), 0, NULL);
>
> spin_lock(&init_mm.page_table_lock);
> pte_clear(&init_mm, addr, pte);
> @@ -1153,19 +1165,19 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> if (IS_ALIGNED(addr, PMD_SIZE) &&
> IS_ALIGNED(next, PMD_SIZE)) {
> if (!direct)
> - free_hugepage_table(pmd_page(*pmd),
> - altmap);
> + free_vmemmap_pages(pmd_page(*pmd),
> + PMD_ORDER, altmap);
>
> spin_lock(&init_mm.page_table_lock);
> pmd_clear(pmd);
> spin_unlock(&init_mm.page_table_lock);
> pages++;
> } else if (vmemmap_pmd_is_unused(addr, next)) {
> - free_hugepage_table(pmd_page(*pmd),
> - altmap);
> - spin_lock(&init_mm.page_table_lock);
> - pmd_clear(pmd);
> - spin_unlock(&init_mm.page_table_lock);
> + free_vmemmap_pages(pmd_page(*pmd), PMD_ORDER,
> + altmap);
> + spin_lock(&init_mm.page_table_lock);
> + pmd_clear(pmd);
> + spin_unlock(&init_mm.page_table_lock);
> }
> continue;
> }
>
> ---
>
> base-commit: a2ddbfd1af0f54ea84bf17f0400088815d012e8d
>
> change-id: 20260428-vmemmap-ab4b949aa727
>
> --
>
> Cheers,
>
> David
>