On Mon, Jun 8, 2026 at 8:57 AM David Hildenbrand (Arm) <[email protected]> wrote:
>
> On 6/6/26 12:28, Lance Yang wrote:
> >
> > On Fri, Jun 05, 2026 at 10:14:18AM -0600, Nico Pache wrote:
> >> Enable khugepaged to collapse to mTHP orders. This patch implements the
> >> main scanning logic using a bitmap to track occupied pages and the
> >> algorithm to find optimal collapse sizes.
> >>
> >> Previous to this patch, PMD collapse had 3 main phases, a light weight
> >> scanning phase (mmap_read_lock) that determines a potential PMD
> >> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >> phase (mmap_write_lock).
> >>
> >> To enable mTHP collapse, we make the following changes:
> >>
> >> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >> orders are enabled, we remove the restriction of max_ptes_none during the
> >> scan phase to avoid missing potential mTHP collapse candidates. Once we
> >> have scanned the full PMD range and updated the bitmap to track occupied
> >> pages, we use the bitmap to find the optimal mTHP size.
> >>
> >> Implement mthp_collapse() to walk forward through the bitmap and
> >> determine the best eligible order for each naturally-aligned region. The
> >> algorithm starts at the beginning of the PMD range and, for each offset,
> >> tries the highest order that fits the alignment. If the number of
> >> occupied PTEs in that region satisfies the max_ptes_none threshold for
> >> that order, a collapse is attempted. On failure, the order is
> >> decremented and the same offset is retried at the next smaller size. Once
> >> the smallest enabled order is exhausted (or a collapse succeeds), the
> >> offset advances past the region just processed, and the next attempt
> >> starts at the highest order permitted by the new offset's natural
> >> alignment.
> >>
> >> The algorithm works as follows:
> >> 1) set offset=0 and order=HPAGE_PMD_ORDER
> >> 2) if the order is not enabled, go to step (5)
> >> 3) count occupied PTEs in the (offset, order) range using
> >> bitmap_weight_from()
> >> 4) if the count satisfies the max_ptes_none threshold, attempt
> >> collapse; on success, advance to step (6)
> >> 5) if a smaller enabled order exists, decrement order and retry
> >> from step (2) at the same offset
> >> 6) advance offset past the current region and compute the next
> >> order from the new offset's natural alignment via __ffs(offset),
> >> capped at HPAGE_PMD_ORDER
> >> 7) repeat from step (2) until the full PMD range is covered
> >>
> >> mTHP collapses reject regions containing swapped out or shared pages.
> >> This is because adding new entries can lead to new none pages, and these
> >> may lead to constant promotion into a higher order mTHP. A similar
> >> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >> introducing at least 2x the number of pages, and on a future scan will
> >> satisfy the promotion condition once again. This issue is prevented via
> >> the collapse_max_ptes_none() function which imposes the max_ptes_none
> >> restrictions above.
> >>
> >> We currently only support mTHP collapse for max_ptes_none values of 0
> >> and HPAGE_PMD_NR - 1, resulting in the following behavior:
> >>
> >> - max_ptes_none=0: Never introduce new empty pages during collapse
> >> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >> available mTHP order
> >>
> >> Any other max_ptes_none value will emit a warning and default mTHP
> >> collapse to max_ptes_none=0. There should be no behavior change for PMD
> >> collapse.
> >>
> >> Once we determine which mTHP sizes fit best in that PMD range, a collapse
> >> is attempted. A minimum collapse order of 2 is used as this is the lowest
> >> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >>
> >> Currently madv_collapse is not supported and will only attempt PMD
> >> collapse.
> >>
> >> We can also remove the check for is_khugepaged inside the PMD scan as
> >> the collapse_max_ptes_none() function handles this logic now.
> >>
> >> Signed-off-by: Nico Pache <[email protected]>
> >> ---
> >> mm/khugepaged.c | 146 +++++++++++++++++++++++++++++++++++++++++++++---
> >> 1 file changed, 138 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index ec886a031952..430047316f43 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>
> >> static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>
> >> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
> >> +
> >> struct collapse_control {
> >> bool is_khugepaged;
> >>
> >> @@ -110,6 +112,9 @@ struct collapse_control {
> >>
> >> /* nodemask for allocation fallback */
> >> nodemask_t alloc_nmask;
> >> +
> >> + /* Each bit represents a single occupied (!none/zero) page. */
> >> + DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> >> };
> >>
> >> /**
> >> @@ -1440,20 +1445,130 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >> return result;
> >> }
> >>
> >> +/* Return the highest naturally aligned order that fits at @offset within a PMD. */
> >> +static unsigned int max_order_from_offset(unsigned int offset)
> >> +{
> >> + if (offset == 0)
> >> + return HPAGE_PMD_ORDER;
> >> +
> >> + return min_t(unsigned int, __ffs(offset), HPAGE_PMD_ORDER);
> >> +}
> >> +
> >> +/*
> >> + * mthp_collapse() consumes the bitmap that is generated during
> >> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> >> + *
> >> + * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> >> + * page. We start at the PMD order and check if it is eligible for collapse;
> >> + * if not, we check the left and right halves of the PTE page table we are
> >> + * examining at a lower order.
> >> + *
> >> + * For each of these, we determine how many PTE entries are occupied in the
> >> + * range of PTE entries we propose to collapse, then we compare this to a
> >> + * threshold number of PTE entries which would need to be occupied for a
> >> + * collapse to be permitted at that order (accounting for max_ptes_none).
> >> + *
> >> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> >> + * mTHP.
> >> + */
> >> +static enum scan_result mthp_collapse(struct mm_struct *mm,
> >> + unsigned long address, int referenced, int unmapped,
> >> + struct collapse_control *cc, unsigned long enabled_orders)
> >> +{
> >> + unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> >> + enum scan_result last_result = SCAN_FAIL;
> >> + int collapsed = 0;
> >> + bool alloc_failed = false;
> >> + unsigned long collapse_address;
> >> + unsigned int offset = 0;
> >> + unsigned int order = HPAGE_PMD_ORDER;
> >> +
> >> + while (offset < HPAGE_PMD_NR) {
> >> + nr_ptes = 1UL << order;
> >> +
> >> + if (!test_bit(order, &enabled_orders))
> >> + goto next_order;
> >> +
> >> + max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >> + nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> >> + offset + nr_ptes);
> >> +
> >> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >
> > Looks broken for swap PTEs in PMD collapse ...
> >
> > collapse_scan_pmd() allows them up to max_ptes_swap and record them in
> > unmapped, but they don't get a bit in mthp_present_ptes. And then
> > mthp_collapse() does the check above:
>
> Right. I assumed this is implicitly handled by the optimization in
> collapse_scan_pmd:
>
> if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
>
> But we perform the check a second time.
>
> >
> > nr_occupied_ptes >= nr_ptes - max_ptes_none
> >
> > So max_ptes_none=0 + 511 present PTEs + one allowed swap PTE won't even
> > call collapse_huge_page() for PMD order.
> >
> > Shouldn't we account for them in the PMD-order check? Something like:
> >
> > if (is_pmd_order(order))
> > nr_occupied_ptes += unmapped;

This solution seems good as a temporary fixup, but long-term we may
want something else. I'm still not sure how we plan to support
swapin without causing creep, so I'd be OK with adding a fix for the
legacy PMD behavior until we know how to handle mTHP creep correctly.
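
For reference, here is a minimal userspace sketch of the case Lance
describes. It is not kernel code: the helper names eligible_as_posted()
and eligible_with_fixup() are made up for illustration, is_pmd_order()
is a local stand-in, and HPAGE_PMD_ORDER is assumed to be 9 (4 KiB
pages, 2 MiB PMD). It only models the threshold comparison, not the
scan or the collapse itself.

```c
#include <stdbool.h>

/* Assumption for illustration: 4 KiB base pages and 2 MiB PMDs (order 9). */
#define HPAGE_PMD_ORDER 9u
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/* Userspace stand-in for the kernel helper of the same name. */
static bool is_pmd_order(unsigned int order)
{
	return order == HPAGE_PMD_ORDER;
}

/* Threshold test as posted: only bits set in the present bitmap count. */
static bool eligible_as_posted(unsigned int order, unsigned int nr_occupied_ptes,
			       unsigned int max_ptes_none)
{
	unsigned int nr_ptes = 1u << order;

	return nr_occupied_ptes >= nr_ptes - max_ptes_none;
}

/* With the proposed fixup: swap PTEs the scan accepted count at PMD order only. */
static bool eligible_with_fixup(unsigned int order, unsigned int nr_occupied_ptes,
				unsigned int unmapped, unsigned int max_ptes_none)
{
	unsigned int nr_ptes = 1u << order;

	if (is_pmd_order(order))
		nr_occupied_ptes += unmapped;

	return nr_occupied_ptes >= nr_ptes - max_ptes_none;
}
```

With max_ptes_none=0, 511 present PTEs and one swap PTE,
eligible_as_posted(HPAGE_PMD_ORDER, 511, 0) is false, while
eligible_with_fixup(HPAGE_PMD_ORDER, 511, 1, 0) is true. Below PMD
order the two helpers compare the same values, so the mTHP threshold
is unchanged.
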
> As an alternative, we could either 1) skip the check there for
> pmd order (as the check was already done); or 2) introduce+maintain
> a bitmap that tracks non-present PTEs.
>
> @@ -1475,7 +1477,9 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> offset + nr_ptes);
>
> - if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> + /* Check was already done in the caller. */
> + if (is_pmd_order(order) ||
> + nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> enum scan_result ret;
>
> collapse_address = address + offset * PAGE_SIZE;
>
> 2) would probably be cleanest long-term.

That would be best for future swapin support in mTHP, but I still
don't think it solves the creep issue. Perhaps we could combine the
two bitmaps to determine whether a collapse would make a future
collapse eligible again? I'm not sure, but I'll start thinking about it.

Should I send a fixup for this using Lance's solution? Or does Lance
want to send a patch with the Fixes tag?
>
> --
> Cheers,
>
> David
>