On 2/26/26 04:26, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
> 
> Prior to this patch, PMD collapse had 3 main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> phase (mmap_write_lock).
> 
> To enable mTHP collapse we make the following changes:
> 
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
> 
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. The
> algorithm recursively splits the bitmap into smaller chunks to find the
> highest order mTHPs that satisfy the collapse criteria. We start by
> attempting the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
> 
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        satisfies the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
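
The steps above can be modelled in a few lines of userspace C. This is only a rough sketch under stated assumptions, not the kernel code: PMD_ORDER assumes 4K pages, a plain bool array stands in for cc->mthp_bitmap, try_collapse is a hypothetical callback standing in for the real collapse attempt, and allow_none models only the two supported max_ptes_none settings (0 and HPAGE_PMD_NR - 1), assuming the latter scales to "all but one PTE may be none" at each order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_ORDER  9                    /* assumption: 4K pages, 512 PTEs per PMD */
#define PMD_NR     (1u << PMD_ORDER)
#define MIN_ORDER  2                    /* lowest anon mTHP order */
/* Depth-first halving leaves at most one pending sibling per level. */
#define STACK_SIZE (PMD_ORDER - MIN_ORDER + 1)

struct range {
	uint16_t offset;                /* PTEs from the start of the PMD */
	uint8_t order;
};

/* Hypothetical callback standing in for the actual collapse attempt. */
typedef bool (*collapse_fn)(unsigned int offset, unsigned int order);

static bool always_succeed(unsigned int offset, unsigned int order)
{
	(void)offset;
	(void)order;
	return true;
}

/* Count occupied pages in [offset, offset + nr). */
static unsigned int count_occupied(const bool *occupied, unsigned int offset,
				   unsigned int nr)
{
	unsigned int i, count = 0;

	for (i = 0; i < nr; i++)
		count += occupied[offset + i];
	return count;
}

/* Returns the number of PTEs covered by successful collapses. */
static unsigned int scan_bitmap(const bool *occupied,
				unsigned long enabled_orders,
				bool allow_none, collapse_fn try_collapse)
{
	struct range stack[STACK_SIZE];
	unsigned int collapsed = 0;
	int size = 0;

	stack[size++] = (struct range){ .offset = 0, .order = PMD_ORDER };

	while (size > 0) {
		const struct range r = stack[--size];
		const unsigned int nr = 1u << r.order;
		/* Assumption: only the two supported max_ptes_none settings. */
		const unsigned int max_none = allow_none ? nr - 1 : 0;

		if ((enabled_orders & (1ul << r.order)) &&
		    count_occupied(occupied, r.offset, nr) >= nr - max_none &&
		    try_collapse(r.offset, r.order)) {
			collapsed += nr;
			continue;
		}

		if (r.order > MIN_ORDER) {
			assert(size + 2 <= STACK_SIZE);
			/* Push the right half first so the left half pops first. */
			stack[size++] = (struct range){
				.offset = (uint16_t)(r.offset + nr / 2),
				.order = (uint8_t)(r.order - 1) };
			stack[size++] = (struct range){
				.offset = r.offset,
				.order = (uint8_t)(r.order - 1) };
		}
	}
	return collapsed;
}
```

With only the first 16 pages occupied, all orders enabled and allow_none false, the walk descends the left edge until it reaches (offset 0, order 4), collapses those 16 PTEs, and finds nothing else eligible.
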
> 
> Below is a diagram representing the algorithm and stack items:
> 
>                            offset       mid_offset
>                             |         |
>                             |         |
>                             v         v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
>                           <-------><------->
>                              order-1  order-1
> 
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
> 
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
> 
> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> attempts. There should be no behavior change for PMD collapse.
> 
> Once we determine which mTHP sizes fit best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> 
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order (m)THP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse
> introduces at least 2x the number of pages, so a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
> 
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
> 
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
> 
> Reviewed-by: Baolin Wang <[email protected]>
> Tested-by: Baolin Wang <[email protected]>
> Signed-off-by: Nico Pache <[email protected]>
> ---


[...]


>  /**
> @@ -1361,17 +1392,138 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>       return result;
>  }
>  
> +static void mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +                                u16 offset, u8 order)

Nit: indentation. Same for other functions.

Wondering if you'd want to call these functions

collapse_mthp_*

> +{
> +     const int size = *stack_size;
> +     struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +     VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +     stack->order = order;
> +     stack->offset = offset;
> +     (*stack_size)++;
> +}
> +
> +static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size)
> +{
> +     const int size = *stack_size;
> +
> +     VM_WARN_ON_ONCE(size <= 0);
> +     (*stack_size)--;
> +     return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc,
> +                                              u16 offset, unsigned long nr_pte_entries)

s/pte_entries/ptes/ ?

> +{
> +     bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> +     bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries);
> +     return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> +
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> +             int referenced, int unmapped, struct collapse_control *cc,
> +             bool *mmap_locked, unsigned long enabled_orders)
> +{
> +     unsigned int max_ptes_none, nr_occupied_ptes;
> +     struct mthp_range range;
> +     unsigned long collapse_address;
> +     int collapsed = 0, stack_size = 0;
> +     unsigned long nr_pte_entries;

"nr_ptes" ? Any reason for that to be an unsigned long?

> +     u16 offset;
> +     u8 order;
> +
> +     mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +     while (stack_size > 0) {
> +             range = mthp_stack_pop(cc, &stack_size);
> +             order = range.order;
> +             offset = range.offset;
> +             nr_pte_entries = 1UL << order;
> +
> +             if (!test_bit(order, &enabled_orders))
> +                     goto next_order;
> +
> +             if (cc->is_khugepaged)
> +                     max_ptes_none = collapse_max_ptes_none(order);
> +             else
> +                     max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
> +
> +             if (max_ptes_none == -EINVAL)
> +                     return collapsed;

With the previously suggested rework, you could likely make this

        max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
        if (max_ptes_none < 0)
                return collapsed;
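
One caveat with a "< 0" check: it only works if max_ptes_none is a signed int. The hunk above declares it as unsigned int, where -EINVAL is stored as a large positive value and a sign test can never fire. A minimal userspace sketch of the difference follows; get_limit() is a hypothetical stand-in, not the reworked collapse_max_ptes_none().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: a limit on success, -EINVAL on failure. */
static int get_limit(bool valid)
{
	return valid ? 0 : -EINVAL;
}

/* Signed storage: a plain sign check detects the error. */
static bool failed_signed(bool valid)
{
	int limit = get_limit(valid);

	return limit < 0;
}

/* Unsigned storage: -EINVAL wraps to a large positive value. */
static unsigned int stored_unsigned(bool valid)
{
	return (unsigned int)get_limit(valid);
}
```

So the declaration would presumably need to become a plain int along with the check.
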

> +
> +             nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries);
> +
> +             if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) {
> +                     int ret;
> +
> +                     collapse_address = address + offset * PAGE_SIZE;
> +                     ret = collapse_huge_page(mm, collapse_address, referenced,
> +                                              unmapped, cc, mmap_locked,
> +                                              order);
> +                     if (ret == SCAN_SUCCEED) {
> +                             collapsed += nr_pte_entries;
> +                             continue;
> +                     }
> +             }
> +
> +next_order:
> +             if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> +                     const u8 next_order = order - 1;
> +                     const u16 mid_offset = offset + (nr_pte_entries / 2);
> +
> +                     mthp_stack_push(cc, &stack_size, mid_offset, next_order);
> +                     mthp_stack_push(cc, &stack_size, offset, next_order);
> +             }
> +     }
> +     return collapsed;
> +}
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>               struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>               unsigned int *cur_progress, struct collapse_control *cc)
>  {
>       pmd_t *pmd;
>       pte_t *pte, *_pte;
> -     int none_or_zero = 0, shared = 0, referenced = 0;
> +     int i;
> +     int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>       enum scan_result result = SCAN_FAIL;
>       struct page *page = NULL;
> +     unsigned int max_ptes_none;
>       struct folio *folio = NULL;
>       unsigned long addr;
> +     unsigned long enabled_orders;
>       spinlock_t *ptl;
>       int node = NUMA_NO_NODE, unmapped = 0;
>  
> @@ -1384,8 +1536,21 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>               goto out;
>       }
>  
> +     bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
>       memset(cc->node_load, 0, sizeof(cc->node_load));
>       nodes_clear(cc->alloc_nmask);
> +
> +     enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
> +
> +     /*
> +      * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +      * scan all pages to populate the bitmap for mTHP collapse.
> +      */
> +     if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))
> +             max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER);
> +     else
> +             max_ptes_none = COLLAPSE_MAX_PTES_LIMIT;
> +

I assume that code will change as well. If you need help figuring out how
to make it work, please shout.

[...]

-- 
Cheers,

David
