On 5/22/26 16:59, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
> 
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_pte_none values
> (even those scaled by order), a collapse of a lower order can introduces
> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
> 
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
> 
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
> 
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>   that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>   available mTHP order.
> 
> This removes the possibility of "creep", and a warning will be emitted if
> any non-supported max_ptes_none value is configured with mTHP enabled.
> Any intermediate value will default mTHP collapse to max_ptes_none=0.
> 
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
> 
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
> 
> [1] - https://lore.kernel.org/all/[email protected]
> [2] - 
> https://lore.kernel.org/all/[email protected]
> 
> Reviewed-by: Lance Yang <[email protected]>
> Co-developed-by: Dev Jain <[email protected]>
> Signed-off-by: Dev Jain <[email protected]>
> Signed-off-by: Nico Pache <[email protected]>
> ---
>  mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 88 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 116f39518948..e98ba5b15163 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
>   * the shared zeropage for the given collapse operation.
>   * @cc: The collapse control struct
>   * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
>   *
>   * Return: Maximum number of empty/shared zeropage PTEs for the collapse 
> operation
>   */
>  static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> -             struct vm_area_struct *vma)
> +             struct vm_area_struct *vma, unsigned int order)
>  {
> +     unsigned int max_ptes_none = khugepaged_max_ptes_none;

Can be const, right?

> +
>       if (vma && userfaultfd_armed(vma))
>               return 0;
>       /* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
>       if (!cc->is_khugepaged)
>               return HPAGE_PMD_NR;
> -     /* For all other cases respect the user defined maximum */
> -     return khugepaged_max_ptes_none;
> +     /* for PMD collapse, respect the user defined maximum */
> +     if (is_pmd_order(order))
> +             return max_ptes_none;
> +     /*
> +      * for mTHP collapse with the sysctl value set to 
> KHUGEPAGED_MAX_PTES_LIMIT,
> +      * scale the maximum number of PTEs to the order of the collapse.
> +      */
> +     if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> +             return (1 << order) - 1;
> +     if (!max_ptes_none)
> +             return 0;
> +     /*
> +      * For mTHP collapse of values other than 0 or 
> KHUGEPAGED_MAX_PTES_LIMIT,
> +      * emit a warning and return 0.
> +      */
> +     pr_warn_once("mTHP collapse does not support max_ptes_none values"
> +                  " other than 0 or %u, defaulting to 0.\n",
> +                  KHUGEPAGED_MAX_PTES_LIMIT);
> +     return 0;

This might read slightly clearer as

/*
 * For mTHP ...
 */
if (max_ptes_none)
                pr_warn_once(...)
return 0;

IOW, have a single "return 0" label here and only special-case when to warn.

Acked-by: David Hildenbrand (arm) <[email protected]>

-- 
Cheers,

David

Reply via email to