On 4/19/26 20:57, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates were in the PMD to
> start the collapse attempt.
> 
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
> 
> Signed-off-by: Nico Pache <[email protected]>
> ---
>  mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
>  1 file changed, 57 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 283bb63854a5..ff6f9f1883ed 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct 
> folio **foliop, struct mm_stru
>       return SCAN_SUCCEED;
>  }
>  
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned 
> long address,
> -             int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned 
> long start_addr,
> +             int referenced, int unmapped, struct collapse_control *cc,
> +             unsigned int order)
>  {
>       LIST_HEAD(compound_pagelist);
>       pmd_t *pmd, _pmd;
> -     pte_t *pte;
> +     pte_t *pte = NULL;
>       pgtable_t pgtable;
>       struct folio *folio;
>       spinlock_t *pmd_ptl, *pte_ptl;
>       enum scan_result result = SCAN_FAIL;
>       struct vm_area_struct *vma;
>       struct mmu_notifier_range range;
> +     bool anon_vma_locked = false;
> +     const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> +     const unsigned long end_addr = start_addr + (PAGE_SIZE << order);

In general, const read better when they are at the very top of this list.

>  
> -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -     /*
> -      * Before allocating the hugepage, release the mmap_lock read lock.
> -      * The allocation can take potentially a long time if it involves
> -      * sync compaction, and we do not need to hold the mmap_lock during
> -      * that. We will recheck the vma after taking it again in write mode.
> -      */
> -     mmap_read_unlock(mm);
> -

You should spell out that locking change (moving it to the caller), and why it
is required, in the patch description.

I'd even have put this into a separate patch, as it's independent to the
order-passing changes.
[...]

>        */
>       __folio_mark_uptodate(folio);
> -     pgtable = pmd_pgtable(_pmd);
> -
>       spin_lock(pmd_ptl);
> -     BUG_ON(!pmd_none(*pmd));
> -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> +     WARN_ON_ONCE(!pmd_none(*pmd));
> +     if (is_pmd_order(order)) { /* PMD collapse */
> +             pgtable = pmd_pgtable(_pmd);
> +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> +     } else { /* mTHP collapse */

Do both these comments (PMD collapse ...) really add any value? I'd say it's
pretty self-documenting code already.

> +             map_anon_folio_pte_nopf(folio, pte, vma, start_addr, 
> /*uffd_wp=*/ false);
> +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +     }
>       spin_unlock(pmd_ptl);
>  
>       folio = NULL;
>  
>       result = SCAN_SUCCEED;
>  out_up_write:
> +     if (anon_vma_locked)
> +             anon_vma_unlock_write(vma->anon_vma);
> +     if (pte)
> +             pte_unmap(pte);

-- 
Cheers,

David

Reply via email to