On 4/19/26 20:57, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates were in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Signed-off-by: Nico Pache <[email protected]>
> ---
> mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
> 1 file changed, 57 insertions(+), 46 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 283bb63854a5..ff6f9f1883ed 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct
> folio **foliop, struct mm_stru
> return SCAN_SUCCEED;
> }
>
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned
> long address,
> - int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned
> long start_addr,
> + int referenced, int unmapped, struct collapse_control *cc,
> + unsigned int order)
> {
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte = NULL;
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> enum scan_result result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + bool anon_vma_locked = false;
> + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
In general, const read better when they are at the very top of this list.
>
> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> - /*
> - * Before allocating the hugepage, release the mmap_lock read lock.
> - * The allocation can take potentially a long time if it involves
> - * sync compaction, and we do not need to hold the mmap_lock during
> - * that. We will recheck the vma after taking it again in write mode.
> - */
> - mmap_read_unlock(mm);
> -
You should spell out that locking change (moving it to the caller), and why it
is required, in the patch description.
I'd even have put this into a separate patch, as it's independent to the
order-passing changes.
[...]
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> + WARN_ON_ONCE(!pmd_none(*pmd));
> + if (is_pmd_order(order)) { /* PMD collapse */
> + pgtable = pmd_pgtable(_pmd);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> + } else { /* mTHP collapse */
Do both these comments (PMD collapse ...) really add any value? I'd say it's
pretty self-documenting code already.
> + map_anon_folio_pte_nopf(folio, pte, vma, start_addr,
> /*uffd_wp=*/ false);
> + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> + }
> spin_unlock(pmd_ptl);
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> out_up_write:
> + if (anon_vma_locked)
> + anon_vma_unlock_write(vma->anon_vma);
> + if (pte)
> + pte_unmap(pte);
--
Cheers,
David