On Fri, May 02, 2025 at 05:18:54PM +0200, David Hildenbrand wrote:
> On 02.05.25 14:50, Jann Horn wrote:
> > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <da...@redhat.com> wrote:
> > > On 02.05.25 00:29, Nico Pache wrote:
> > > > On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <ja...@google.com> wrote:
> > > > >
> > > > > On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npa...@redhat.com> wrote:
> > > > > > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > > > > > While scanning PMD ranges for potential collapse candidates, keep track
> > > > > > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > > > > > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > > > > > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > > > > > scan phase so we dont bailout early and miss potential mTHP candidates.
> > > > > >
> > > > > > After the scan is complete we will perform binary recursion on the
> > > > > > bitmap to determine which mTHP size would be most efficient to collapse
> > > > > > to. max_ptes_none will be scaled by the attempted collapse order to
> > > > > > determine how full a THP must be to be eligible.
> > > > > >
> > > > > > If a mTHP collapse is attempted, but contains swapped out, or shared
> > > > > > pages, we dont perform the collapse.
> > > > > [...]
> > > > > > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > > >         vma_start_write(vma);
> > > > > >         anon_vma_lock_write(vma->anon_vma);
> > > > > >
> > > > > > -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > > > -                               address + HPAGE_PMD_SIZE);
> > > > > > +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > > > > > +                               _address + (PAGE_SIZE << order));
> > > > > >         mmu_notifier_invalidate_range_start(&range);
> > > > > >
> > > > > >         pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > > > +
> > > > > >         /*
> > > > > >          * This removes any huge TLB entry from the CPU so we won't allow
> > > > > >          * huge and small TLB entries for the same virtual address to
> > > > >
> > > > > It's not visible in this diff, but we're about to do a
> > > > > pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > > > > entire page table, meaning it tears down 2MiB of address space; and it
> > > > > assumes that the entire page table exclusively corresponds to the
> > > > > current VMA.
> > > > >
> > > > > I think you'll need to ensure that the pmdp_collapse_flush() only
> > > > > happens for full-size THP, and that mTHP only tears down individual
> > > > > PTEs in the relevant range. (That code might get a bit messy, since
> > > > > the existing THP code tears down PTEs in a detached page table, while
> > > > > mTHP would have to do it in a still-attached page table.)
> > > > Hi Jann!
> > > >
> > > > I was under the impression that this is needed to prevent GUP-fast
> > > > races (and potentially others).
> >
> > Why would you need to touch the PMD entry to prevent GUP-fast races for
> > mTHP?
> >
> > > > As you state here, conceptually the PMD case is, detach the PMD, do
> > > > the collapse, then reinstall the PMD (similarly to how the system
> > > > recovers from a failed PMD collapse).
> > > > I tried to keep the current
> > > > locking behavior as it seemed the easiest way to get it right (and not
> > > > break anything). So I keep the PMD detaching and reinstalling for the
> > > > mTHP case too. As Hugh points out I am releasing the anon lock too
> > > > early. I will comment further on his response.
> >
> > As I see it, you're not "keeping" the current locking behavior; you're
> > making a big implicit locking change by reusing a codepath designed
> > for PMD THP for mTHP, where the page table may not be exclusively
> > owned by one VMA.
>
> That is not the intention. The intention in this series (at least as we
> discussed) was to not do it across VMAs; that is considered the next logical
> step (which will be especially relevant on arm64 IMHO).
>
> >
> > > > As I familiarize myself with the code more, I do see potential code
> > > > improvements/cleanups and locking improvements, but I was going to
> > > > leave those to a later series.
> > >
> > > Right, the simplest approach on top of the current PMD collapse is to do
> > > exactly what we do in the PMD case, including the locking: which
> > > apparently is no completely the same yet :).
> > >
> > > Instead of installing a PMD THP, we modify the page table and remap that.
> > >
> > > Moving from the PMD lock to the PTE lock will not make a big change in
> > > practice for most cases: we already must disable essentially all page
> > > table walkers (vma lock, mmap lock in write, rmap lock in write).
> > >
> > > The PMDP clear+flush is primarily to disable the last possible set of
> > > page table walkers: (1) HW modifications and (2) GUP-fast.
> > >
> > > So after the PMDP clear+flush we know that (A) HW can not modify the
> > > pages concurrently and (B) GUP-fast cannot succeed anymore.
> > >
> > > The issue with PTEP clear+flush is that we will have to remember all PTE
> > > values, to reset them if anything goes wrong. Using a single PMD value
> > > is arguably simpler. And then, the benefit vs.
> > > complexity is unclear.
> > >
> > > Certainly something to look into later, but not a requirement for the
> > > first support,
> >
> > As I understand, one rule we currently have in MM is that an operation
> > that logically operates on one VMA (VMA 1) does not touch the page
> > tables of other VMAs (VMA 2) in any way, except that it may walk page
> > tables that cover address space that intersects with both VMA 1 and
> > VMA 2, and create such page tables if they are missing.
>
> Yes, absolutely. That must not happen. And I think I raised it as a problem
> in reply to one of Dev's series.
>
> If this series does not rely on that it must be fixed.
>
> >
> > This proposed patch changes that, without explicitly discussing this
> > locking change.
>
> Yes, that must not happen. We must not zap a PMD to temporarily replace it
> with a pmd_none() entry if any other sane page table walker could stumble
> over it.
>
> This includes another VMA that is not write-locked that could span the PMD.

I feel like we should document these restrictions somewhere :) Perhaps in a new
page table walker doc, or on the
https://origin.kernel.org/doc/html/latest/mm/process_addrs.html page.

Which sounds like I'm volunteering myself to do so, doesn't it... [adds to
todo...]

>
> --
> Cheers,
>
> David / dhildenb
>
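
To make the restriction quoted above a bit more concrete, here is a rough
userspace sketch (not kernel code) of the condition being discussed: zapping a
PMD entry is only safe with respect to other VMAs if the PMD-sized range is
covered entirely by the VMA being operated on. Everything prefixed sketch_ /
SKETCH_ is hypothetical and made up for illustration, and the 4 KiB page /
2 MiB PMD sizes are an assumption (x86-64-like); this is not how khugepaged
actually performs the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry for illustration only: one PMD entry maps 2 MiB. */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PMD_SIZE  (UINT64_C(1) << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK  (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical stand-in for the address range of a VMA. */
struct sketch_vma {
	uint64_t start; /* inclusive */
	uint64_t end;   /* exclusive */
};

/*
 * Return true if the PMD-sized, PMD-aligned range containing addr lies
 * entirely inside vma. VMAs do not overlap, so in that case no other VMA
 * can have mappings in the page table behind this PMD entry.
 */
static bool sketch_vma_owns_pmd(const struct sketch_vma *vma, uint64_t addr)
{
	uint64_t pmd_start = addr & SKETCH_PMD_MASK;

	/* Written without pmd_start + SKETCH_PMD_SIZE to avoid overflow. */
	return vma->start <= pmd_start &&
	       vma->end >= pmd_start &&
	       vma->end - pmd_start >= SKETCH_PMD_SIZE;
}
```

With a check along these lines, a small VMA that only covers part of a 2 MiB
region would fail the test, which is the case where a temporary pmd_none()
entry could be observed by a walker operating on a neighbouring VMA.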