On Fri, Sep 12, 2025 at 7:36 AM Lorenzo Stoakes <lorenzo.stoa...@oracle.com> wrote: > > On Thu, Sep 11, 2025 at 09:28:01PM -0600, Nico Pache wrote: > > The current mechanism for determining mTHP collapse scales the > > khugepaged_max_ptes_none value based on the target order. This > > introduces an undesirable feedback loop, or "creep", when max_ptes_none > > is set to a value greater than HPAGE_PMD_NR / 2. > > > > With this configuration, a successful collapse to order N will populate > > enough pages to satisfy the collapse condition on order N+1 on the next > > scan. This leads to unnecessary work and memory churn. > > > > To fix this issue introduce a helper function that caps the max_ptes_none > > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales > > the max_ptes_none number by the (PMD_ORDER - target collapse order). > > I would say very clearly that this is only in the mTHP case.
ack, I stole most of the verbiage here from other notes I've previously written, but it can be improved. > > > > > > Signed-off-by: Nico Pache <npa...@redhat.com> > > Hmm I thought we were going to wait for David to investigate different > approaches to this? > > This is another issue with quickly going to another iteration. Though I do > think > David explicitly said he'd come back with a solution? Sorry I thought that was being done in lockstep. The last version was about a month ago and I had a lot of changes queued up. Now that we have collapse_max_pte_none() David has a much easier entry point to work off :) I think he will still need this groundwork for the solution he is working on with "eagerness". if 10 -> 511, and 9 ->255, ..., 0 -> 0. It will still have to do the scaling. Although I believe 0-10 should be more like 0-5 mapping to 0,32,64,128,255,511 > > So I'm not sure why we're seeing this solution here? Unless I'm missing > something? > > > --- > > mm/khugepaged.c | 22 +++++++++++++++++++++- > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index b0ae0b63fc9b..4587f2def5c1 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -468,6 +468,26 @@ void __khugepaged_enter(struct mm_struct *mm) > > wake_up_interruptible(&khugepaged_wait); > > } > > > > +/* Returns the scaled max_ptes_none for a given order. > > We don't start comments at the /*, please use a normal comment format like: ack > > /* > * xxxx > */ > > > + * Caps the value to HPAGE_PMD_NR/2 - 1 in the case of mTHP collapse to > > prevent > > This is super unclear. > > It start with 'caps the xxx' which seems like you're talking generally. > > You should say very clearly 'For PMD allocations we apply the > khugepaged_max_ptes_none parameter as normal. For mTHP ... [details about > mTHP]. ack I will clean this up. > > > + * a feedback loop. If max_ptes_none is greater than HPAGE_PMD_NR/2, the > > value > > + * would lead to collapses that introduces 2x more pages than the original > > + * number of pages. On subsequent scans, the max_ptes_none check would be > > + * satisfied and the collapses would continue until the largest order is > > reached > > + */ > > This is a super vauge explanation. Please describe the issue with creep more > clearly. ok I will try to come up with something clearer. > > Also aren't we saying that 511 or 0 are the sensible choices? But now somehow > that's not the case? Oh I stated I wanted to propose this, and although there was some pushback I still thought it deserved another attempt. This still allows for some configurability, and with David's eagerness toggle this still seems to fit nicely. > > You're also not giving a kdoc info on what this returns. Ok I'll add a kdoc here, why this function in particular, I'm trying to understand why we dont add kdocs on other functions? > > > +static int collapse_max_ptes_none(unsigned int order) > > It's a problem that existed already, but khugepaged_max_ptes_none is an > unsigned > int and this returns int. > > Maybe we should fix this while we're at it... ack > > > +{ > > + int max_ptes_none; > > + > > + if (order != HPAGE_PMD_ORDER && > > + khugepaged_max_ptes_none >= HPAGE_PMD_NR/2) > > + max_ptes_none = HPAGE_PMD_NR/2 - 1; > > + else > > + max_ptes_none = khugepaged_max_ptes_none; > > + return max_ptes_none >> (HPAGE_PMD_ORDER - order); > > + > > +} > > + > > I really don't like this formulation, you're making it unnecessarily unclear > and > now, for the super common case of PMD size, you have to figure out 'oh it's > this > second branch and we're subtracting HPAGE_PMD_ORDER from HPAGE_PMD_ORDER so > just > return khugepaged_max_ptes_none'. When we could... just return it no? > > So something like: > > #define MAX_PTES_NONE_MTHP_CAP (HPAGE_PMD_NR / 2 - 1) > > static unsigned int collapse_max_ptes_none(unsigned int order) > { > unsigned int max_ptes_none_pmd; > > /* PMD-sized THPs behave precisely the same as before. */ > if (order == HPAGE_PMD_ORDER) > return khugepaged_max_ptes_none; > > /* > * Bizarrely, this is expressed in terms of PTEs were this PMD-sized. > * For the reasons stated above, we cap this value in the case of mTHP. > */ > max_ptes_none_pmd = MIN(MAX_PTES_NONE_MTHP_CAP, > khugepaged_max_ptes_none); > > /* Apply PMD -> mTHP scaling. */ > return max_ptes_none >> (HPAGE_PMD_ORDER - order); > } yeah that's much cleaner thanks! > > > void khugepaged_enter_vma(struct vm_area_struct *vma, > > vm_flags_t vm_flags) > > { > > @@ -554,7 +574,7 @@ static int __collapse_huge_page_isolate(struct > > vm_area_struct *vma, > > struct folio *folio = NULL; > > pte_t *_pte; > > int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0; > > - int scaled_max_ptes_none = khugepaged_max_ptes_none >> > > (HPAGE_PMD_ORDER - order); > > + int scaled_max_ptes_none = collapse_max_ptes_none(order); > > const unsigned long nr_pages = 1UL << order; > > > > for (_pte = pte; _pte < pte + nr_pages; > > -- > > 2.51.0 > > > > Thanks, Lorenzo >