On Wed, Oct 29, 2025 at 03:10:19PM -0600, Nico Pache wrote:
> On Wed, Oct 29, 2025 at 12:42 PM Lorenzo Stoakes
> <[email protected]> wrote:
> >
> > On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
> > > > >
> > > > > No creep, because you'll always collapse.
> > > >
> > > > OK so in the 511 scenario, do we simply immediately collapse to the
> > > > largest possible _mTHP_ page size if based on adjacent none/zero page
> > > > entries in the PTE, and _never_ collapse to PMD on this basis even if
> > > > we do have sufficient none/zero PTE entries to do so?
> > >
> > > Right. And if we fail to allocate a PMD, we would collapse to smaller
> > > sizes, and later, once a PMD is possible, collapse to a PMD.
> > >
> > > But there is no creep, as we would have collapsed a PMD right from the
> > > start either way.
> >
> > Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
> > ever collapse to PMD, except in cases where, for instance, PTE entries
> > belong to distinct VMAs and so you have to collapse to mTHP as a result?
>
> There are a few failure cases, like exceeding thresholds, or
> allocation failures, but yes your assessment is correct.

Yeah of course being mm there are thorny edge cases :) we do love those...

>
> At 511, the PMD collapse will be satisfied by a single PTE. If the
> collapse fails we will try both sides of the PMD (1024kB, 1024kB).
> The one that contains the non-none PTE will collapse.

Right yes.

>
> This is where the (HPAGE_PMD_ORDER - order) comes from.
> Imagine the 511 case above:
> 511 >> (HPAGE_PMD_ORDER - 9) == 511 >> 0 == 511 max_ptes_none
> 511 >> (HPAGE_PMD_ORDER - 8) (1024kB) == 511 >> 1 == 255 max_ptes_none
>
> Both of these align to the order's size minus 1.

Right.

> >
> > Or IOW 'always collapse to the largest size you can I don't care if it
> > takes up more memory'
> >
> > And at 0, we'd never collapse anything across zero entries, and only when
> > adjacent present entries can be collapsed to mTHP/PMD do we do so?
>
> Yep!
>
> max_ptes_none = 0 + all mTHP sizes enabled gives you a really good
> distribution of mTHP sizes in the system, as zero memory will be
> wasted and the most optimal size (space wise) will be found. At least
> for the memory allocated through khugepaged. The Defer patchset I had
> on top of this series was exactly for that purpose: allow khugepaged
> to determine all the THP usage in the system (other than madvise), and
> allow granular control of memory waste.

Yeah, well it's a trade off really isn't it on 'eagerness' to collapse
non-present entries :)

But we'll come back to that when David has time :)

> > > >
> > > > And only collapse to PMD size if we have sufficient adjacent PTE
> > > > entries that are populated?
> > > >
> > > > Let's really nail this down actually so we can be super clear what the
> > > > issue is here.
> > >
> > > I hope what I wrote above made sense.
> >
> > Asking some q's still, probably more a me thing :)
> >
> > > > >
> > > > > Creep only happens if you wouldn't collapse a PMD without prior
> > > > > mTHP collapse, but suddenly would in the same scenario simply
> > > > > because you had prior mTHP collapse.
> > > > >
> > > > > At least that's my understanding.
> > > >
> > > > OK, that makes sense, is the logic (this may be part of the bit I
> > > > haven't reviewed yet tbh) then that for khugepaged mTHP we have the
> > > > system where we always require prior mTHP collapse _first_?
> > >
> > > So I would describe creep as
> > >
> > > "we would not collapse a PMD THP because max_ptes_none is violated, but
> > > because we collapsed smaller mTHP THPs before, we essentially suddenly
> > > have more PTEs that are not none-or-zero, making us suddenly collapse a
> > > PMD THP at the same place".
> >
> > Yeah that makes sense.
> >
> > >
> > > Assume the following: max_ptes_none = 256
> > >
> > > This means we would only collapse if at most half (256/512) of the PTEs
> > > are none-or-zero.
> > >
> > > But imagine the (simplified) PTE layout with PMD = 8 entries:
> > >
> > > [ P Z P Z P Z Z Z ]
> > >
> > > 3 Present vs. 5 Zero -> do not collapse a PMD (8)
> >
> > OK I'm thinking this is more about /ratio/ than anything else.
> >
> > PMD - <=50% - ok 5/8 = 62.5% no collapse.
>
> < 50%*.
>
> At 50% it's 256 which is actually the worst case scenario. But I read
> further, and it seems like you grasped the issue.

Yeah this is < 50% vs. <= 50% which are fundamentally different obviously :)

> > >
> > > But assume we collapse smaller mTHP (2 entries) first
> > >
> > > [ P P P P P P Z Z ]
> >
> > ...512 KB mTHP (2 entries) - <= 50% means we can do...
> >
> > >
> > > We collapsed 3x "P Z" into "P P" because the ratio allowed for it.
> >
> > Yes so that's:
> >
> > [ P Z P Z P Z Z Z ]
> >
> > ->
> >
> > [ P P P P P P Z Z ]
> >
> > Right?
> >
> > >
> > > Suddenly we have
> > >
> > > 6 Present vs 2 Zero and we collapse a PMD (8)
> > >
> > > [ P P P P P P P P ]
> > >
> > > That's the "creep" problem.
> >
> > I guess we try PMD collapse first then mTHP, but the worry is another pass
> > will collapse to PMD right?
> >
> >
> > Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
> > this because each collapse never provides enough reduction in zero entries
> > to allow for higher order collapse.
> >
> > Hence the idea of capping at 255
>
> Yep! We've discussed other solutions, like tracking collapsed pages,
> or the solutions brought up by David. But this seemed like the most
> logical to me, as it keeps some of the tunability. I now understand
> the concern wasn't so much the capping, but rather the silent nature of
> it, and the uAPI expectations surrounding enforcing such a limit (for
> both past and future behavioral expectations).

Yes, that's the primary concern on my side.

> > > > > > >
> > > > > > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > > > > > >
> > > > > > > And for the intermediate values
> > > > > > >
> > > > > > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse
> > > > > > > is not supported yet with other values
> > > > > >
> > > > > > It feels a bit much to issue a kernel warning every time somebody
> > > > > > twiddles that value, and it's kind of against user expectation a
> > > > > > bit.
> > > > >
> > > > > pr_warn_once() is what I meant.
> > > >
> > > > Right, but even then it feels a bit extreme, warnings are pretty serious
> > > > things. Then again there's precedent for this, and it may be the least
> > > > bad solution.
> > > >
> > > > I just picture a cloud provider turning this on with mTHP then getting
> > > > their monitoring team reporting some urgent communication about
> > > > warnings in dmesg :)
> > >
> > > I mean, one could make the states mutually exclusive, maybe?
> > >
> > > Disallow enabling mTHP with max_ptes_none set to unsupported values and
> > > the other way around.
> > >
> > > That would probably be cleanest, although the implementation might get a
> > > bit more involved (but it's solvable).
> > >
> > > But the concern could be that there are configs that could suddenly break:
> > > someone that set max_ptes_none and enabled mTHP.
> >
> > Yeah we could always return an error on setting to an unsupported value.
> >
> > I mean pr_warn() is nasty but maybe necessary.
> >
> > >
> > >
> > > I'll note that we could also consider only supporting "max_ptes_none =
> > > 511" (default) to start with.
> > >
> > > The nice thing about that value is that it is fully supported with the
> > > underused shrinker, because max_ptes_none=511 -> never shrink.
> >
> > It feels like = 0 would be useful though?
>
> I personally think the default of 511 is wrong and should be on the
> lower end of the scale. The exception being thp=always, where I
> believe the kernel should treat it as 511.

I think that'd be confusing to have different behaviour for thp=always, and
I'd rather we didn't do that.

But ultimately it's all moot I think as these are all uAPI things now.

It was a mistake to even export this IMO, but that can't be helped now :)

>
> But the second part of that would also violate the user's max_ptes_none
> setting, so it's probably much harder in practice, and also not really
> part of this series, just my opinion.

I'm confused what you mean here?

In any case I think the 511/0 solution is the way forwards.

>
> Cheers.
> -- Nico
>
> >
> > >
> > > --
> > > Cheers
> > >
> > > David / dhildenb
> > >
> >
> > Thanks, Lorenzo
>

Cheers, Lorenzo
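
For anyone following along, the scaling arithmetic and the "creep" example
discussed in this thread can be replayed with a small userspace sketch. This
is not kernel code. It assumes the per-order limit is obtained with the
"max_ptes_none >> (HPAGE_PMD_ORDER - order)" shift quoted above, uses the toy
PMD of 8 entries from the example, and assumes a scan that tries the largest
order first and starts over after any collapse. The function names
(scaled_max_ptes_none, collapse_at, simulate) are made up for illustration.

```python
HPAGE_PMD_ORDER = 9  # assumption: 4 KiB base pages, 512 PTEs per PMD


def scaled_max_ptes_none(max_ptes_none, order, pmd_order=HPAGE_PMD_ORDER):
    """Scale a PMD-level max_ptes_none down to a smaller collapse order."""
    return max_ptes_none >> (pmd_order - order)


def collapse_at(ptes, order, limit):
    """Return a copy of ptes where every aligned block of 2**order entries
    holding at most `limit` none/zero ('Z') entries becomes all present ('P').
    """
    size = 1 << order
    out = list(ptes)
    for start in range(0, len(ptes), size):
        block = ptes[start:start + size]
        if block.count("Z") <= limit:
            out[start:start + size] = ["P"] * size
    return out


def simulate(ptes, max_ptes_none, pmd_order, orders):
    """Repeat scans (largest order first) until a scan changes nothing."""
    ptes = list(ptes)
    while True:
        for order in sorted(orders, reverse=True):
            limit = scaled_max_ptes_none(max_ptes_none, order, pmd_order)
            new = collapse_at(ptes, order, limit)
            if new != ptes:
                ptes = new
                break  # something collapsed: begin the next scan
        else:
            return "".join(ptes)  # a full scan collapsed nothing


# Toy PMD of 8 entries (order 3) plus a 2-entry mTHP size (order 1).
layout = "PZPZPZZZ"
print(simulate(layout, 4, pmd_order=3, orders=(3, 1)))  # PPPPPPPP (creep)
print(simulate(layout, 3, pmd_order=3, orders=(3, 1)))  # PZPZPZZZ (no creep)
```

With the limit at exactly half (4 of 8, the analogue of 256 of 512), the
2-entry collapses turn enough zero entries into present ones that a later scan
collapses the whole PMD. One below half (3 of 8, the analogue of 255) scales
to a limit of 0 at the 2-entry size, so nothing collapses and the layout stays
put.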
