On Tue, Oct 28, 2025 at 8:10 PM Baolin Wang <[email protected]> wrote:
>
> On 2025/10/29 02:59, Lorenzo Stoakes wrote:
> > On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
> >>
> >>>>> Hey Lorenzo,
> >>>>>
> >>>>>> I mean not to beat a dead horse re: v11 commentary, but I thought
> >>>>>> we were going to implement David's idea re: the new 'eagerness'
> >>>>>> tunable, and again we're now just implementing the capping at
> >>>>>> HPAGE_PMD_NR/2 - 1 thing again?
> >>>>>
> >>>>> I spoke to David and he said to continue forward with this series; the
> >>>>> "eagerness" tunable will take some time, and may require further
> >>>>> considerations/discussion.
> >>>>
> >>>> Right, after talking to Johannes it got clearer that what we envisioned
> >>>> with
> >>>
> >>> I'm not sure that you meant to say go ahead with the series as-is with
> >>> this silent capping?
> >>
> >> No, "go ahead" as in "let's find some way forward that works for all and is
> >> not too crazy".
> >
> > Right we clearly needed to discuss that further at the time but that's moot
> > now, we're figuring it out now :)
> >
> >>
> >> [...]
> >>
> >>>> "eagerness" would not be like swappiness, and we will really have to be
> >>>> careful here. I don't know yet when I will have time to look into that.
> >>>
> >>> I guess I missed this part of the conversation, what do you mean?
> >>
> >> Johannes raised issues with that on the list and afterwards we had an
> >> offline discussion about some of the details and why something
> >> unpredictable is not good.
> >
> > Could we get these details on-list so we can discuss them? This doesn't
> > have to be urgent, but I would like to have a say in this or at least be
> > part of the conversation please.
> >
> >>
> >>>
> >>> The whole concept is that we have a parameter whose value is
> >>> _abstracted_ and which we control what it means.
> >>>
> >>> I'm not sure exactly why that would now be problematic?
> >>> The fundamental concept seems sound no? Last I remember of the
> >>> conversation this was the case.
> >>
> >> The basic idea was to do something abstracted as swappiness. Turns out
> >> "swappiness" is really something predictable, not something we can randomly
> >> change how it behaves under the hood.
> >>
> >> So we'd have to find something similar for "eagerness", and that's where it
> >> stops being easy.
> >
> > I think we shouldn't be too stuck on
> >
> >>
> >>>
> >>>>
> >>>> If we want to avoid the implicit capping, I think there are the following
> >>>> possible approaches
> >>>>
> >>>> (1) Tolerate creep for now, maybe warning if the user configures it.
> >>>
> >>> I mean this seems a viable option if there is pressure to land this series
> >>> before we have a viable uAPI for configuring this.
> >>>
> >>> A part of me thinks we shouldn't rush series in for that reason though and
> >>> should require that we have a proper control here.
> >>>
> >>> But I guess this approach is the least-worst as it leaves us with the most
> >>> options moving forwards.
> >>
> >> Yes. There is also the alternative of respecting only 0 / 511 for mTHP
> >> collapse for now as discussed in the other thread.
> >
> > Yes I guess let's carry that on over there.
> >
> > I mean this is why I said it's better to try to keep things in one thread
> > :) but anyway, we've forked and can't be helped now.
> >
> > To be clear that was a criticism of - email development - not you.
> >
> > It's _extremely easy_ to have this happen because one thread naturally
> > leads to a broader discussion of a given topic, whereas another has
> > questions from somebody else about the same topic, to which people reply
> > and then... you have a fork and it can't be helped.
> >
> > I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
> >
> > But that's also broken in its own way, you can't stop people from replying
> > in the other thread still and yeah.
> > It's a limitation of this model :)
> >
> >>
> >>>
> >>>> (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> >>>
> >>> Would this really make all that much difference?
> >>
> >> It solves the creep problem I think, but it's a bit nasty IMHO.
> >
> > Ah because you'd end up with a bunch of zeroed pages from the prior mTHP
> > collapses, interesting...
> >
> > Scanning for that does seem a bit nasty though yes...
> >
> >>
> >>>
> >>>> (3) Have separate toggles for each THP size. Doesn't quite solve the
> >>>> problem, only shifts it.
> >>>
> >>> Yeah I did wonder about this as an alternative solution. But of course it
> >>> then makes it vague what the parent values means in respect of the
> >>> individual levels, unless we have an 'inherit' mode there too (possible).
> >>>
> >>> It's going to be confusing though as max_ptes_none sits at the root
> >>> khugepaged/ level and I don't think any other parameter from khugepaged/
> >>> is exposed at individual page size levels.
> >>>
> >>> And of course doing this means we
> >>>
> >>>>
> >>>> Anything else?
> >>>
> >>> Err... I mean I'm not sure if you missed it but I suggested an approach
> >>> in the sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> >>>
> >>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >>>
> >>> Then we allow the capping, but simply document that we specify what the
> >>> capped value will be here for mTHP.
> >>
> >> I did not have time to read the details on that so far.
> >
> > OK. It is a bit nasty, yes. The idea is to find something that allows the
> > capping to work.
> >
> >>
> >> It would be one solution forward. I dislike it because I think the whole
> >> capping is an intermediate thing that can be (and likely must be, when
> >> considering mTHP underused shrinking I think) solved in the future
> >> differently. That's why I would prefer adding this only if there is no
> >> other, simpler, way forward.
> >
> > Yes I agree that if we could avoid it it'd be great.
> >
> > Really I proposed this solution on the basis that we were somehow ok with
> > the capping.
> >
> > If we can avoid it, that'd be ideal as it reduces complexity and
> > 'unexpected' behaviour.
> >
> > We'll clarify on the other thread, but the 511/0 was compelling to me
> > before as a simplification, and if we can have a straightforward model of
> > how mTHP collapse across none/zero page PTEs behaves this is ideal.
> >
> > The only question is w.r.t. warnings etc. but we can handle details there.
> >
> >>
> >>>
> >>> That struck me as the simplest way of getting this series landed without
> >>> necessarily violating any future eagerness which:
> >>>
> >>> a. Must still support khugepaged/max_ptes_none - we aren't getting away
> >>>    from this, it's uAPI.
> >>>
> >>> b. Surely must want to do different things for mTHP in eagerness, so if
> >>>    we're exposing some PTE value in max_ptes_none doing so in
> >>>    khugepaged/mthp_max_ptes_none wouldn't be problematic (note again -
> >>>    it's readonly so unlike max_ptes_none we don't have to worry about the
> >>>    other direction).
> >>>
> >>> HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> >>> which case perhaps mthp_max_ptes_none would be problematic in that it is
> >>> some kind of average.
> >>>
> >>> Then again we could always revert to putting this parameter as in (3) in
> >>> that case, ugly but kinda viable.
> >>>
> >>>>
> >>>> IIUC, creep is less of a problem when we have the underused shrinker
> >>>> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> >>>> reclaimed again.
> >>>>
> >>>> So maybe having underused-shrinker support for mTHP as well would be a
> >>>> solution to tackle (1) later?
> >>>
> >>> How viable is this in the short term?
> >>
> >> I once started looking into it, but it will require quite some work,
> >> because the lists will essentially include each and every (m)THP in the
> >> system ... so I think we will need some redesign.
> >
> > Ack.
> >
> > This aligns with non-0/511 settings being non-functional for mTHP atm
> > anyway.
> >
> >>
> >>>
> >>> Another possible solution:
> >>>
> >>> If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> >>>
> >>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> >>>
> >>> As a simple boolean. If switched on then we document that it caps mTHP as
> >>> per Nico's suggestion.
> >>>
> >>> That way we avoid the 'silent' issue I have with all this and it's an
> >>> explicit setting.
> >>
> >> Right, but it's another toggle I wish we wouldn't need. We could of course
> >> also make it some compile-time option, but not sure if that's really any
> >> better.
> >>
> >> I'd hope we find an easy way forward that doesn't require new toggles, at
> >> least for now ...
> >
> > Right, well I agree if we can make this 0/511 thing work, let's do that.
> >
> > Toggles are just 'least worst' workarounds on assumption of the need for
> > capping.
>
> I finally finished reading through the discussions across multiple
> threads :), and it looks like we've reached a preliminary consensus (make
> 0/511 work). Great and thanks!
>
> IIUC, the strategy is, configuring it to 511 means always enabling mTHP
> collapse, configuring it to 0 means collapsing mTHP only if all PTEs are
> non-none/zero, and for other values, we issue a warning and prohibit
> mTHP collapse (avoid Lorenzo's concern about silently changing
> max_ptes_none).
> Then the implementation for collapse_max_ptes_none() should be as follows:
>
> static int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
>         /* ignore max_ptes_none limits */
>         if (full_scan)
>                 return HPAGE_PMD_NR - 1;
>
>         if (order == HPAGE_PMD_ORDER)
>                 return khugepaged_max_ptes_none;
>
>         /*
>          * To prevent creeping towards larger order collapses for mTHP collapse,
>          * we restrict khugepaged_max_ptes_none to only 511 or 0, simplifying the
>          * logic. This means:
>          * max_ptes_none == 511 -> collapse mTHP always
>          * max_ptes_none == 0 -> collapse mTHP only if all PTEs are non-none/zero
>          */
>         if (!khugepaged_max_ptes_none || khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
>         pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n",
>                      HPAGE_PMD_NR - 1);
>         return -EINVAL;
> }
>
> So what do you think?

Yes, I'm glad we finally came to some consensus, despite it being a
less-than-ideal solution. Hopefully the eagerness patchset reintroduces all
the lost functionality in the future.

Cheers
-- Nico
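
For anyone who wants to poke at the 0/511 rule outside the kernel, below is
a minimal userspace sketch of the helper quoted above. It is not the code
from the series. It assumes HPAGE_PMD_ORDER is 9 (x86-64 with 4K pages, so
HPAGE_PMD_NR is 512), uses a plain global as a stand-in for the
khugepaged/max_ptes_none tunable, and replaces pr_warn_once() with a
hypothetical warn_once() helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumption: x86-64 with 4K base pages, so one PMD maps 512 PTEs. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1U << HPAGE_PMD_ORDER)

/* Stand-in for the khugepaged/max_ptes_none sysfs tunable. */
static unsigned int khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;

/* Hypothetical userspace replacement for pr_warn_once(). */
static void warn_once(void)
{
	static bool warned;

	if (warned)
		return;
	warned = true;
	fprintf(stderr,
		"mTHP collapse only supports max_ptes_none of 0 or %u\n",
		HPAGE_PMD_NR - 1);
}

/*
 * Returns how many none/zero PTEs a collapse at 'order' may tolerate,
 * or -EINVAL when mTHP collapse is refused for the current setting.
 */
static int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
	assert(order <= HPAGE_PMD_ORDER);

	/* A full scan ignores the max_ptes_none limit. */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	/* PMD-sized collapse keeps honouring the tunable as-is. */
	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/* mTHP: only 0 and 511 are accepted, scaled down to the order. */
	if (khugepaged_max_ptes_none == 0 ||
	    khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

	warn_once();
	return -EINVAL;
}
```

Under those assumptions, max_ptes_none == 511 gives an order-4 collapse a
limit of 511 >> 5 = 15 (every PTE but one may be none/zero), max_ptes_none
== 0 gives 0, and an intermediate value such as 255 is still returned
unchanged for PMD order but yields -EINVAL for any smaller order.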
