On Fri, Sep 12, 2025 at 06:28:55PM -0600, Nico Pache wrote:
> On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
> <lorenzo.stoa...@oracle.com> wrote:
> >
> > On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > > With all this stuff said, do we have an actual plan for what we intend to do
> > > > _now_?
> > >
> > > Oh no, no I have to use my brain and it's Friday evening.
> >
> > I apologise :)
> >
> > >
> > > >
> > > > As Nico has implemented a basic solution here that we all seem to agree is not
> > > > what we want.
> > > >
> > > > Without needing special new hardware or major reworks, what would this parameter
> > > > look like?
> > > >
> > > > What would the heuristics be? What about the eagerness scales?
> > > >
> > > > I'm but a simple kernel developer,
> > >
> > > :)
> > >
> > > and interested in simple pragmatic stuff :)
> > > > do you have a plan right now David?
> > >
> > > Ehm, if you ask me that way ...
> > >
> > > >
> > > > Maybe we can start with something simple like a rough percentage per eagerness
> > > > entry that then gets scaled based on utilisation?
> > >
> > > ... I think we should probably:
> > >
> > > 1) Start with something very simple for mTHP that doesn't lock us into
> > > any particular direction.
> >
> > Yes.
> >
> > >
> > > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
> >
> > Yes I think we're all pretty onboard with that it seems!
> >
> > >
> > > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
> >
> > Right, I feel like we could start with some very simple linear thing here and
> > later maybe refine it?
>
> I agree, something like 0,32,64,128,255,511 seem to map well, and is
> not too different from what im doing with the scaling by
> (HPAGE_PMD_ORDER - order).

Actually, I suspect something like what David suggests in [0] is probably the
better way, but as I said there I think it should be an internal implementation
detail as to what this ultimately ends up being.

The idea is we provide an abstract thing a user can set, and the kernel figures
out how best to interpret that.

[0]:https://lore.kernel.org/linux-mm/cd8e7f1c-a563-4ae9-a0fb-b0d04a4c3...@redhat.com/

>
> >
> > >
> > > 4) Solve world peace and world hunger
> >
> > Yes! That would be pretty great ;)
> This should probably be a larger priority :)))
> >
> > >
> > > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics /
> > > hw hotness / #whatever
> >
> > I think these are TODOs :)
> >
> > >
> > >
> > > I maintain my initial position that just using
> > >
> > > max_ptes_none == 511 -> collapse mTHP always
> > > max_ptes_none != 511 -> collapse mTHP only if we all PTEs are non-none/zero
> > >
> > > As a starting point is probably simple and best, and likely leaves room for any
> > > changes later.
> >
> > Yes.
> >
> > >
> > >
> > > Of course, we could do what Nico is proposing here, as 1) and change it
> > > all later.
> >
> > Right.
> >
> > But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
> > given the 'creep' issue that's sensible.
>
> I dont think thats much different to what david is trying to propose,
> given eagerness=9 would be 50%. I think q
> at 10 or 511, no matter what, you will only ever collapse to the
> largest enabled order.
> The difference in my approach is that technically, with PMD disabled,
> and 511, you would still need 50% utilization to collapse, which is
> not ideal if you always want to collapse to some mTHP size even with 1
> page occupied. With davids solution this is solved by never allowing
> anything in between 255-511.

Right. Except we default to max eagerness (or min, I asked David about the
values there :P) So aren't we, by default, broken on mTHP?
Maybe we can change the default though...

>
> >
> > >
> > > It's just when it comes to documenting all that stuff in patch #15 that I
> > > feel like
> > > "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> > > depend on any weird behavior here by over-domenting it".
> > >
> > > I mean
> > >
> > > "
> > > +To prevent "creeping" behavior where collapses continuously promote to larger
> > > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > > +that introducing more than half of the pages to be non-zero it will always
> > > +satisfy the eligibility check on the next scan and the region will be collapse.
> > > "
> > >
> > > Is just way, way to detailed.
> > >
> > > I would just say "The kernel might decide to use a more conservative approach
> > > when collapsing smaller THPs" etc.
> > >
> > >
> > > Thoughts?
> >
> > Well I've sort of reviewed oppositely there :) well at least that it needs to be
> > a hell of a lot clearer (I find that comment really compressed and I just don't
> > really understand it).
>
> I think your review is still valid to improve the internal code
> comment. I think David is suggesting to not be so specific in the
> actual admin-guide docs as we move towards a more opaque tunable.

Yeah thanks for pointing that out! We were talking at cross purposes.

>
> >
> > I guess I didn't think about people reading that and relying on it, so maybe we
> > could alternatively make that succinct.
> >
> > But I think it'd be better to say something like "mTHP collapse cannot currently
> > correctly function with half or more of the PTE entries empty, so we cap at just
> > below this level" in this case.
>
> Some middle ground might be the best answer, not too specific, but
> also allude to the interworking a little.

Yeah actually I agree with David re: documentation, my comments were wrt err...
comments :P only.

>
> Cheers,
> -- Nico

Cheers, Lorenzo
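
For reference, the two mTHP options discussed in this thread can be sketched in
plain userspace C as below. This is only an illustration, not the code from the
series: the helper names are hypothetical, the constants assume 4K base pages
with a 512-entry PMD, "scaling by (HPAGE_PMD_ORDER - order)" is assumed to mean
a right shift, and "collapse mTHP always" is assumed to mean that all but one
PTE in the range may be none.

```c
#include <assert.h>

/* Assumed values for 4K base pages; used here for illustration only. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)   /* 512 PTEs per PMD */

/*
 * Option A (hypothetical helper): scale max_ptes_none down to the
 * candidate order, capping it just below half of the PMD range first
 * to avoid the "creep" towards larger orders described in the thread.
 * Assumption: the scaling is a right shift by (HPAGE_PMD_ORDER - order).
 */
static unsigned int mthp_max_ptes_none_scaled(unsigned int max_ptes_none,
                                              unsigned int order)
{
    assert(order <= HPAGE_PMD_ORDER);

    /* A PMD-sized collapse uses the tunable unchanged. */
    if (order == HPAGE_PMD_ORDER)
        return max_ptes_none;

    /* Cap at HPAGE_PMD_NR/2 - 1 (255 here) for mTHP collapses. */
    if (max_ptes_none >= HPAGE_PMD_NR / 2)
        max_ptes_none = HPAGE_PMD_NR / 2 - 1;

    return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/*
 * Option B (hypothetical helper): the simple starting point.
 * 511 means "always collapse"; any other value means "collapse only
 * if no PTE in the range is none/zero".
 */
static unsigned int mthp_max_ptes_none_simple(unsigned int max_ptes_none,
                                              unsigned int order)
{
    assert(order <= HPAGE_PMD_ORDER);

    if (order == HPAGE_PMD_ORDER)
        return max_ptes_none;

    /* Assumption: "always" still requires one populated PTE. */
    if (max_ptes_none == HPAGE_PMD_NR - 1)
        return (1u << order) - 1;

    return 0;
}
```

With these assumptions, an order-4 (16 page) range and max_ptes_none=511 allows
7 empty PTEs under option A and 15 under option B, while max_ptes_none=255
allows 7 under option A and none under option B.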