On Fri, Sep 12, 2025 at 06:28:55PM -0600, Nico Pache wrote:
> On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
> <lorenzo.stoa...@oracle.com> wrote:
> >
> > On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > > > With all this stuff said, do we have an actual plan for what we intend to do
> > > > > _now_?
> > >
> > > Oh no, no I have to use my brain and it's Friday evening.
> >
> > I apologise :)
> >
> > >
> > > >
> > > > As Nico has implemented a basic solution here that we all seem to agree 
> > > > is not
> > > > what we want.
> > > >
> > > > Without needing special new hardware or major reworks, what would this 
> > > > parameter
> > > > look like?
> > > >
> > > > What would the heuristics be? What about the eagerness scales?
> > > >
> > > > I'm but a simple kernel developer,
> > >
> > > :)
> > >
> > > and interested in simple pragmatic stuff :)
> > > > do you have a plan right now David?
> > >
> > > Ehm, if you ask me that way ...
> > >
> > > >
> > > > Maybe we can start with something simple like a rough percentage per 
> > > > eagerness
> > > > entry that then gets scaled based on utilisation?
> > >
> > > ... I think we should probably:
> > >
> > > 1) Start with something very simple for mTHP that doesn't lock us into 
> > > any particular direction.
> >
> > Yes.
> >
> > >
> > > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as 
> > > well
> >
> > Yes I think we're all pretty onboard with that it seems!
> >
> > >
> > > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
> >
> > Right, I feel like we could start with some very simple linear thing here and
> > later maybe refine it?
>
> I agree, something like 0,32,64,128,255,511 seem to map well, and is
> not too different from what im doing with the scaling by
> (HPAGE_PMD_ORDER - order).

Actually, I suspect something like what David suggests in [0] is probably the
better way, but as I said there I think it should be an internal implementation
detail as to what this ultimately ends up being.

The idea is we provide an abstract thing a user can set, and the kernel figures
out how best to interpret that.

[0]:https://lore.kernel.org/linux-mm/cd8e7f1c-a563-4ae9-a0fb-b0d04a4c3...@redhat.com/

>
> >
> > >
> > > 4) Solve world peace and world hunger
> >
> > Yes! That would be pretty great ;)
> This should probably be a larger priority

:)))

> >
> > >
> > > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / 
> > > hw hotness / #whatever
> >
> > I think these are TODOs :)
> >
> > >
> > >
> > > I maintain my initial position that just using
> > >
> > > max_ptes_none == 511 -> collapse mTHP always
> > > > max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> > >
> > > As a starting point is probably simple and best, and likely leaves room 
> > > for any
> > > changes later.
> >
> > Yes.
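
To make that starting point concrete, here is a minimal userspace sketch of the
rule. It assumes 4K base pages (HPAGE_PMD_ORDER == 9, so 512 PTEs per PMD);
mthp_may_collapse() and nr_none are hypothetical names used only for
illustration, not anything that exists in khugepaged.

```c
#include <stdbool.h>

/* Assumption: 4K base pages, so a PMD maps 512 PTEs (as on x86-64). */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR (1U << HPAGE_PMD_ORDER)

/*
 * Hypothetical helper, not an existing khugepaged function.
 * nr_none: number of none/zero PTEs found in the candidate mTHP range.
 */
static bool mthp_may_collapse(unsigned int max_ptes_none, unsigned int nr_none)
{
	/* max_ptes_none == 511: always collapse. */
	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return true;
	/* Any other value: collapse only when no PTE is none/zero. */
	return nr_none == 0;
}
```

Under this sketch every value other than 511 behaves identically for mTHP, which
is what leaves room to reinterpret the intermediate values later.
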
> >
> > >
> > >
> > > Of course, we could do what Nico is proposing here, as 1) and change it 
> > > all later.
> >
> > Right.
> >
> > But that does mean for mTHP we're limited to 256 (or 255 was it?) but I 
> > guess
> > given the 'creep' issue that's sensible.
>
> I don't think that's much different to what David is trying to propose,
> given eagerness=9 would be 50%.

I think q

> at 10 or 511, no matter what, you will only ever collapse to the
> largest enabled order.
> The difference in my approach is that technically, with PMD disabled,
> and 511, you would still need 50% utilization to collapse, which is
> not ideal if you always want to collapse to some mTHP size even with 1
> page occupied. With David's solution this is solved by never allowing
> anything in between 255-511.

Right. Except we default to max eagerness (or min, I asked David about the
values there :P)

So aren't we, by default, broken on mTHP? Maybe we can change the default
though...
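
A rough sketch of the cap-and-scale behaviour being discussed may help here. It
is based only on the description in this thread (cap at HPAGE_PMD_NR/2 - 1 for
mTHP, then scale by HPAGE_PMD_ORDER - order) and is not the code from the
series. It assumes 4K base pages, and mthp_scaled_max_ptes_none() is a
hypothetical name.

```c
/* Assumption: 4K base pages, so a PMD maps 512 PTEs (as on x86-64). */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR (1U << HPAGE_PMD_ORDER)

/*
 * Hypothetical helper: how many none/zero PTEs a collapse to 'order'
 * would tolerate for a given max_ptes_none setting.
 */
static unsigned int mthp_scaled_max_ptes_none(unsigned int max_ptes_none,
					      unsigned int order)
{
	/* For mTHP (order below PMD order), cap just below half a PMD. */
	if (order < HPAGE_PMD_ORDER && max_ptes_none >= HPAGE_PMD_NR / 2)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;

	/* Scale the PMD-sized limit down to the target order. */
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
```

With max_ptes_none = 511 and order 4 (16 PTEs) this gives 7, so at least 9 of
the 16 PTEs must be populated. That is the "still need 50% utilization" case
when the setting is left at 511.
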

>
> >
> > >
> > > > It's just when it comes to documenting all that stuff in patch #15 that I feel like
> > > > "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> > > > depend on any weird behavior here by over-documenting it".
> > >
> > > I mean
> > >
> > > "
> > > +To prevent "creeping" behavior where collapses continuously promote to 
> > > larger
> > > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > > +that introducing more than half of the pages to be non-zero it will 
> > > always
> > > +satisfy the eligibility check on the next scan and the region will be 
> > > collapse.
> > > "
> > >
> > > > Is just way, way too detailed.
> > >
> > > I would just say "The kernel might decide to use a more conservative 
> > > approach
> > > when collapsing smaller THPs" etc.
> > >
> > >
> > > Thoughts?
> >
> > Well I've sort of reviewed oppositely there :) well at least that it needs 
> > to be
> > a hell of a lot clearer (I find that comment really compressed and I just 
> > don't
> > really understand it).
>
> I think your review is still valid to improve the internal code
> comment. I think David is suggesting to not be so specific in the
> actual admin-guide docs as we move towards a more opaque tunable.

Yeah thanks for pointing that out! We were talking at cross purposes.

>
> >
> > I guess I didn't think about people reading that and relying on it, so 
> > maybe we
> > could alternatively make that succinct.
> >
> > But I think it'd be better to say something like "mTHP collapse cannot 
> > currently
> > correctly function with half or more of the PTE entries empty, so we cap at 
> > just
> > below this level" in this case.
>
> Some middle ground might be the best answer, not too specific, but
> also allude to the interworking a little.

Yeah actually I agree with David re: documentation, my comments were wrt
err... comments :P only.

>
> Cheers,
> -- Nico

Cheers, Lorenzo
