On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote:
> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif...@gmail.com> wrote:
> >
> > >>> So I question the utility of max_ptes_none. If you can't tame page
> > >>> faults, then there is only
> > >>> limited sense in taming khugepaged. I think there is vale in setting
> > >>> max_ptes_none=0 for some
> > >>> corner cases, but I am yet to learn why max_ptes_none=123 would make
> > >>> any sense.
> > >>>
> > >>>
> > >>
> > >> For PMD mapped THPs with THP shrinker, this has changed. You can
> > >> basically tame pagefaults, as when you encounter
> > >> memory pressure, the shrinker kicks in if the value is less than
> > >> HPAGE_PMD_NR -1 (i.e. 511 for x86), and
> > >> will break down those hugepages and free up zero-filled memory.
> > >
> > > You are not really taming page faults, though, you are undoing what page
> > > faults might have messed up :)
> > >
> > > I have seen in our prod workloads where
> > >> the memory usage and THP usage can spike (usually when the workload
> > >> starts), but with memory pressure,
> > >> the memory usage is lower compared to with max_ptes_none = 511, while
> > >> still still keeping the benefits
> > >> of THPs like lower TLB misses.
> > >
> > > Thanks for raising that: I think the current behavior is in place such
> > > that you don't bounce back-and-forth between khugepaged collapse and
> > > shrinker-split.
> > >
> >
> > Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one
> > of these things thrashing the effect of the other.
> I believe with mTHP support in khugepaged, the max_ptes_none value in
> the shrinker must also leverage the 'order' scaling to properly
> prevent thrashing.

No please do not extend this 'scaling' stuff somewhere else, it's really horrid.

We have to find an alternative to that, it's extremely confusing in what is
already extremely confusing THP code.

As I said before, if we can't have a boolean we need another interface, which
makes most sense to be a ratio or, in practice, a percentage sysctl.

Speaking with David off-list, maybe the answer - if we must have this - is to
add a new percentage interface and have this in lock-step with the existing
max_ptes_none interface. One updates the other, but internally we're just using
the percentage value.

> I've been testing a patch for this that I might include in the V11.
> >
> > > There are likely other ways to achieve that, when we have in mind that
> > > the thp shrinker will install zero pages and max_ptes_none includes
> > > zero pages.
> > >
> > >>
> > >> I do agree that the value of max_ptes_none is magical and different
> > >> workloads can react very differently
> > >> to it. The relationship is definitely not linear. i.e. if I use
> > >> max_ptes_none = 256, it does not mean
> > >> that the memory regression of using THP=always vs THP=madvise is halved.
> > >
> > > To which value would you set it? Just 510? 0?
> > >
> >
> > There are some very large workloads in the meta fleet that I experimented
> > with and found that having
> > a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51
> > was found to be an optimal
> > comprimise in terms of application metrics improving, having an acceptable
> > amount of memory regression and
> > improved system level metrics (lower TLB misses, lower page faults). I am
> > sure there was a better value out
> > there for these workloads, but not possible to experiment with every value.

(->Usama) It's a pity that such workloads exist. But then the percentage
solution should work.
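
To make the lock-step idea concrete, here is a rough userspace sketch of the
conversion, assuming HPAGE_PMD_NR is 512 (x86-64 with 4K base pages). The
helper names and the round-to-nearest policy are made up for illustration and
do not come from any posted patch; a real interface may well look different.

```c
/* Assumption: 512 PTEs per PMD, as on x86-64 with 4K base pages. */
#define HPAGE_PMD_NR		512
#define MAX_PTES_NONE_LIMIT	(HPAGE_PMD_NR - 1)	/* 511 */

/*
 * Hypothetical helper: map a max_ptes_none value (0..511) to a
 * percentage (0..100), rounding to the nearest whole percent.
 */
static unsigned int ptes_none_to_percent(unsigned int max_ptes_none)
{
	return (max_ptes_none * 100 + MAX_PTES_NONE_LIMIT / 2) /
	       MAX_PTES_NONE_LIMIT;
}

/*
 * Hypothetical helper: map a percentage (0..100) back to a
 * max_ptes_none value (0..511), rounding to the nearest PTE count.
 */
static unsigned int percent_to_ptes_none(unsigned int percent)
{
	return (percent * MAX_PTES_NONE_LIMIT + 50) / 100;
}
```

With this rounding, the values Usama mentions line up: 51 maps to 10% and 256
to 50%, and both map back to the same PTE counts. Note there are 512 possible
max_ptes_none values but only 101 percentages, so a write via the old
interface would generally be quantised once it is stored as a percentage.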
> >
> > In terms of wider rollout across the fleet, we are going to target 0 (or a
> > very very small value)
> > when moving from THP=madvise to always. Mainly because it is the least
> > likely to cause a memory regression as
> > THP shrinker will deal with page faults faulting in mostly zero-filled
> > pages and khugepaged wont collapse
> > pages that are dominated by 4K zero-filled chunks.
> >

(->Usama) Interesting though that you've decided against doing this
fleetwide... I wonder then again whether we truly need non-boolean values. But
the fact workloads might theoretically exist where it's useful does make me
think we have to have this, sadly.

Cheers, Lorenzo