On 2025/9/3 04:23, Usama Arif wrote:


On 02/09/2025 12:03, David Hildenbrand wrote:
On 02.09.25 12:34, Usama Arif wrote:


On 02/09/2025 10:03, David Hildenbrand wrote:
On 02.09.25 04:28, Baolin Wang wrote:


On 2025/9/2 00:46, David Hildenbrand wrote:
On 29.08.25 03:55, Baolin Wang wrote:


On 2025/8/28 18:48, Dev Jain wrote:

On 28/08/25 3:16 pm, Baolin Wang wrote:
(Sorry for chiming in late)

On 2025/8/22 22:10, David Hildenbrand wrote:
One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
but I am not sure we have to add that for now.

Yeah, not so sure about this; this is a 'just have to know' value too,
and yes, you might add it to the docs, but people are going to be
mightily confused, especially if it's a calculated value.

I don't see any other way around having a separate tunable if we
don't just have
something VERY simple like on/off.

Yeah, I am not advocating that we add support for values other than
0/511, really.


Also, the mentioned issue sounds like something that needs to be fixed
elsewhere, honestly, in the algorithm used to figure out mTHP ranges
(I may be wrong, and happy to stand corrected if this is somehow
inherent, but it really feels that way).

I think the creep is unavoidable for certain values.

If you have the first two pages of a PMD area populated, and you allow
for at least half of the #PTEs to be none/zero, you'd first collapse an
order-2 folio, then an order-3 ... until you reach PMD order.
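
A minimal userspace sketch of that creep (my own illustration, not
kernel code; the "half may be none" threshold and the two initially
populated PTEs are assumptions taken from the example above):

/*
 * Hypothetical simulation of the collapse creep: each successful
 * collapse fills the whole range with mapped PTEs, which then
 * satisfies the "half may be none" threshold for the next higher
 * order, all the way up to PMD order (9 on x86-64).
 */
#include <stdio.h>

int main(void)
{
	int populated = 2;	/* PTEs initially mapped */

	for (int order = 2; order <= 9; order++) {
		int nr_ptes = 1 << order;
		int max_none = nr_ptes / 2;	/* assumed policy */

		if (nr_ptes - populated > max_none) {
			printf("order-%d: too many holes, stop\n", order);
			break;
		}
		printf("order-%d: collapse (%d of %d populated)\n",
		       order, populated, nr_ptes);
		populated = nr_ptes;	/* collapse fills the range */
	}
	return 0;
}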

So for now we really should just support 0 / 511 to say "don't
collapse if there are holes" vs. "always collapse if there is at
least one pte used".

If we only allow setting 0 or 511, as Nico mentioned before, "At 511,
no mTHP collapses would ever occur anyway, unless you have 2MB
disabled and other mTHP sizes enabled. Technically, at 511, only the
highest enabled order would ever be collapsed."
I didn't understand this statement. At 511, mTHP collapses will occur if
khugepaged cannot get a PMD folio. Our goal is to collapse to the
highest order folio.

Yes, I’m not saying that it’s incorrect behavior when set to 511. What I
mean is, as in the example I gave below, users may only want to allow a
large order collapse when the number of present PTEs reaches half of the
large folio, in order to avoid RSS bloat.

How do these users control allocation at fault time where this parameter
is completely ignored?

Sorry, I did not get your point. Why does the 'max_pte_none' need to
control allocation at fault time? Could you be more specific? Thanks.

The comment over khugepaged_max_ptes_none gives a hint:

/*
 * default collapse hugepages if there is at least one pte mapped like
 * it would have happened if the vma was large enough during page
 * fault.
 *
 * Note that these are only respected if collapse was initiated by khugepaged.
 */

In the common case (for anything that really cares about RSS bloat) you
will just get a THP during page fault and, consequently, RSS bloat.

As raised in my other reply, the only documented reason to set
max_ptes_none=0 seems to be when an application later (after possibly
having gotten a THP already during page faults) did some MADV_DONTNEED
and wants to control the usage of THPs itself using MADV_COLLAPSE.
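
A hedged sketch of that flow (my own illustration; assumes Linux >= 6.1
for MADV_COLLAPSE and a 2 MiB PMD size, with error handling kept
minimal):

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* uapi value; older libcs lack the define */
#endif

#define PMD_SZ (2UL << 20)	/* assumed 2 MiB PMD size (x86-64) */

int main(void)
{
	/* Over-map so a PMD-aligned 2 MiB window is guaranteed inside. */
	char *raw = mmap(NULL, 2 * PMD_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;

	if (raw == MAP_FAILED)
		return 1;
	buf = (char *)(((uintptr_t)raw + PMD_SZ - 1) & ~(PMD_SZ - 1));

	buf[0] = 1;				/* fault some memory in */
	madvise(buf, 4096, MADV_DONTNEED);	/* punch a hole */

	/* Later, the application decides itself when to collapse. */
	if (madvise(buf, PMD_SZ, MADV_COLLAPSE))
		perror("MADV_COLLAPSE");
	return 0;
}

With max_ptes_none=0, khugepaged leaves the hole alone and the collapse
happens only on the application's explicit request (per the comment
above, MADV_COLLAPSE does not respect the tunable).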

It's a questionable use case that already got more problematic with
mTHP and page table reclaim.

Let me explain:

Before mTHP, if someone did MADV_DONTNEED (resulting in a page table
with at least one pte_none entry), there would have been no way we
would get memory over-allocated afterwards with max_ptes_none=0.

(1) Page faults would spot "there is a page table" and just fall back
    to order-0 pages.
(2) khugepaged was told not to collapse through max_ptes_none=0.

But now:

(A) With mTHP during page faults, we can just end up over-allocating
    memory in such an area again: page faults will simply spot a bunch
    of pte_nones around the fault area and install an mTHP.

(B) With page table reclaim (when zapping all PTEs in a table at
    once), we will reclaim the page table. The next page fault will
    just try installing a PMD THP again, because there is no PTE table
    anymore.

So I question the utility of max_ptes_none. If you can't tame page
faults, then there is only limited sense in taming khugepaged. I think
there is value in setting max_ptes_none=0 for some corner cases, but I
have yet to learn why max_ptes_none=123 would make any sense.

Thanks David for your explanation. I see your point now.

For PMD-mapped THPs with the THP shrinker, this has changed. You can
basically tame page faults: when you encounter memory pressure, the
shrinker kicks in if the value is less than HPAGE_PMD_NR - 1 (i.e. 511
for x86), and will break down those hugepages and free up zero-filled
memory.

You are not really taming page faults, though; you are undoing what
page faults might have messed up :)

I have seen in our prod workloads that memory usage and THP usage can
spike (usually when the workload starts), but under memory pressure,
memory usage is lower than with max_ptes_none = 511, while still
keeping the benefits of THPs like lower TLB misses.

Thanks for raising that: I think the current behavior is in place such that you 
don't bounce back-and-forth between khugepaged collapse and shrinker-split.


Yes, both collapse and shrinker split hinge on max_ptes_none to
prevent one of them from thrashing the effect of the other.

There are likely other ways to achieve that, keeping in mind that the
THP shrinker will install zero pages and max_ptes_none counts zero
pages.


I do agree that the value of max_ptes_none is magical, and different
workloads can react very differently to it. The relationship is
definitely not linear: if I use max_ptes_none = 256, it does not mean
that the memory regression of using THP=always vs THP=madvise is
halved.

To which value would you set it? Just 510? 0?


There are some very large workloads in the Meta fleet that I
experimented with and found that having a small value works out. I
experimented with 0, 51 (10%) and 256 (50%). 51 was found to be a good
compromise in terms of application metrics improving, having an
acceptable amount of memory regression, and improved system-level
metrics (lower TLB misses, lower page faults). I am sure there was a
better value out there for these workloads, but it was not possible to
experiment with every value.

In terms of wider rollout across the fleet, we are going to target 0
(or a very small value) when moving from THP=madvise to always, mainly
because it is the least likely to cause a memory regression: the THP
shrinker will deal with page faults faulting in mostly zero-filled
pages, and khugepaged won't collapse pages that are dominated by 4K
zero-filled chunks.
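
For reference, a trivial sketch of pinning the knob from userspace (my
own illustration; the sysfs path is the documented one, and writing it
needs root):

#include <stdio.h>

int main(void)
{
	const char *knob =
		"/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "0\n");	/* the rollout target discussed above */
	return fclose(f) ? 1 : 0;
}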

Thanks for sharing this. We're also investigating what max_ptes_none should be set to in order to use the THP shrinker properly, and currently, our customers always set max_ptes_none to its default value: 511, which is not good.

If 0 is better, it seems like there isn't much conflict with the values expected by mTHP collapse (0 and 511). Sounds good to me.
