CCing ppc folks

On 2/11/26 13:49, Usama Arif wrote:
When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail—if the
kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On systems
with many THPs, this adds up to significant memory waste. The original
rationale also does not hold up: it is OK for split to fail, and if the
kernel can't satisfy an order-0 allocation for split, there are much
bigger problems. On large servers, which can easily have hundreds of GBs
of THPs, these tables consume 200M per 100G. This memory could be used
for anything else, including allocating the page tables required during
split.
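
As a back-of-the-envelope check of the 200M-per-100G figure, here is a minimal sketch. It assumes 2 MiB PMD-sized THPs and 4 KiB page-table pages (typical of x86-64, but neither size is stated above); the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed sizes: 2 MiB per PMD-mapped THP, 4 KiB per deposited PTE table. */
#define THP_SIZE	(2ULL << 20)
#define PGTABLE_SIZE	(4ULL << 10)

/* Bytes held in deposited PTE tables for a given amount of THP-backed memory. */
static uint64_t deposited_pgtable_bytes(uint64_t thp_bytes)
{
	/* One deposited page table per THP. */
	return (thp_bytes / THP_SIZE) * PGTABLE_SIZE;
}
```

Under those assumptions, 100 GiB of THPs is 51200 huge pages, each holding one 4 KiB table, which comes to 200 MiB.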

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every arch apart from
powerpc, and powerpc itself when the radix MMU is enabled) and allocates
the PTE table lazily, only when a split actually occurs. The split path
is modified to accept a caller-provided page table.

PowerPC exception:

It would have been great if we could completely remove the pagetable
deposit code, which would have made this commit mostly a code cleanup
patch. Unfortunately, the PowerPC hash MMU stores hash slot information
in the deposited page table, so pre-deposit is necessary there. All
deposit/withdraw paths are guarded by arch_needs_pgtable_deposit(), so
PowerPC behavior is unchanged with this patch. On a better note,
arch_needs_pgtable_deposit() will always evaluate to false at compile
time on non-PowerPC architectures and the pre-deposit code will not be
compiled in.

Is there a way to remove this? It's always been a confusing hack, now it's unpleasant to have around :)

In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 copied generic pgtable_trans_huge_deposit() hurts my belly.


IIUC, hash is mostly used on legacy power systems, radix on newer ones.

So one obvious solution: remove PMD THP support for hash MMUs along with all this hacky deposit code.


The "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar checks need to be wrapped in a reasonable helper, and likely this all needs to get cleaned up further.
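
A rough userspace sketch of what such a helper could look like. The helper name thp_split_needs_pgtable_alloc() is hypothetical and not taken from the patch, and the struct, vma_is_anonymous() and arch_needs_pgtable_deposit() definitions are minimal stand-ins for the kernel ones so that the snippet compiles on its own:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct vm_area_struct (only the field used here). */
struct vm_area_struct {
	const void *vm_ops;	/* NULL for anonymous mappings */
};

/* Stand-in mirroring the kernel check: anonymous VMAs have no vm_ops. */
static inline bool vma_is_anonymous(const struct vm_area_struct *vma)
{
	return !vma->vm_ops;
}

/* Stand-in: a runtime flag replaces the per-arch definition for this sketch. */
static bool mock_arch_needs_deposit;

static inline bool arch_needs_pgtable_deposit(void)
{
	return mock_arch_needs_deposit;
}

/*
 * Hypothetical helper: true when an anon PMD mapping has no deposited
 * PTE table, so the split path has to allocate one itself.
 */
static inline bool thp_split_needs_pgtable_alloc(const struct vm_area_struct *vma)
{
	return vma_is_anonymous(vma) && !arch_needs_pgtable_deposit();
}
```

Callers would then test one predicate instead of open-coding the two-part condition at every site.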

The implementation of the generic pgtable_trans_huge_deposit(), the radix handlers, etc. must be removed. If any code were to trigger them, it would be a bug.

If we have to keep this around, pgtable_trans_huge_deposit() should likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there will not be generic support for it.

--
Cheers,

David
