On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
>The following series provides khugepaged with the capability to collapse
>anonymous memory regions to mTHPs.
>
>To achieve this we generalize the khugepaged functions to no longer depend
>on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>pages that are occupied (!none/zero). After the PMD scan is done, we use
>the bitmap to find the optimal mTHP sizes for the PMD range. The
>restriction on max_ptes_none is removed during the scan, to make sure we
>account for the whole PMD range in the bitmap. When no mTHP size is
>enabled, the legacy behavior of khugepaged is maintained.
>
>We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
>(ie 511). If any other value is specified, the kernel will emit a warning
>and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
>but contains swapped out, or shared pages, we don't perform the collapse.
>It is now also possible to collapse to mTHPs without requiring the PMD THP
>size to be enabled. These limitations are to prevent collapse "creep"
>behavior. This prevents constantly promoting mTHPs to the next available
>size, which would occur because a collapse introduces more non-zero pages
>that would satisfy the promotion condition on subsequent scans.
>
>Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
>             for arbitrary orders.
>Patch 3:     Rework max_ptes_* handling into helper functions
>Patch 4:     Generalize __collapse_huge_page_* for mTHP support
>Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
>Patch 6:     Generalize collapse_huge_page for mTHP collapse
>Patch 7:     Skip collapsing mTHP to smaller orders
>Patch 8-9:   Add per-order mTHP statistics and tracepoints
>Patch 10:    Introduce collapse_allowable_orders helper function
>Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
>Patch 14:    Documentation
>
>Testing:
>- Built for x86_64, aarch64, ppc64le, and s390x
>- ran all arches on test suites provided by the kernel-tests project
>- internal testing suites: functional testing and performance testing
>- selftests mm
>- I created a test script that I used to push khugepaged to its limits
>   while monitoring a number of stats and tracepoints. The code is
>   available here[1] (Run in legacy mode for these changes and set mthp
>   sizes to inherit)
>   The summary from my testings was that there was no significant
>   regression noticed through this test. In some cases my changes had
>   better collapse latencies, and was able to scan more pages in the same
>   amount of time/work, but for the most part the results were consistent.
>- redis testing. I did some testing with these changes along with my defer
>  changes (see followup [2] post for more details). We've decided to get
>  the mTHP changes merged first before attempting the defer series.
>- some basic testing on 64k page size.
>- lots of general use.
>
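As a side note on the bitmap step quoted above, here is a minimal userspace
sketch of how an occupancy bitmap could be turned into a set of collapse
orders. It is only an illustration, not the code in this series: it assumes
4K pages (PMD order 9), order 2 as the smallest mTHP order, and the
max_ptes_none == 0 case where a range qualifies only if every PTE in it is
occupied. The names range_full() and pick_orders() are hypothetical.

```c
#include <stdbool.h>

#define PMD_ORDER 9                 /* assumption: 4K pages, 2M PMD */
#define PMD_NR    (1 << PMD_ORDER)  /* 512 PTEs per PMD range */
#define MIN_ORDER 2                 /* assumption: smallest mTHP order */

/* Hypothetical helper: true if every PTE in [start, start + nr) is occupied. */
static bool range_full(const bool *occupied, int start, int nr)
{
	for (int i = start; i < start + nr; i++)
		if (!occupied[i])
			return false;
	return true;
}

/*
 * Hypothetical helper: walk the PMD range and, at each offset, pick the
 * largest enabled order that is naturally aligned and fully occupied.
 * chosen[k] counts the ranges picked at order k. Returns the number of
 * PTEs that were not covered by any pick.
 */
static int pick_orders(const bool *occupied, unsigned long enabled_orders,
		       int *chosen)
{
	int left = 0;
	int off = 0;

	while (off < PMD_NR) {
		int order;

		for (order = PMD_ORDER; order >= MIN_ORDER; order--) {
			int nr = 1 << order;

			if (!(enabled_orders & (1UL << order)))
				continue;	/* size not enabled */
			if (off & (nr - 1))
				continue;	/* not aligned for this order */
			if (range_full(occupied, off, nr))
				break;		/* largest fit found */
		}

		if (order >= MIN_ORDER) {
			chosen[order]++;
			off += 1 << order;
		} else {
			left++;
			off++;
		}
	}
	return left;
}
```

With orders 2 and 4 enabled and only the first 24 PTEs occupied, this picks
one order-4 range and two order-2 ranges and leaves the remaining 488 PTEs
alone.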
Two links are missing. I took them from the previous version:

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/[email protected]/

The test in [1] is a performance test. I wonder whether we also want a
functional test in selftests. I gave it a quick try with the following change
plus some hacks.

@@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
 	ksft_test_result_report(exit_status, "%s\n", __func__);
 }
 
+static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
+{
+	struct thp_settings settings = *thp_current_settings();
+	void *p;
+	int i;
+
+	/* Disable mthp on fault */
+	for (i = 0; i < NR_ORDERS; i++) {
+		settings.hugepages[i].enabled = THP_NEVER;
+	}
+	thp_push_settings(&settings);
+
+	p = ops->setup_area(1);
+
+	ops->fault(p, 0, hpage_pmd_size);
+
+	/* Expect all order-0 folio after fault */
+	memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+	expected_orders[0] = hpage_pmd_nr;
+	if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+			       kpageflags_fd, expected_orders,
+			       (pmd_order + 1)))
+		ksft_exit_fail_msg("Unexpected huge page at fault\n");
+
+	/* Enable mthp before collapse */
+	thp_pop_settings();
+	settings.hugepages[2].enabled = THP_ALWAYS;
+	thp_push_settings(&settings);
+
+	c->collapse("Collapse fully populated PTE table with order 2", p, 1,
+		    ops, true);
+
+	/* Expect all order-2 folio after collapse */
+	memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+	expected_orders[2] = 1 << (pmd_order - 2);
+	if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+			       kpageflags_fd, expected_orders,
+			       (pmd_order + 1)))
+		ksft_exit_fail_msg("Unexpected page order\n");
+
+	ops->cleanup_area(p, hpage_pmd_size);
+	thp_pop_settings();
+	ksft_test_result_report(exit_status, "%s\n", __func__);
+}
+
 static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;

This leverages check_after_split_folio_orders() in split_huge_page_test.c to
check the folio orders in the PMD range.

-- 
Wei Yang
Help you, Help me
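
For anyone checking the expected counts in the test above: a PMD range that
consists only of order-k folios holds 1 << (pmd_order - k) of them, which is
where both hpage_pmd_nr (k = 0) and 1 << (pmd_order - 2) come from. A minimal
standalone sketch of that arithmetic follows. The helpers folios_per_pmd() and
fill_expected_orders() are hypothetical and not part of the selftests.

```c
#include <string.h>

/*
 * Hypothetical helper: number of order-`order` folios needed to cover one
 * PMD range, or -1 if the order does not fit in a PMD.
 */
static int folios_per_pmd(int pmd_order, int order)
{
	if (order < 0 || order > pmd_order)
		return -1;
	return 1 << (pmd_order - order);
}

/*
 * Hypothetical helper: fill expected[0..pmd_order] for a PMD range that is
 * made up only of order-`order` folios. Returns 0 on success, -1 on a bad
 * order.
 */
static int fill_expected_orders(int *expected, int pmd_order, int order)
{
	int n = folios_per_pmd(pmd_order, order);

	if (n < 0)
		return -1;

	memset(expected, 0, sizeof(int) * (pmd_order + 1));
	expected[order] = n;
	return 0;
}
```

Assuming 4K pages (pmd_order = 9), this gives 512 order-0 folios after the
fault and 128 order-2 folios after the collapse.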
