On 20/08/2025 17:53, Kevin Brodsky wrote:
> On 15/08/2025 10:54, Kevin Brodsky wrote:
>> [...]
>>
>> Performance
>> ===========
>>
>> No arm64 hardware currently implements POE. To estimate the
>> performance impact of kpkeys_hardened_pgtables, a mock implementation
>> of kpkeys has been used, replacing accesses to the POR_EL1 register
>> with accesses to another system register that is otherwise unused
>> (CONTEXTIDR_EL1), and leaving everything else unchanged. Most of the
>> kpkeys overhead is expected to originate from the barrier (ISB) that
>> is required after writing to POR_EL1, and from setting the POIndex
>> (pkey) in page tables; both of these are done in exactly the same way
>> in the mock implementation.
>
> It turns out this wasn't the case regarding the pkey setting - because
> patch 6 gates set_memory_pkey() on system_supports_poe() and not
> arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey()
> into a no-op. Many thanks to Rick Edgecombe for highlighting that the
> overheads were suspiciously low for some benchmarks!
>
>> The original implementation of kpkeys_hardened_pgtables is very
>> inefficient when many PTEs are changed at once, as the kpkeys level
>> is switched twice for every PTE (two ISBs per PTE). Patch 18
>> introduces an optimisation that makes use of the lazy_mmu mode to
>> batch those switches: 1. switch to KPKEYS_LVL_PGTABLES on
>> arch_enter_lazy_mmu_mode(), 2. skip any kpkeys switch while in that
>> section, and 3. restore the kpkeys level on
>> arch_leave_lazy_mmu_mode(). When that last function already issues an
>> ISB (when updating kernel page tables), we get a further
>> optimisation, as we can skip the ISB when restoring the kpkeys level.
>>
>> Both implementations (without and with batching) were evaluated on an
>> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks
>> that involve heavy page table manipulations. The results shown below
>> are relative to the baseline for this series, which is 6.17-rc1. The
>> branches used for all three sets of results (baseline, with/without
>> batching) are available in a repository; see next section.
>>
>> Caveat: these numbers should be seen as a lower bound for the
>> overhead of a real POE-based protection. The hardware checks added by
>> POE are, however, not expected to incur significant extra overhead.
>>
>> Reading example: for the fix_size_alloc_test benchmark, using 1 page
>> per iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35%
>> overhead without batching, and 14.62% overhead with batching. Both
>> results are considered statistically significant (95% confidence
>> interval), indicated by "(R)".
>>
>> +-------------------+----------------------------------+------------------+---------------+
>> | Benchmark         | Result Class                     | Without batching | With batching |
>> +===================+==================================+==================+===============+
>> | mmtests/kernbench | real time                        |            0.30% |         0.11% |
>> |                   | system time                      |        (R) 3.97% |     (R) 2.17% |
>> |                   | user time                        |            0.12% |         0.02% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
>> |                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
>> |                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
>> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
>> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
>> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
>> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
>> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
>> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
>> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
>> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
>> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
>> +-------------------+----------------------------------+------------------+---------------+
>
> These numbers therefore correspond to set_memory_pkey() being a no-op;
> in other words, they represent the overhead of switching the pkey
> register only.
>
> I have amended the mock implementation so that set_memory_pkey() is
> run as it would be on a real POE implementation (i.e. actually setting
> the PTE bits).
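To make the gating issue concrete, here is roughly what the amended
gating looks like - a minimal sketch, not the actual patch 6 code, and
the exact signature of set_memory_pkey() in the series may differ:

int set_memory_pkey(unsigned long addr, int numpages, int pkey)
{
        /*
         * Gating on system_supports_poe() made this a no-op under the
         * mock implementation; gating on arch_kpkeys_enabled() keeps
         * the PTE updates in both the real and the mock case.
         */
        if (!arch_kpkeys_enabled())
                return 0;

        /* ... set the POIndex field in the PTEs mapping the range ... */
        return 0;
}

The mock pkey register switch itself is unchanged: it performs the same
write-plus-ISB sequence as a real POR_EL1 update, only targeting the
unused register. Roughly (the helper name is assumed for illustration):

static inline void kpkeys_write_pkey_reg(u64 val)
{
        write_sysreg(val, contextidr_el1);      /* POR_EL1 on real HW */
        isb();
}
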
> Here are the new results, representing the overhead of both pkey
> register switching and setting the pkey of page table pages (PTPs) on
> alloc/free:
>
> +-------------------+----------------------------------+------------------+---------------+
> | Benchmark         | Result Class                     | Without batching | With batching |
> +===================+==================================+==================+===============+
> | mmtests/kernbench | real time                        |            0.32% |         0.35% |
> |                   | system time                      |        (R) 4.18% |     (R) 3.18% |
> |                   | user time                        |            0.08% |         0.20% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
> |                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
> |                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
> +-------------------+----------------------------------+------------------+---------------+
> Those results are overall very similar to the original ones.
> micromm/fork is however clearly impacted - around 4% additional
> overhead from set_memory_pkey(); it makes sense considering that
> forking requires duplicating (and therefore allocating) a full set of
> page tables. kernbench is also a fork-heavy workload and it gets a 1%
> hit in system time (with batching).
>
> It seems fair to conclude that, on arm64, setting the pkey whenever a
> PTP is allocated/freed is not particularly expensive. The situation
> may well be different on x86 as Rick pointed out, and it may also
> change on newer arm64 systems as I noted further down.
> Allocating/freeing PTPs in bulk should help if setting the pkey in
> the pgtable ctor/dtor proves too expensive.
>
> - Kevin
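To illustrate what "setting the pkey in the pgtable ctor/dtor" amounts
to, a minimal sketch - the helper and constant names here
(kpkeys_protect_ptp(), KPKEYS_PKEY_PGTABLES) are assumptions for
illustration, not necessarily the series' actual API:

/* called from the pgtable ctor: tag the freshly allocated PTP */
static inline bool kpkeys_protect_ptp(struct ptdesc *ptdesc)
{
        unsigned long addr = (unsigned long)ptdesc_address(ptdesc);

        return !set_memory_pkey(addr, 1, KPKEYS_PKEY_PGTABLES);
}

/* called from the pgtable dtor: return the PTP to the default pkey */
static inline void kpkeys_unprotect_ptp(struct ptdesc *ptdesc)
{
        unsigned long addr = (unsigned long)ptdesc_address(ptdesc);

        set_memory_pkey(addr, 1, 0);
}

Since this runs on every PTP allocation/free, its cost scales with how
often page tables are constructed - hence the visible extra overhead on
the fork-heavy benchmarks above.
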
>> Benchmarks:
>> - mmtests/kernbench: running kernbench (kernel build) [4].
>> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
>>   1 GB mapping is created and then fork/munmap is called. The mapping
>>   is created using either page-sized (h:0) or hugepage folios (h:1);
>>   in all cases the memory is PTE-mapped.
>> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
>>   (p:) and whether huge pages are used (h:).
>>
>> On a "real-world" and fork-heavy workload like kernbench, the
>> estimated overhead of kpkeys_hardened_pgtables is reasonable: 4%
>> system time overhead without batching, and about half that figure
>> (2.2%) with batching. The real time overhead is negligible.
>>
>> Microbenchmarks show large overheads without batching, which increase
>> with the number of pages being manipulated. Batching drastically
>> reduces that overhead, almost negating it for micromm/fork. Because
>> all PTEs in the mapping are modified in the same lazy_mmu section,
>> the kpkeys level is changed just twice regardless of the mapping
>> size; as a result, the relative overhead actually decreases as the
>> size increases for fix_size_alloc_test.
>>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes
>> the attributes of some PTE descriptors. However, some systems may be
>> able to use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>>
>> [...]
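For readers who haven't looked at the series, the lazy_mmu batching
quoted above (patch 18) has roughly the following shape - again a
sketch: the helper names (kpkeys_set_level(), kpkeys_restore_level(),
in_lazy_mmu_mode(), emit_pte_barriers()) and the nesting details are
simplifications rather than the exact code:

static DEFINE_PER_CPU(u64, kpkeys_lazy_mmu_saved);

/* 1. one switch to KPKEYS_LVL_PGTABLES for the whole section */
void arch_enter_lazy_mmu_mode(void)
{
        __this_cpu_write(kpkeys_lazy_mmu_saved,
                         kpkeys_set_level(KPKEYS_LVL_PGTABLES));
}

/*
 * 2. while inside the section, the per-PTE helpers observe
 * in_lazy_mmu_mode() and skip their own level switch (and with it the
 * two ISBs per PTE).
 */

/* 3. restore the previous level when leaving */
void arch_leave_lazy_mmu_mode(void)
{
        emit_pte_barriers();    /* already issues an ISB when kernel
                                   page tables were updated */
        /*
         * ... in which case the ISB that would normally follow the
         * pkey register write can be skipped:
         */
        kpkeys_restore_level(__this_cpu_read(kpkeys_lazy_mmu_saved));
}

This is also why the batched overhead stays roughly flat as the mapping
grows: a single enter/leave pair brackets the whole operation, so the
per-PTE cost is unchanged from the baseline.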