On 20/08/2025 17:53, Kevin Brodsky wrote:
> On 15/08/2025 10:54, Kevin Brodsky wrote:
>> [...]
>>
>> Performance
>> ===========
>>
>> No arm64 hardware currently implements POE. To estimate the
>> performance impact of kpkeys_hardened_pgtables, a mock implementation
>> of kpkeys has been used, replacing accesses to the POR_EL1 register
>> with accesses to another system register that is otherwise unused
>> (CONTEXTIDR_EL1), and leaving everything else unchanged. Most of the
>> kpkeys overhead is expected to originate from the barrier (ISB) that
>> is required after writing to POR_EL1, and from setting the POIndex
>> (pkey) in page tables; both of these are done in exactly the same way
>> in the mock implementation.
>
> It turns out this wasn't the case regarding the pkey setting - because
> patch 6 gates set_memory_pkey() on system_supports_poe() and not
> arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey()
> into a no-op. Many thanks to Rick Edgecombe for highlighting that the
> overheads were suspiciously low for some benchmarks!
>
>> The original implementation of kpkeys_hardened_pgtables is very
>> inefficient when many PTEs are changed at once, as the kpkeys level
>> is switched twice for every PTE (two ISBs per PTE). Patch 18
>> introduces an optimisation that makes use of the lazy_mmu mode to
>> batch those switches: 1. switch to KPKEYS_LVL_PGTABLES on
>> arch_enter_lazy_mmu_mode(), 2. skip any kpkeys switch while in that
>> section, and 3. restore the kpkeys level on
>> arch_leave_lazy_mmu_mode(). When that last function already issues an
>> ISB (when updating kernel page tables), we get a further
>> optimisation, as we can skip the ISB when restoring the kpkeys level.
>>
>> Both implementations (without and with batching) were evaluated on an
>> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks
>> that involve heavy page table manipulations. The results shown below
>> are relative to the baseline for this series, which is 6.17-rc1. The
>> branches used for all three sets of results (baseline, with/without
>> batching) are available in a repository; see next section.
>>
>> Caveat: these numbers should be seen as a lower bound for the
>> overhead of a real POE-based protection. The hardware checks added by
>> POE are, however, not expected to incur significant extra overhead.
>>
>> Reading example: for the fix_size_alloc_test benchmark, using 1 page
>> per iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35%
>> overhead without batching, and 14.62% overhead with batching. Both
>> results are considered statistically significant (95% confidence
>> interval), indicated by "(R)".
>>
>> +-------------------+----------------------------------+------------------+---------------+
>> | Benchmark         | Result Class                     | Without batching | With batching |
>> +===================+==================================+==================+===============+
>> | mmtests/kernbench | real time                        |            0.30% |         0.11% |
>> |                   | system time                      |        (R) 3.97% |     (R) 2.17% |
>> |                   | user time                        |            0.12% |         0.02% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
>> |                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
>> |                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
>> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
>> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
>> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
>> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
>> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
>> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
>> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
>> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
>> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
>> +-------------------+----------------------------------+------------------+---------------+
>
> These numbers therefore correspond to set_memory_pkey() being a no-op;
> in other words, they represent the overhead of switching the pkey
> register only.
>
> I have amended the mock implementation so that set_memory_pkey() is
> run as it would be on a real POE implementation (i.e. actually setting
> the PTE bits).
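To make the gating issue concrete, here is roughly what the amended
gating looks like - a minimal sketch, not the actual patch 6 code, and
the exact signature of set_memory_pkey() in the series may differ:

int set_memory_pkey(unsigned long addr, int numpages, int pkey)
{
        /*
         * Gating on system_supports_poe() made this a no-op under the
         * mock implementation; gating on arch_kpkeys_enabled() keeps
         * the PTE updates in both the real and the mock case.
         */
        if (!arch_kpkeys_enabled())
                return 0;

        /* ... set the POIndex field in the PTEs mapping the range ... */
        return 0;
}

The mock pkey register switch itself is unchanged: it performs the same
write-plus-ISB sequence as a real POR_EL1 update, only targeting the
unused register. Roughly (the helper name is assumed for illustration):

static inline void kpkeys_write_pkey_reg(u64 val)
{
        write_sysreg(val, contextidr_el1);      /* POR_EL1 on real HW */
        isb();
}
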
> Here are the new results, representing the overhead of both pkey
> register switching and setting the pkey of page table pages (PTPs) on
> alloc/free:
>
> +-------------------+----------------------------------+------------------+---------------+
> | Benchmark         | Result Class                     | Without batching | With batching |
> +===================+==================================+==================+===============+
> | mmtests/kernbench | real time                        |            0.32% |         0.35% |
> |                   | system time                      |        (R) 4.18% |     (R) 3.18% |
> |                   | user time                        |            0.08% |         0.20% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
> |                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
> |                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
> +-------------------+----------------------------------+------------------+---------------+
> Those results are overall very similar to the original ones.
> micromm/fork is however clearly impacted - around 4% additional
> overhead from set_memory_pkey(); it makes sense considering that
> forking requires duplicating (and therefore allocating) a full set of
> page tables. kernbench is also a fork-heavy workload and it gets a 1%
> hit in system time (with batching).
>
> It seems fair to conclude that, on arm64, setting the pkey whenever a
> PTP is allocated/freed is not particularly expensive. The situation
> may well be different on x86 as Rick pointed out, and it may also
> change on newer arm64 systems as I noted further down.
> Allocating/freeing PTPs in bulk should help if setting the pkey in
> the pgtable ctor/dtor proves too expensive.
>
> - Kevin
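To illustrate what "setting the pkey in the pgtable ctor/dtor" amounts
to, a minimal sketch - the helper and constant names here
(kpkeys_protect_ptp(), KPKEYS_PKEY_PGTABLES) are assumptions for
illustration, not necessarily the series' actual API:

/* called from the pgtable ctor: tag the freshly allocated PTP */
static inline bool kpkeys_protect_ptp(struct ptdesc *ptdesc)
{
        unsigned long addr = (unsigned long)ptdesc_address(ptdesc);

        return !set_memory_pkey(addr, 1, KPKEYS_PKEY_PGTABLES);
}

/* called from the pgtable dtor: return the PTP to the default pkey */
static inline void kpkeys_unprotect_ptp(struct ptdesc *ptdesc)
{
        unsigned long addr = (unsigned long)ptdesc_address(ptdesc);

        set_memory_pkey(addr, 1, 0);
}

Since this runs on every PTP allocation/free, its cost scales with how
often page tables are constructed - hence the visible extra overhead on
the fork-heavy benchmarks above.
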
>> Benchmarks:
>> - mmtests/kernbench: running kernbench (kernel build) [4].
>> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
>>   1 GB mapping is created and then fork/munmap is called. The mapping
>>   is created using either page-sized (h:0) or hugepage folios (h:1);
>>   in all cases the memory is PTE-mapped.
>> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
>>   (p:) and whether huge pages are used (h:).
>>
>> On a "real-world" and fork-heavy workload like kernbench, the
>> estimated overhead of kpkeys_hardened_pgtables is reasonable: 4%
>> system time overhead without batching, and about half that figure
>> (2.2%) with batching. The real time overhead is negligible.
>>
>> Microbenchmarks show large overheads without batching, which increase
>> with the number of pages being manipulated. Batching drastically
>> reduces that overhead, almost negating it for micromm/fork. Because
>> all PTEs in the mapping are modified in the same lazy_mmu section,
>> the kpkeys level is changed just twice regardless of the mapping
>> size; as a result, the relative overhead actually decreases as the
>> size increases for fix_size_alloc_test.
>>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes
>> the attributes of some PTE descriptors. However, some systems may be
>> able to use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>>
>> [...]
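For readers who haven't looked at the series, the lazy_mmu batching
quoted above (patch 18) has roughly the following shape - again a
sketch: the helper names (kpkeys_set_level(), kpkeys_restore_level(),
in_lazy_mmu_mode(), emit_pte_barriers()) and the nesting details are
simplifications rather than the exact code:

static DEFINE_PER_CPU(u64, kpkeys_lazy_mmu_saved);

/* 1. one switch to KPKEYS_LVL_PGTABLES for the whole section */
void arch_enter_lazy_mmu_mode(void)
{
        __this_cpu_write(kpkeys_lazy_mmu_saved,
                         kpkeys_set_level(KPKEYS_LVL_PGTABLES));
}

/*
 * 2. while inside the section, the per-PTE helpers observe
 * in_lazy_mmu_mode() and skip their own level switch (and with it the
 * two ISBs per PTE).
 */

/* 3. restore the previous level when leaving */
void arch_leave_lazy_mmu_mode(void)
{
        emit_pte_barriers();    /* already issues an ISB when kernel
                                   page tables were updated */
        /*
         * ... in which case the ISB that would normally follow the
         * pkey register write can be skipped:
         */
        kpkeys_restore_level(__this_cpu_read(kpkeys_lazy_mmu_saved));
}

This is also why the batched overhead stays roughly flat as the mapping
grows: a single enter/leave pair brackets the whole operation, so the
per-PTE cost is unchanged from the baseline.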