From: Artem Kuzin <[email protected]>
This patchset implements initial support for kernel text and rodata
replication on the x86_64 platform. Linux kernel 6.5.5 is used as the
baseline.

Similar work was previously published for the ARM64 platform by
Russell King (arm64 kernel text replication). We hope that it will be
possible to push this technology forward together.

The current implementation supports the following functionality:
1. Kernel text and rodata are replicated per NUMA node.
2. Vmalloc is able to work with replicated areas, so kernel module
   text and rodata are also replicated at module load time.
3. BPF handlers are not replicated by default, but this can easily be
   done using the existing APIs.
4. KASAN works, except in the 5-level translation table case.
5. KPROBES, KGDB and all functionality that depends on kernel text
   patching work without any limitation.
6. KPTI and KASLR are fully supported.
7. The parts of the translation tables related to replicated text and
   rodata are replicated as well.
Translation table synchronization is necessary only in a few special cases:
1. Kernel boot
2. Module deployment
3. Any allocation in user space that requires a new PUD/P4D
In the current design, modifications of mutable kernel data do not
require synchronization between translation tables, because on 64-bit
platforms all physical memory is already mapped in kernel space and
this mapping is persistent.

In user space, translation table synchronization is quite rare, since
the only case that requires it is a new PUD/P4D allocation. Currently
only the PGD level is replicated for user space. Please refer to the
diagrams below.
TT overview:
               NODE 0                   NODE 1
          USER      KERNEL         USER      KERNEL
        ---------------------    ---------------------
    PGD | | | | | | | | | |*|    | | | | | | | | | |*|
        ---------------------    ---------------------
                           |                        |
        --------------------     --------------------
        |                        |
        ---------------------    ---------------------
    PUD | | | | | | | |*|*|      | | | | | | | |*|*|
        ---------------------    ---------------------
                         |                        |
        ------------------       ------------------
        |                        |
        ---------------------    ---------------------
    PMD |READ-ONLY|MUTABLE  |    |READ-ONLY|MUTABLE  |
        ---------------------    ---------------------
             |         |              |         |
             |         --------------------------
             |              |         |
          --------       -------  --------
    PHYS  |      |       |     |  |      |
    MEM   --------       -------  --------
          <------>                <------>
           NODE 0        Shared    NODE 1
                         between
                          nodes
* - entries unique in each table
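
To make the overview above concrete, here is a minimal user-space model
of the layout. It is only a sketch: NR_NODES, ro_replica, node_view and
the helper names are hypothetical and are not taken from the patches.
It shows the one property the picture conveys: every node resolves the
read-only region to its own copy, while mutable data resolves to a
single shared location.

```c
#include <assert.h>
#include <string.h>

#define NR_NODES 2      /* assumption: two NUMA nodes, as in the diagram */
#define RO_SIZE  16     /* toy size of the read-only region */

/* One copy of text/rodata per node; mutable data exists only once. */
static char ro_replica[NR_NODES][RO_SIZE];
static int shared_mutable;

/* Hypothetical per-node view of kernel memory. */
struct node_view {
	const char *ro;  /* node-local replica of the read-only region */
	int *rw;         /* identical pointer on every node */
};

/* Return the view that a CPU on the given node would see. */
static struct node_view node_view_of(int node)
{
	assert(node >= 0 && node < NR_NODES);
	struct node_view v = { ro_replica[node], &shared_mutable };
	return v;
}

/* Copy one master image into every per-node replica. */
static void replicate_ro(const char *image, size_t len)
{
	assert(len <= RO_SIZE);
	for (int n = 0; n < NR_NODES; n++)
		memcpy(ro_replica[n], image, len);
}
```

After replicate_ro() the two views return different ro pointers with
identical contents, and a write through one node's rw pointer is
visible through the other's.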
TT synchronization:
               NODE 0                   NODE 1
          USER      KERNEL         USER      KERNEL
        ---------------------    ---------------------
    PGD | | |0| | | | | | |      | | |0| | | | | | |
        ---------------------    ---------------------
                              |
                              |
                              |
                              |
                              |  PUD_ALLOC / P4D_ALLOC
                              |
                              |  IN USERSPACE
                              |
                              \/
        ---------------------    ---------------------
    PGD | | |p| | | | | | |      | | |p| | | | | | |
        ---------------------    ---------------------
             |                        |
             |                        |
             --------------------------
                          |
                ---------------------
        PUD/P4D | | | | | | | | | |
                ---------------------
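
The synchronization step pictured above can be modelled in a few lines
of user-space C. This is a sketch under stated assumptions, not the
code from the patches: pgd, pgd_populate_all and pgd_lookup are
hypothetical names, and the table size is a toy value. The idea shown
is that a newly allocated PUD/P4D table is installed into the same slot
of every per-node PGD, so all replicas end up pointing at one shared
lower-level table.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES      2   /* assumption: two nodes, as in the diagram */
#define PGD_ENTRIES_  8   /* toy table size, not the real 512 */

struct pud_table { int dummy; };

/* One top-level table per node (model of the replicated PGD). */
static struct pud_table *pgd[NR_NODES][PGD_ENTRIES_];

/*
 * Install a newly allocated PUD/P4D table into slot idx of every
 * per-node PGD, turning the empty entry ("0") into a pointer ("p").
 */
static void pgd_populate_all(size_t idx, struct pud_table *pud)
{
	assert(idx < PGD_ENTRIES_);
	for (int n = 0; n < NR_NODES; n++)
		pgd[n][idx] = pud;
}

/* Read slot idx as seen from the given node. */
static struct pud_table *pgd_lookup(int node, size_t idx)
{
	assert(node >= 0 && node < NR_NODES && idx < PGD_ENTRIES_);
	return pgd[node][idx];
}
```

Since only the top level is replicated for user space in this model,
nothing below the shared table needs to be touched again after the
slot has been populated on all nodes.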
Known problems:
1. KASAN does not work with 5-level translation tables.
2. Replication support in vmalloc can possibly be optimized in the future.
3. Module APIs currently lack memory policy support.
   This will be fixed in the future.
Preliminary performance evaluation results:
Processor: Intel(R) Xeon(R) CPU E5-2690
2 NUMA nodes with 12 CPU cores each
fork/1      - The system call is invoked once. The time between entering
              and exiting the system call is measured.
fork/1024   - The system call is invoked in a loop 1024 times. The time
              between entering the loop and exiting it is measured.
mmap/munmap - A set of 1024 pages (PAGE_SIZE defaults to 4096 if it is
              not defined) is mapped using the mmap syscall and unmapped
              using munmap. Each page is mapped and unmapped in its own
              loop iteration.
mmap/lock   - The same as above, but with the MAP_LOCKED flag added.
open/close  - The /dev/null pseudo-file is opened and closed in a loop
              1024 times, once per iteration.
mount       - The procfs pseudo-filesystem is mounted once to a temporary
              directory inside /tmp. The time between entering and
              exiting the system call is measured.
kill        - A signal handler for SIGUSR1 is set up. The signal is sent
              to a child process created using glibc's fork wrapper. The
              time between sending and receiving the SIGUSR1 signal is
              measured.
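
For illustration, the open/close measurement described above could be
reproduced from user space roughly as follows. This is a sketch, not
the harness used for the numbers below: the use of CLOCK_MONOTONIC and
the function names bench_open_close and elapsed_ns are assumptions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed from timestamp a to timestamp b. */
static long long elapsed_ns(const struct timespec *a,
			    const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000000LL +
	       (b->tv_nsec - a->tv_nsec);
}

/*
 * Open and close /dev/null iters times, once per iteration.
 * Returns the total loop time in nanoseconds, or -1 if open() fails.
 */
static long long bench_open_close(int iters)
{
	struct timespec start, end;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iters; i++) {
		int fd = open("/dev/null", O_RDONLY);

		if (fd < 0)
			return -1;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	return elapsed_ns(&start, &end);
}
```

Calling bench_open_close(1024) matches the loop count given for the
open/close case. Pinning the process to one node (for example with
numactl) would be a natural addition, but is left out of the sketch.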
Hot caches:
  fork/1         2.3%
  fork/1024     10.8%
  mmap/munmap    0.4%
  mmap/lock      4.2%
  open/close     3.2%
  kill           4%
  mount          8.7%

Cold caches:
  fork/1        42.7%
  fork/1024     17.1%
  mmap/munmap    0.4%
  mmap/lock      1.5%
  open/close     0.4%
  kill          26.1%
  mount          4.1%
Artem Kuzin (12):
mm: allow per-NUMA node local PUD/PMD allocation
mm: add config option and per-NUMA node VMS support
mm: per-NUMA node replication core infrastructure
x86: add support of memory protection for NUMA replicas
x86: enable memory protection for replicated memory
x86: align kernel text and rodata using HUGE_PAGE boundary
x86: enable per-NUMA node kernel text and rodata replication
x86: make kernel text patching aware about replicas
x86: add support of NUMA replication for efi page tables
mm: add replicas allocation support for vmalloc
x86: add kernel modules text and rodata replication support
mm: set memory permissions for BPF handlers replicas
arch/x86/include/asm/numa_replication.h | 42 ++
arch/x86/include/asm/pgalloc.h | 10 +
arch/x86/include/asm/set_memory.h | 14 +
arch/x86/kernel/alternative.c | 116 ++---
arch/x86/kernel/kprobes/core.c | 2 +-
arch/x86/kernel/module.c | 35 +-
arch/x86/kernel/smpboot.c | 2 +
arch/x86/kernel/vmlinux.lds.S | 4 +-
arch/x86/mm/dump_pagetables.c | 9 +
arch/x86/mm/fault.c | 4 +-
arch/x86/mm/init.c | 8 +-
arch/x86/mm/init_64.c | 4 +-
arch/x86/mm/pat/set_memory.c | 150 ++++++-
arch/x86/mm/pgtable.c | 76 +++-
arch/x86/mm/pti.c | 2 +-
arch/x86/mm/tlb.c | 30 +-
arch/x86/platform/efi/efi_64.c | 9 +
include/asm-generic/pgalloc.h | 34 ++
include/asm-generic/set_memory.h | 12 +
include/linux/gfp.h | 2 +
include/linux/mm.h | 79 +++-
include/linux/mm_types.h | 11 +-
include/linux/moduleloader.h | 10 +
include/linux/numa_replication.h | 85 ++++
include/linux/set_memory.h | 10 +
include/linux/vmalloc.h | 24 +
init/main.c | 5 +
kernel/bpf/bpf_struct_ops.c | 8 +-
kernel/bpf/core.c | 4 +-
kernel/bpf/trampoline.c | 6 +-
kernel/module/main.c | 8 +
kernel/module/strict_rwx.c | 14 +-
mm/Kconfig | 10 +
mm/Makefile | 1 +
mm/memory.c | 251 ++++++++++-
mm/numa_replication.c | 564 ++++++++++++++++++++++++
mm/page_alloc.c | 18 +
mm/vmalloc.c | 469 ++++++++++++++++----
net/bpf/bpf_dummy_struct_ops.c | 2 +-
39 files changed, 1919 insertions(+), 225 deletions(-)
create mode 100644 arch/x86/include/asm/numa_replication.h
create mode 100644 include/linux/numa_replication.h
create mode 100644 mm/numa_replication.c
--
2.34.1