Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
On Wed, May 15, 2024 at 01:32:24PM -0700, Sean Christopherson wrote:
> On Tue, May 14, 2024, Mickaël Salaün wrote:
> > On Fri, May 10, 2024 at 10:07:00AM +, Nicolas Saenz Julienne wrote:
> > > Development happens
> > > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > > branch, but I'd advice against looking into it until we add some order
> > > to the rework. Regardless, feel free to get in touch.
> >
> > Thanks for the update.
> >
> > Could we schedule a PUCK meeting to synchronize and help each other?
> > What about June 12?
>
> June 12th works on my end.

Can you please send an invite?

 Mickaël
Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
On Fri, May 10, 2024 at 10:07:00AM +, Nicolas Saenz Julienne wrote:
> On Tue May 7, 2024 at 4:16 PM UTC, Sean Christopherson wrote:
> > > If yes, that would indeed require a *lot* of work for something we're not
> > > sure will be accepted later on.
> >
> > Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
> > is trending toward the paired VM+vCPU model.  IMO, it's entirely feasible to
> > design KVM support such that much of the development load can be shared between
> > the projects.  And having 2+ use cases for a feature (set) makes it _much_ more
> > likely that the feature(s) will be accepted.
>
> Since Sean mentioned our VSM efforts, a small update. We were able to
> validate the concept of one KVM VM per VTL as discussed in LPC. Right
> now only for single CPU guests, but are in the late stages of bringing
> up MP support. The resulting KVM code is small, and most will be
> uncontroversial (I hope). If other obligations allow it, we plan on
> having something suitable for review in the coming months.

Looks good!

>
> Our implementation aims to implement all the VSM spec necessary to run
> with Microsoft Credential Guard. But note that some aspects necessary
> for HVCI are not covered, especially the ones that depend on MBEC
> support, or some categories of secure intercepts.

We already implemented support for MBEC, so that should not be an issue.
We just need to find the best interface to configure it.

>
> Development happens
> https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> branch, but I'd advice against looking into it until we add some order
> to the rework. Regardless, feel free to get in touch.

Thanks for the update.

Could we schedule a PUCK meeting to synchronize and help each other?
What about June 12?
Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
On Tue, May 07, 2024 at 09:16:06AM -0700, Sean Christopherson wrote:
> On Tue, May 07, 2024, Mickaël Salaün wrote:
> > > Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> > > Linux already knows what assets it wants to (un)protect and when.  What's missing
> > > is a way for the guest kernel to effectively deprivilege and re-authenticate
> > > itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> > > support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, with a
> > > bit of pKVM mixed in?
> > >
> > > Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
> > > the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
> > > userspace creates VTL1.  Userspace also provides a way for VTL0 to restrict access to
> > > its memory, e.g. to effectively make the page tables for the kernel's direct map
> > > writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then
> > > also completely remove its access to code that changes CR0/CR4.
> > >
> > > It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
> > > text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
> > > be doable with annotations, e.g. tag relevant functions with __magic or whatever,
> > > throw them in a dedicated section, and then free/protect the section(s) at the
> > > appropriate time.
> > >
> > > KVM would likely need to provide the ability to switch VTLs (or whatever they get
> > > called), and host userspace would need to provide a decent amount of the backend
> > > mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn off?)
> > > VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
> > > E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code into
> > > VTL1.  That way VTL1 can verify the kexec() target and tear itself down before
> > > jumping into the new kernel.
> > >
> > > This is very off the cuff and hand-wavy, e.g. I don't have much of an idea what
> > > it would take to harden kernel text patching, but keeping the policy in the guest
> > > seems like it'd make everything more tractable than trying to define an ABI
> > > between Linux and a VMM that is rich and flexible enough to support all the
> > > fancy things Linux does (and will do in the future).
> >
> > Yes, we agree that the guest needs to manage its own policy.  That's why
> > we implemented Heki for KVM this way, but without VTLs because KVM
> > doesn't support them.
> >
> > To sum up, is the VTL approach the only one that would be acceptable for
> > KVM?
>
> Heh, that's not a question you want to be asking.  You're effectively asking me
> to make an authoritative, "final" decision on a topic which I am only passingly
> familiar with.
>
> But since you asked it... :-)  Probably?
>
> I see a lot of advantages to a VTL/VSM-like approach:
>
>  1. Provides Linux-as-a guest the flexibility it needs to meaningfully advance
>     its security, with the least amount of policy built into the guest/host ABI.
>
>  2. Largely decouples guest policy from the host, i.e. should allow the guest to
>     evolve/update its policy without needing to coordinate changes with the host.
>
>  3. The KVM implementation can be generic enough to be reusable for other features.
>
>  4. Other groups are already working on VTL-like support in KVM, e.g. for VSM
>     itself, and potentially for VMPL/SVSM support.
>
> IMO, #2 is a *huge* selling point.  Not having to coordinate changes across
> multiple code bases and/or organizations and/or maintainers is a big win for
> velocity, long term maintenance, and probably the very viability of HEKI.

Agree, this is our goal.

>
> Providing the guest with the tools to define and implement its own policy means
> end users don't have to wait for some third party, e.g. CSPs, to deploy the
> accompanying host-side changes, because there are no host-side changes.
>
> And encapsulating everything in the guest drastically re
Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
On Mon, May 06, 2024 at 06:34:53PM GMT, Sean Christopherson wrote:
> On Mon, May 06, 2024, Mickaël Salaün wrote:
> > On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > > ---
> > > >
> > > > Changes since v1:
> > > > * New patch. Making user space aware of Heki properties was requested by
> > > >   Sean Christopherson.
> > >
> > > No, I suggested having userspace _control_ the pinning[*], not merely be notified
> > > of pinning.
> > >
> > >  : IMO, manipulation of protections, both for memory (this patch) and CPU state
> > >  : (control registers in the next patch) should come from userspace.  I have no
> > >  : objection to KVM providing plumbing if necessary, but I think userspace needs
> > >  : to have full control over the actual state.
> > >  :
> > >  : One of the things that caused Intel's control register pinning series to stall
> > >  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
> > >  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> > >  : and avoids questions like "should userspace be able to overwrite pinned control
> > >  : registers".
> > >  :
> > >  : And like the confidential VM use case, keeping userspace in the loop is a big
> > >  : benefit, e.g. the guest can't circumvent protections by coercing userspace into
> > >  : writing to protected memory.
> > >
> > > I stand by that suggestion, because I don't see a sane way to handle things like
> > > kexec() and reboot without having a _much_ more sophisticated policy than would
> > > ever be acceptable in KVM.
> > >
> > > I think that can be done without KVM having any awareness of CR pinning whatsoever.
> > > E.g. userspace just needs the ability to intercept CR writes and inject #GPs.  Off
> > > the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> > > userspace could enforce MSR pinning without any new KVM uAPI at all.
> > >
> > > [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com
> >
> > OK, I had concerns about the control not directly coming from the guest,
> > especially in the case of pKVM and confidential computing, but I get your
>
> Hardware-based CoCo is completely out of scope, because KVM has zero visibility
> into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
> shouldn't intercept CR0/CR4 for SNP guests).
>
> And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM might
> help enforce policies that are defined by hardware/firmware, but KVM doesn't
> define any of its own.
>
> If pKVM on x86 comes along, then KVM will likely get in the business of defining
> policy, but until that happens, KVM needs to stay firmly out of the picture.
>
> > point.  It should indeed be quite similar to the MSR filtering on the
> > userspace side, except that we need another interface for the guest to
> > request such change (i.e. self-protection).
> >
> > Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> > forward the request to userspace with a VM exit instead?  That would
> > also enable userspace to get the request and directly configure the CR
> > pinning with the same VM exit.
>
> No?  Maybe?  I strongly suspect that full support will need a richer set of APIs
> than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated SMM,
> and so on and so forth.  And that's just for CR pinning.
>
> And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
> delegation or restriction, i.e. there's no way for the guest to communicate to
> the hypervisor that a less privileged component is allowed to perform some action,
> nor is there a way for the guest to say some chunk of CPL0 code *isn't* allowed
> to request transition.  Delegation and restriction all has to be done out-of-band.
>
> It'd probably be more annoying to setup initially, but I think a synthetic device
> with an MMIO-based interface would be more powerful and flexible in the long run.
> Then userspace can evolve without needing to wait for KVM to catch up.
>
> Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> Linux alre
Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> On Fri, May 03, 2024, Mickaël Salaün wrote:
> > Add an interface for user space to be notified about guests' Heki policy
> > and related violations.
> >
> > Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
> > KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
> > contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
> > returned value is the bitmask of known Heki exit reasons, for now:
> > KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.
> >
> > If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
> > KVM_HC_LOCK_CR_UPDATE hypercall according to the requested control
> > register. This enables to enlighten the VMM with the guest
> > auto-restrictions.
> >
> > If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
> > pinned CR violation. This enables the VMM to react to a policy
> > violation.
> >
> > Cc: Borislav Petkov
> > Cc: Dave Hansen
> > Cc: H. Peter Anvin
> > Cc: Ingo Molnar
> > Cc: Kees Cook
> > Cc: Madhavan T. Venkataraman
> > Cc: Paolo Bonzini
> > Cc: Sean Christopherson
> > Cc: Thomas Gleixner
> > Cc: Vitaly Kuznetsov
> > Cc: Wanpeng Li
> > Signed-off-by: Mickaël Salaün
> > Link: https://lore.kernel.org/r/20240503131910.307630-4-...@digikod.net
> > ---
> >
> > Changes since v1:
> > * New patch. Making user space aware of Heki properties was requested by
> >   Sean Christopherson.
>
> No, I suggested having userspace _control_ the pinning[*], not merely be notified
> of pinning.
>
>  : IMO, manipulation of protections, both for memory (this patch) and CPU state
>  : (control registers in the next patch) should come from userspace.  I have no
>  : objection to KVM providing plumbing if necessary, but I think userspace needs
>  : to have full control over the actual state.
>  :
>  : One of the things that caused Intel's control register pinning series to stall
>  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
>  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
>  : and avoids questions like "should userspace be able to overwrite pinned control
>  : registers".
>  :
>  : And like the confidential VM use case, keeping userspace in the loop is a big
>  : benefit, e.g. the guest can't circumvent protections by coercing userspace into
>  : writing to protected memory.
>
> I stand by that suggestion, because I don't see a sane way to handle things like
> kexec() and reboot without having a _much_ more sophisticated policy than would
> ever be acceptable in KVM.
>
> I think that can be done without KVM having any awareness of CR pinning whatsoever.
> E.g. userspace just needs the ability to intercept CR writes and inject #GPs.  Off
> the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> userspace could enforce MSR pinning without any new KVM uAPI at all.
>
> [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com

OK, I had concerns about the control not directly coming from the guest,
especially in the case of pKVM and confidential computing, but I get your
point.  It should indeed be quite similar to the MSR filtering on the
userspace side, except that we need another interface for the guest to
request such change (i.e. self-protection).

Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
forward the request to userspace with a VM exit instead?  That would
also enable userspace to get the request and directly configure the CR
pinning with the same VM exit.
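The userspace-controlled model discussed in this message (userspace intercepts CR writes, injects #GP on a violation, and alone decides when to drop protections, e.g. on reboot or kexec()) can be sketched as a small state machine. This is only an illustration under stated assumptions: struct cr_pin_policy, vmm_pin_cr_bits(), vmm_handle_cr_write() and vmm_reset_pinning() are hypothetical names, not KVM or QEMU interfaces, and the actual interception mechanism is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical VMM-side state for one control register. */
struct cr_pin_policy {
	uint64_t pinned_mask;   /* bits the guest may no longer change */
	uint64_t pinned_value;  /* value of those bits when they were pinned */
};

enum cr_write_verdict {
	CR_WRITE_ALLOW,      /* let the intercepted write through */
	CR_WRITE_INJECT_GP,  /* reject the write and inject #GP */
};

/* Record a pinning request; already pinned bits keep their recorded value. */
static void vmm_pin_cr_bits(struct cr_pin_policy *p, uint64_t mask,
			    uint64_t current_value)
{
	uint64_t newly_pinned;

	assert(p);
	newly_pinned = mask & ~p->pinned_mask;
	p->pinned_value |= current_value & newly_pinned;
	p->pinned_mask |= mask;
}

/* Decide what to do with an intercepted CR write. */
static enum cr_write_verdict
vmm_handle_cr_write(const struct cr_pin_policy *p, uint64_t new_value)
{
	assert(p);
	if ((new_value & p->pinned_mask) != p->pinned_value)
		return CR_WRITE_INJECT_GP;
	return CR_WRITE_ALLOW;
}

/* Policy owned by userspace: drop all pins, e.g. on reboot (assumption). */
static void vmm_reset_pinning(struct cr_pin_policy *p)
{
	assert(p);
	p->pinned_mask = 0;
	p->pinned_value = 0;
}
```

The point of the split is that the "when to unprotect" question lives entirely in vmm_reset_pinning(), i.e. in userspace policy, while the check itself stays trivial.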
[RFC PATCH v3 5/5] virt: Add Heki KUnit tests
The new CONFIG_HEKI_KUNIT_TEST option enables running tests in a kernel
module.  The minimal required configuration is listed in the
virt/heki-test/.kunitconfig file.

test_cr_disable_smep checks control-register pinning by trying to
disable SMEP.  This test should then fail on a non-protected kernel,
and only succeed with a kernel protected by Heki.

This test doesn't rely on native_write_cr4() because of the
cr4_pinned_mask hardening, which means that this *test* module loads
valid kernel code to arbitrarily change CR4.  This simulates an attack
scenario where an attacker would use ROP to directly jump to the related
CR4 instruction.

As for any KUnit test, the kernel is tainted with TAINT_TEST when the
test is executed.

It is interesting to create new KUnit tests instead of extending KVM's
Kselftests because Heki is designed to be hypervisor-agnostic, it relies
on a set of hypercalls (for KVM or others), and we also want to test
the kernel's configuration (actual pinned CR).  However, new KVM
Kselftests would be useful to test KVM's interface with the host.

When using Qemu, we need to pass the following arguments: -cpu host
-enable-kvm

For now, it is not possible to run these tests as built-in but we are
working on that [1].  If tests are built-in anyway, they will just be
skipped because Heki would not be enabled.

Run Heki tests with:

  insmod heki-test.ko

  KTAP version 1
  1..1
      KTAP version 1
      # Subtest: heki_x86
      # module: heki_test
      1..1
      ok 1 test_cr_disable_smep
  ok 1 heki_x86

Link: https://lore.kernel.org/r/20240229170409.365386-2-...@digikod.net [1]
Signed-off-by: Mickaël Salaün
Link: https://lore.kernel.org/r/20240503131910.307630-6-...@digikod.net
---

Changes since v2:
* Make tests standalone (e.g. don't depend on CONFIG_HEKI).
* Enable to create a test kernel module.
* Don't rely on private kernel symbols.
* Handle GP fault for CR-pinning test case.
* Rename option to CONFIG_HEKI_KUNIT_TEST.
* Add the list of required kernel options.
* Move tests to virt/heki-test/ [FIXME]
* Only keep CR pinning test.
* Restore previous state (with SMEP enabled).
* Add a Kconfig menu for Heki and update the description.
* Skip tests if Heki is not protecting the running kernel.

Changes since v1:
* Move all tests to virt/heki/tests.c
---
 include/linux/heki.h   |   1 +
 virt/heki/.kunitconfig |   9
 virt/heki/Kconfig      |  12 +
 virt/heki/Makefile     |   1 +
 virt/heki/heki-test.c  | 114 +
 virt/heki/main.c       |  10
 6 files changed, 147 insertions(+)
 create mode 100644 virt/heki/.kunitconfig
 create mode 100644 virt/heki/heki-test.c

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 96ccb17657e5..3294c4d583e5 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -35,6 +35,7 @@ struct heki {
 
 extern struct heki heki;
 extern bool heki_enabled;
+extern bool heki_enforcing;
 
 void heki_early_init(void);
 void heki_late_init(void);
diff --git a/virt/heki/.kunitconfig b/virt/heki/.kunitconfig
new file mode 100644
index ..ad4454800579
--- /dev/null
+++ b/virt/heki/.kunitconfig
@@ -0,0 +1,9 @@
+CONFIG_HEKI=y
+CONFIG_HEKI_KUNIT_TEST=m
+CONFIG_HEKI_MENU=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_HYPERVISOR_GUEST=y
+CONFIG_KUNIT=y
+CONFIG_KVM=y
+CONFIG_KVM_GUEST=y
+CONFIG_PARAVIRT=y
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 0c764e342f48..18895a81a9af 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -28,4 +28,16 @@ config HEKI
 	  This feature is helpful in maintaining guest virtual machine security
 	  even after the guest kernel has been compromised.
 
+config HEKI_KUNIT_TEST
+	tristate "KUnit tests for Heki" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	depends on X86
+	default KUNIT_ALL_TESTS
+	help
+	  Build KUnit tests for Heki.
+
+	  See the KUnit documentation in Documentation/dev-tools/kunit
+
+	  If you are unsure how to answer this question, answer N.
+ endif diff --git a/virt/heki/Makefile b/virt/heki/Makefile index 8b10e73a154b..7133545eb5ae 100644 --- a/virt/heki/Makefile +++ b/virt/heki/Makefile @@ -1,3 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only obj-$(CONFIG_HEKI) += main.o +obj-$(CONFIG_HEKI_KUNIT_TEST) += heki-test.o diff --git a/virt/heki/heki-test.c b/virt/heki/heki-test.c new file mode 100644 index ..b4e11c21ac5d --- /dev/null +++ b/virt/heki/heki-test.c @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Hypervisor Enforced Kernel Integrity (Heki) - Tests + * + * Copyright © 2023-2024 Microsoft Corporation + */ + +#include +#include +#include +#include +#include + +/* Returns true on error (i.e. GP fault), false otherwise. */ +static __always_inline bool set_cr4(unsigned long value) +{ + int err = 0; + + might_sleep(); + /* clang-format off */ + asm volatile("1: mov %[value],%%cr4 \n" +
[RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
Add an interface for user space to be notified about guests' Heki policy
and related violations.

Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
KVM_CAP_HEKI_DENIAL.  Each one takes a bitmask as first argument that can
contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.  The
returned value is the bitmask of known Heki exit reasons, for now:
KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.

If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
KVM_HC_LOCK_CR_UPDATE hypercall according to the requested control
register.  This makes the VMM aware of the guest's self-imposed
restrictions.

If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
pinned CR violation.  This enables the VMM to react to a policy
violation.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Mickaël Salaün
Link: https://lore.kernel.org/r/20240503131910.307630-4-...@digikod.net
---

Changes since v1:
* New patch. Making user space aware of Heki properties was requested by
  Sean Christopherson.
--- arch/x86/kvm/vmx/vmx.c | 5 +- arch/x86/kvm/x86.c | 114 +++ arch/x86/kvm/x86.h | 7 +-- include/linux/kvm_host.h | 2 + include/uapi/linux/kvm.h | 22 5 files changed, 136 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 7ba970b525f7..5869a1ed7866 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5445,6 +5445,7 @@ static int handle_cr(struct kvm_vcpu *vcpu) int reg; int err; int ret; + bool exit = false; exit_qualification = vmx_get_exit_qual(vcpu); cr = exit_qualification & 15; @@ -5454,8 +5455,8 @@ static int handle_cr(struct kvm_vcpu *vcpu) val = kvm_register_read(vcpu, reg); trace_kvm_cr_write(cr, val); - ret = heki_check_cr(vcpu, cr, val); - if (ret) + ret = heki_check_cr(vcpu, cr, val, ); + if (exit) return ret; switch (cr) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a5f47be59abc..865e88f2b0fc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -119,6 +119,10 @@ static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS; #define KVM_CAP_PMU_VALID_MASK KVM_PMU_CAP_DISABLE +#define KVM_HEKI_EXIT_REASON_VALID_MASK ( \ + KVM_HEKI_EXIT_REASON_CR0 | \ + KVM_HEKI_EXIT_REASON_CR4) + #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \ KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK) @@ -4836,6 +4840,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM)) r |= BIT(KVM_X86_SW_PROTECTED_VM); break; + case KVM_CAP_HEKI_CONFIGURE: + case KVM_CAP_HEKI_DENIAL: + r = KVM_HEKI_EXIT_REASON_VALID_MASK; + break; default: break; } @@ -6729,6 +6737,22 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } mutex_unlock(>lock); break; +#ifdef CONFIG_HEKI + case KVM_CAP_HEKI_CONFIGURE: + r = -EINVAL; + if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK) + break; + kvm->heki_configure_exit_reason = cap->args[0]; + r = 0; + break; + case KVM_CAP_HEKI_DENIAL: + r = -EINVAL; + if (cap->args[0] & 
~KVM_HEKI_EXIT_REASON_VALID_MASK) + break; + kvm->heki_denial_exit_reason = cap->args[0]; + r = 0; + break; +#endif default: r = -EINVAL; break; @@ -8283,11 +8307,60 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr) #ifdef CONFIG_HEKI +static int complete_heki_configure_exit(struct kvm_vcpu *const vcpu) +{ + kvm_rax_write(vcpu, 0); + ++vcpu->stat.hypercalls; + return kvm_skip_emulated_instruction(vcpu); +} + +static int complete_heki_denial_exit(struct kvm_vcpu *const vcpu) +{ + kvm_inject_gp(vcpu, 0); + return 1; +} + +/* Returns true if the @exit_reason is handled by @vcpu->kvm. */ +static bool heki_exit_cr(struct kvm_vcpu *const vcpu, const __u32 exit_reason, +const u64 heki_reason, unsigned long value) +{ + switch (exit_reason) { + case KVM_EXIT_HEKI_CONFIGURE: + if (!(vcpu->kvm->heki_configure_exit_reason & heki_reason)) + return false; + +
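The capability handling in this patch follows a common pattern: reject any bit outside the known mask, store the rest, and later test the stored mask to decide whether an event is forwarded to user space. Below is a minimal standalone sketch of that pattern. The bit positions and the names prefixed with HEKI_ or heki_ are assumptions made for illustration; the actual uAPI values are not visible in this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* ASSUMPTION: illustrative bit positions, not the real uAPI values. */
#define HEKI_EXIT_REASON_CR0 (UINT64_C(1) << 0)
#define HEKI_EXIT_REASON_CR4 (UINT64_C(1) << 1)
#define HEKI_EXIT_REASON_VALID_MASK \
	(HEKI_EXIT_REASON_CR0 | HEKI_EXIT_REASON_CR4)

/* Hypothetical per-VM state mirroring the two masks added by the patch. */
struct vm_heki_state {
	uint64_t configure_exit_reason;
	uint64_t denial_exit_reason;
};

/* Store @requested in @slot unless it contains unknown bits. */
static int heki_set_exit_mask(uint64_t *slot, uint64_t requested)
{
	assert(slot);
	if (requested & ~HEKI_EXIT_REASON_VALID_MASK)
		return -EINVAL;
	*slot = requested;
	return 0;
}

/* Non-zero if an event with @reason should be forwarded to user space. */
static int heki_should_exit(uint64_t enabled_mask, uint64_t reason)
{
	return (enabled_mask & reason) != 0;
}
```

Rejecting unknown bits up front keeps the interface extensible: a newer VMM asking for a reason that an older kernel does not know gets -EINVAL instead of a silently ignored request.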
[RFC PATCH v3 2/5] KVM: x86: Add new hypercall to lock control registers
This enables guests to lock their CR0 and CR4 registers with a subset of
X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes three arguments.  The
first is to identify the control register, the second is a bit mask to
pin (i.e. mark as read-only), and the third is for optional flags.

These register flags should already be pinned by Linux guests, but once
compromised, this self-protection mechanism could be disabled, which is
not the case with this dedicated hypercall.

Once the CRs are pinned by the guest, if it attempts to change them,
then a general protection fault is sent to the guest.

This hypercall may evolve and support new kinds of registers or pinning.
The optional KVM_LOCK_CR_UPDATE_VERSION flag enables guests to know the
supported abilities by mapping the returned version with the related
features.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Mickaël Salaün
Link: https://lore.kernel.org/r/20240503131910.307630-3-...@digikod.net
---

Changes since v1:
* Guard KVM_HC_LOCK_CR_UPDATE hypercall with CONFIG_HEKI.
* Move extern cr4_pinned_mask to x86.h (suggested by Kees Cook).
* Move VMX CR checks from vmx_set_cr*() to handle_cr() to make it
  possible to return to user space (see next commit).
* Change the heki_check_cr()'s first argument to vcpu.
* Don't use -KVM_EPERM in heki_check_cr().
* Generate a fault when the guest requests a denied CR update.
* Add a flags argument to get the version of this hypercall. Being able
  to do a proper version check was suggested by Wei Liu.
--- Documentation/virt/kvm/x86/hypercalls.rst | 17 + arch/x86/include/uapi/asm/kvm_para.h | 2 + arch/x86/kernel/cpu/common.c | 7 +- arch/x86/kvm/vmx/vmx.c| 5 ++ arch/x86/kvm/x86.c| 84 +++ arch/x86/kvm/x86.h| 22 ++ include/linux/kvm_host.h | 5 ++ include/uapi/linux/kvm_para.h | 1 + 8 files changed, 141 insertions(+), 2 deletions(-) diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst index 10db7924720f..3178576f4c47 100644 --- a/Documentation/virt/kvm/x86/hypercalls.rst +++ b/Documentation/virt/kvm/x86/hypercalls.rst @@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL. + +9. KVM_HC_LOCK_CR_UPDATE + + +:Architecture: x86 +:Status: active +:Purpose: Request some control registers to be restricted. + +- a0: identify a control register +- a1: bit mask to make some flags read-only +- a2: optional KVM_LOCK_CR_UPDATE_VERSION flag that will return the version of + this hypercall. Version 1 supports CR0 and CR4 pinning. + +The hypercall lets a guest request control register flags to be pinned for +itself. + +Returns 0 on success or a KVM error code otherwise. 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index a1efa7907a0b..cfc17f3d1877 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -149,4 +149,6 @@ struct kvm_vcpu_pv_apf_data { #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK #define KVM_PV_EOI_DISABLED 0x0 +#define KVM_LOCK_CR_UPDATE_VERSION (1 << 0) + #endif /* _UAPI_ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 605c26c009c8..69695d9d6e2a 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -398,8 +398,11 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c) } /* These bits should not change their value after CPU init is finished. */ -static const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP | -X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED; +const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | + X86_CR4_UMIP | X86_CR4_FSGSBASE | + X86_CR4_CET | X86_CR4_FRED; +EXPORT_SYMBOL_GPL(cr4_pinned_mask); + static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning); static unsigned long cr4_pinned_bits __ro_after_init; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 22411f4aff53..7ba970b525f7 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5453,6 +5453,11 @@ static int handle_cr(struct kvm_vcpu *vcpu) case 0: /* mov to cr */ val = kvm_register_read(vcp
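The three-argument contract documented above (a0 selects the control register, a1 is the mask of bits to pin, a2 carries the optional version flag) can be modelled in a few lines. This is a sketch under assumptions, not the code from the patch: lock_cr_update() and struct vcpu_pins are hypothetical names, -EINVAL stands in for whatever KVM error code the real handler returns, and rejecting unsupported mask bits is a guess at the intended behaviour. The CR bit positions are the architectural ones, and the version flag value matches the define in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define LOCK_CR_UPDATE_VERSION (UINT64_C(1) << 0) /* as in the patch */

/* Architectural bit positions of the flags named in the commit message. */
#define CR0_WP       (UINT64_C(1) << 16)
#define CR4_UMIP     (UINT64_C(1) << 11)
#define CR4_FSGSBASE (UINT64_C(1) << 16)
#define CR4_SMEP     (UINT64_C(1) << 20)
#define CR4_SMAP     (UINT64_C(1) << 21)
#define CR4_CET      (UINT64_C(1) << 23)

#define CR4_PINNABLE \
	(CR4_UMIP | CR4_FSGSBASE | CR4_SMEP | CR4_SMAP | CR4_CET)

/* Hypothetical per-vCPU record of pinned bits. */
struct vcpu_pins {
	uint64_t cr0;
	uint64_t cr4;
};

/* Returns the version if requested, 0 on success, -EINVAL otherwise. */
static int64_t lock_cr_update(struct vcpu_pins *pins, uint64_t cr,
			      uint64_t mask, uint64_t flags)
{
	assert(pins);
	if (flags & ~LOCK_CR_UPDATE_VERSION)
		return -EINVAL;
	if (flags & LOCK_CR_UPDATE_VERSION)
		return 1; /* version 1: CR0 and CR4 pinning */

	switch (cr) {
	case 0:
		if (mask & ~CR0_WP)
			return -EINVAL;
		pins->cr0 |= mask; /* pinning is additive, never removed */
		return 0;
	case 4:
		if (mask & ~CR4_PINNABLE)
			return -EINVAL;
		pins->cr4 |= mask;
		return 0;
	default:
		return -EINVAL;
	}
}
```

A guest would typically query the version first and only then issue the pinning calls it knows that version supports.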
[RFC PATCH v3 1/5] virt: Introduce Hypervisor Enforced Kernel Integrity (Heki)
From: Madhavan T. Venkataraman

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Implement minimal code to introduce Heki:

- Define the config variables.

- Define a kernel command line parameter "heki" to turn the feature
  on or off.  By default, Heki is on.

- Define heki_early_init() and call it in start_kernel().  Currently,
  this function only prints the value of the "heki" command line
  parameter.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Mickaël Salaün
Signed-off-by: Mickaël Salaün
Signed-off-by: Madhavan T. Venkataraman
Link: https://lore.kernel.org/r/20240503131910.307630-2-...@digikod.net
---

Changes since v2:
* Move CONFIG_HEKI under a new CONFIG_HEKI_MENU to group it with the
  test configuration (see following patches).
* Hide CONFIG_ARCH_SUPPORTS_HEKI from users.

Changes since v1:
* Shrunk this patch to only contain the minimal common parts.
* Moved heki_early_init() to start_kernel().
* Use kstrtobool().
--- Kconfig | 2 ++ arch/x86/Kconfig | 1 + include/linux/heki.h | 31 +++ init/main.c | 2 ++ mm/mm_init.c | 1 + virt/Makefile| 1 + virt/heki/Kconfig| 25 + virt/heki/Makefile | 3 +++ virt/heki/common.h | 16 virt/heki/main.c | 33 + 10 files changed, 115 insertions(+) create mode 100644 include/linux/heki.h create mode 100644 virt/heki/Kconfig create mode 100644 virt/heki/Makefile create mode 100644 virt/heki/common.h create mode 100644 virt/heki/main.c diff --git a/Kconfig b/Kconfig index 745bc773f567..0c844d9bcb03 100644 --- a/Kconfig +++ b/Kconfig @@ -29,4 +29,6 @@ source "lib/Kconfig" source "lib/Kconfig.debug" +source "virt/heki/Kconfig" + source "Documentation/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 928820e61cb5..d2fba63c289b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -34,6 +34,7 @@ config X86_64 select SWIOTLB select ARCH_HAS_ELFCORE_COMPAT select ZONE_DMA32 + select ARCH_SUPPORTS_HEKI config FORCE_DYNAMIC_FTRACE def_bool y diff --git a/include/linux/heki.h b/include/linux/heki.h new file mode 100644 index ..4c18d2283392 --- /dev/null +++ b/include/linux/heki.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Hypervisor Enforced Kernel Integrity (Heki) - Definitions + * + * Copyright © 2023 Microsoft Corporation + */ + +#ifndef __HEKI_H__ +#define __HEKI_H__ + +#include +#include +#include +#include +#include + +#ifdef CONFIG_HEKI + +extern bool heki_enabled; + +void heki_early_init(void); + +#else /* !CONFIG_HEKI */ + +static inline void heki_early_init(void) +{ +} + +#endif /* CONFIG_HEKI */ + +#endif /* __HEKI_H__ */ diff --git a/init/main.c b/init/main.c index 5dcf5274c09c..bec2c8d939aa 100644 --- a/init/main.c +++ b/init/main.c @@ -102,6 +102,7 @@ #include #include #include +#include #include #include @@ -1059,6 +1060,7 @@ void start_kernel(void) uts_ns_init(); key_init(); security_init(); + heki_early_init(); dbg_late_init(); net_ns_init(); vfs_caches_init(); diff --git a/mm/mm_init.c b/mm/mm_init.c 
index 549e76af8f82..89d9f97bd471 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "internal.h" #include "slab.h" #include "shuffle.h" diff --git a/virt/Makefile b/virt/Makefile index 1cfea9436af9..856b5ccedb5a 100644 --- a/virt/Makefile +++ b/virt/Makefile @@ -1,2 +1,3 @@ # SPDX-License-Identifier: GPL-2.0-only obj-y += lib/ +obj-$(CONFIG_HEKI_MENU) += heki/ diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig new file mode 100644 index ..66e73d212856 --- /dev/null +++ b/virt/heki/Kconfig @@ -0,0 +1,25 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Hypervisor Enforced Kernel Integrity (Heki) + +config ARCH_SUPPORTS_HEKI + bool + # An architecture should select this when it can successfully build + # and run with CONFIG_HEKI. That is, it should provide all of the + # architecture support required for the HEKI feature. + +menuconfig HEKI_MENU + bool "Virtualization hardening" + +if HEKI_MENU + +config HEKI + bool "Hypervisor Enforced Kernel Integrity (Heki)" + depends on ARCH_SUPPORTS_HEKI + help + This feature enhances guest virtual machine security by taking + advantage of security features provided by the hypervisor for guests. + This feature is helpful in maintaining guest virtual machine security + even
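The "heki" command line parameter described in this patch is a boolean that defaults to on. The following userspace sketch shows that behaviour; parse_bool() is a simplified stand-in written for this example and only approximates what the kernel's kstrtobool() accepts, and heki_enabled_sketch and heki_parse_param() are hypothetical names. Ignoring an unparsable value is an assumption, not something this excerpt confirms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace approximation of a boolean string parser. */
static int parse_bool(const char *s, bool *res)
{
	assert(res != NULL);
	if (!s)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" / "off" need the second character to disambiguate. */
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		break;
	default:
		break;
	}
	return -1;
}

/* Default matches the commit message: Heki is on unless disabled. */
static bool heki_enabled_sketch = true;

/* ASSUMPTION: an invalid value leaves the current setting untouched. */
static void heki_parse_param(const char *value)
{
	bool v;

	if (parse_bool(value, &v) == 0)
		heki_enabled_sketch = v;
}
```

With this shape, booting with heki=0 or heki=off turns the feature off, and omitting the parameter keeps the default.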
[RFC PATCH v3 0/5] Hypervisor-Enforced Kernel Integrity - CR pinning
to the guest. The guest could then send a signal to the user space process
that triggered this policy violation (not implemented).

Heki can be enabled with the heki=1 boot command argument.

# Similar implementations

Here is a non-exhaustive list of similar implementations that we looked at
and took some ideas from. Linux mainline doesn't support such security
features, let's change that!

Windows's Virtualization-Based Security is a proprietary technology that
provides a superset of this kind of security mechanism, relying on Hyper-V
and Virtual Trust Levels, which make it possible to have a light and secure
VM enforcing restrictions on a full guest VM. This includes several
components such as HVCI for code authenticity, or HyperGuard for monitoring
and protecting kernel code and data.

Samsung's Real-time Kernel Protection (RKP) and Huawei Hypervisor Execution
Environment (HHEE) rely on proprietary hypervisors to protect some Android
devices. They monitor critical kernel data (e.g., page tables, credentials,
selinux_enforcing).

The iOS Kernel Patch Protection (KPP/Watchtower) is a proprietary solution
running in EL3 that monitors and protects critical parts of the kernel. It
is now replaced with a hardware-based mechanism: KTRR/RoRgn.

Bitdefender's Hypervisor Memory Introspection (HVMI) is an open-source (but
out of tree) set of components leveraging virtualization. The HVMI
implementation is very complex, and this approach implies potential semantic
gap issues (i.e., kernel data structures may change from one version to
another).

Linux Kernel Runtime Guard is an open-source kernel module that can detect
some illegitimate modifications of kernel data. Because it runs in the same
kernel as the compromised one, an attacker could also bypass or disable
these checks.

Intel's Virtualization Based Hardening [4] [5] is an open-source
proof-of-concept of a thin hypervisor dedicated to guest protection. As
such, it cannot be used to manage several VMs.
# Similar Linux patches

Paravirtualized Control Register pinning [3] added a set of KVM IOCTLs to
restrict some flags to be set. Heki doesn't implement such a user space
interface, but only a dedicated hypercall to lock such registers. A superset
of these flags is configurable with Heki.

The Hypervisor Based Integrity patches [6] [7] only contain a generic IPC
mechanism (KVM_HC_UCALL hypercall) to request protection from the VMM. The
idea was to extend the KVM_SET_USER_MEMORY_REGION IOCTL to support more
permissions than read-only.

# Current limitations

This patch series doesn't handle VM reboot, kexec, or hibernation yet. We'd
like to leverage the related feature from the KVM CR-pinning patch series
[3]. Help appreciated!

We noticed that the KUnit tests don't work on AMD because the exception
table seems to not be properly handled (i.e. a double fault is received).
Any reason why this would differ from an Intel CPU?

What about extending register pinning to MSRs? This should first be
implemented as a kernel self-protection though.

This patch series is also a call for collaboration. There is a lot to do,
on the hypervisor, guest kernel, and VMM sides.

# Resources

You can find related resources, including previous versions, and conference
talks about this work and the related LVBS project here:
https://github.com/heki-linux

[1] https://lore.kernel.org/all/20211006173113.26445-1-ala...@bitdefender.com/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.ander...@intel.com/
[4] https://github.com/intel/vbh
[5] https://sched.co/TmwN
[6] https://sched.co/eE3f
[7] https://lore.kernel.org/all/20200501185147.208192-1-yua...@google.com/

Please reach out to us by replying to this thread, we're looking for people
to join and collaborate on this project!
Previous versions: v2: https://lore.kernel.org/r/20231113022326.24388-1-...@digikod.net v1: https://lore.kernel.org/r/20230505152046.6575-1-...@digikod.net Regards, Madhavan T. Venkataraman (1): virt: Introduce Hypervisor Enforced Kernel Integrity (Heki) Mickaël Salaün (4): KVM: x86: Add new hypercall to lock control registers KVM: x86: Add notifications for Heki policy configuration and violation heki: Lock guest control registers at the end of guest kernel init virt: Add Heki KUnit tests Documentation/virt/kvm/x86/hypercalls.rst | 17 ++ Kconfig | 2 + arch/x86/Kconfig | 1 + arch/x86/include/asm/x86_init.h | 1 + arch/x86/include/uapi/asm/kvm_para.h | 2 + arch/x86/kernel/cpu/common.c | 7 +- arch/x86/kernel/cpu/hypervisor.c | 1 + arch/x86/kernel/kvm.c | 56 +++ arch/x86/kvm/Kconfig | 1 + arch/x86/kvm/vmx/vmx.c| 6 + arch/x86/kvm/x86.c| 180 ++ arch/x86/kvm/x86.h| 23 +++ include
[RFC PATCH v3 4/5] heki: Lock guest control registers at the end of guest kernel init
The hypervisor needs to provide some functions to support Heki. These form
the Heki-Hypervisor API.

Define a heki_hypervisor structure to house the API functions. A hypervisor
that supports Heki must instantiate a heki_hypervisor structure and pass it
to the Heki common code. This allows the common code to access these
functions in a hypervisor-agnostic way.

The first function that is implemented is lock_crs() (lock control
registers). That is, certain flags in the control registers are pinned so
that they can never be changed for the lifetime of the guest.

Implement Heki support in the guest:

- Each supported hypervisor in x86 implements a set of functions for the
  guest kernel. Add an init_heki() function to that set. This function
  initializes Heki-related stuff. Call init_heki() for the detected
  hypervisor in init_hypervisor_platform().

- Implement init_heki() for the guest.

- Implement kvm_lock_crs() in the guest to lock down control registers.
  This function calls a KVM hypercall to do the job.

- Instantiate a heki_hypervisor structure that contains a pointer to
  kvm_lock_crs().

- Pass the heki_hypervisor structure to Heki common code in init_heki().

Implement a heki_late_init() function and call it at the end of kernel init.
This function calls lock_crs(). In other words, control registers of a guest
are locked down at the end of guest kernel init.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Madhavan T. Venkataraman
Signed-off-by: Madhavan T. Venkataraman
Signed-off-by: Mickaël Salaün
Link: https://lore.kernel.org/r/20240503131910.307630-5-...@digikod.net
---

Changes since v2:
* Hide CONFIG_HYPERVISOR_SUPPORTS_HEKI from users.

Changes since v1:
* Shrunk the patch to only manage the CR pinning.
--- arch/x86/include/asm/x86_init.h | 1 + arch/x86/kernel/cpu/hypervisor.c | 1 + arch/x86/kernel/kvm.c| 56 arch/x86/kvm/Kconfig | 1 + include/linux/heki.h | 22 + init/main.c | 1 + virt/heki/Kconfig| 8 - virt/heki/main.c | 25 ++ 8 files changed, 114 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 6149eabe200f..113998799473 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -128,6 +128,7 @@ struct x86_hyper_init { bool (*msi_ext_dest_id)(void); void (*init_mem_mapping)(void); void (*init_after_bootmem)(void); + void (*init_heki)(void); }; /** diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c index 553bfbfc3a1b..6085c8129e0c 100644 --- a/arch/x86/kernel/cpu/hypervisor.c +++ b/arch/x86/kernel/cpu/hypervisor.c @@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void) x86_hyper_type = h->type; x86_init.hyper.init_platform(); + x86_init.hyper.init_heki(); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 7f0732bc0ccd..a54f2c0d7cd0 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -999,6 +1000,60 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs) } #endif +#ifdef CONFIG_HEKI + +extern unsigned long cr4_pinned_mask; + +/* + * TODO: Check SMP policy consistency, e.g. with + * this_cpu_read(cpu_tlbstate.cr4) + */ +static int kvm_lock_crs(void) +{ + unsigned long cr4; + int err; + + err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP, 0); + if (err) + return err; + + cr4 = __read_cr4(); + err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask, +0); + return err; +} + +static struct heki_hypervisor kvm_heki_hypervisor = { + .lock_crs = kvm_lock_crs, +}; + +static void kvm_init_heki(void) +{ + long err; + + if (!kvm_para_available()) { + /* Cannot make KVM hypercalls. 
*/ + return; + } + + err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, 0, +KVM_LOCK_CR_UPDATE_VERSION); + if (err < 1) { + /* Ignores host not supporting at least the first version. */ + return; + } + + heki.hypervisor = _heki_hypervisor; +} + +#else /* CONFIG_HEKI */ + +static void kvm_init_heki(void) +{ +} + +#endif /* CONFIG_HEKI */ + const __initconst struct hypervisor_x86 x86_hyper_kvm = { .name = "KVM", .detect = kvm_detect, @@ -1007,6 +1062,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = { .i
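
For readers who want to reason about the pinning rule without a guest, the
idea behind lock_crs() can be modelled in plain user-space C. This is only a
sketch under stated assumptions: struct cr_pin, cr_pin_lock() and
cr_pin_write() are hypothetical names that do not appear in the series, and
the real enforcement is done by KVM on the host side, not by code like this.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one pinned control register (not kernel code). */
struct cr_pin {
	uint64_t value;  /* current register value */
	uint64_t pinned; /* bits that must stay set once locked */
};

/* Pin @flags. Pinned bits only accumulate; this model has no unpin. */
static void cr_pin_lock(struct cr_pin *cr, uint64_t flags)
{
	cr->pinned |= flags;
}

/* Accept a write only if every pinned bit is still set in @new_value. */
static int cr_pin_write(struct cr_pin *cr, uint64_t new_value)
{
	if ((new_value & cr->pinned) != cr->pinned)
		return -EPERM;

	cr->value = new_value;
	return 0;
}
```

With bit 16 (the position of X86_CR0_WP) pinned, a write that clears it is
refused and leaves the value untouched, while writes that keep it set go
through.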
[RFC PATCH v2 00/19] Hypervisor-Enforced Kernel Integrity
this as well.

We currently use static address ranges to configure protections at boot
(see heki_arch_early_init). This is not compatible with KASLR yet, but this
will be handled in a later patch series.

Because the guest's virtual address translation is not protected by the
hypervisor, a compromised kernel could map existing physical pages into
arbitrary virtual addresses. Intel's new Hypervisor-Managed Linear Address
Translation [10] (HLAT) could be used to extend the current protection and
cover this case.

ROP is not covered by this patch series. Guest kernels can still jump to
arbitrary executable pages according to their control-flow integrity
protection.

# Future work

New dynamic restrictions could extend the protected data to
security-sensitive data such as LSM states, seccomp filters, keyrings...
This requires support outside of the hypervisor.

An execute-only mode could also be useful (cf. XOM for KVM [11] [12]).

Register pinning could also be extended (e.g., to MSRs).

For now, MBEC is only supported on a bare metal machine as KVM host; nested
virtualization is not supported yet.

Being able to protect nested guests might be possible but we need to figure
out the potential security implications.

Protecting the host would be useful, but that doesn't really fit with the
KVM model. The Protected KVM project is a first step to help in this
direction [13].

We only tested this with an Intel CPU, but this approach should work the
same with an AMD CPU starting with the Zen 2 generation and their Guest Mode
Execute Trap (GMET) capability.

We also kept some TODOs to highlight missing checks and code sharing issues,
and some pr_warn() calls to help understand how it works. Tests need to be
improved (e.g., invalid hypercall arguments).

We'll present this work at the Linux Plumbers Conference next week.
[1] https://lore.kernel.org/all/20211006173113.26445-1-ala...@bitdefender.com/ [2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf [3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.ander...@intel.com/ [4] https://github.com/kvm-x86/linux [5] https://lore.kernel.org/all/20231027182217.3615211-1-sea...@google.com/ [6] https://github.com/intel/vbh [7] https://sched.co/TmwN [8] https://sched.co/eE3f [9] https://lore.kernel.org/all/20200501185147.208192-1-yua...@google.com/ [10] https://sched.co/eE4F [11] https://lore.kernel.org/kvm/20191003212400.31130-1-rick.p.edgeco...@intel.com/ [12] https://lpc.events/event/4/contributions/283/ [13] https://sched.co/eE24 Please reach out to us by replying to this thread, we're looking for people to join and collaborate on this project! Previous version: v1: https://lore.kernel.org/r/20230505152046.6575-1-...@digikod.net Regards, Madhavan T. Venkataraman (9): virt: Introduce Hypervisor Enforced Kernel Integrity (Heki) KVM: x86: Add new hypercall to set EPT permissions x86: Implement the Memory Table feature to store arbitrary per-page data heki: Implement a kernel page table walker heki: x86: Initialize permissions counters for pages mapped into KVA heki: x86: Initialize permissions counters for pages in vmap()/vunmap() heki: x86: Update permissions counters when guest page permissions change heki: x86: Update permissions counters during text patching heki: x86: Protect guest kernel memory using the KVM hypervisor Mickaël Salaün (10): KVM: x86: Add new hypercall to lock control registers KVM: x86: Add notifications for Heki policy configuration and violation heki: Lock guest control registers at the end of guest kernel init KVM: VMX: Add MBEC support KVM: x86: Add kvm_x86_ops.fault_gva() KVM: x86: Make memory attribute helpers more generic KVM: x86: Extend kvm_vm_set_mem_attributes() with a mask KVM: x86: Extend kvm_range_has_memory_attributes() with match_all KVM: x86: Implement per-guest-page 
permissions virt: Add Heki KUnit tests Documentation/virt/kvm/x86/hypercalls.rst | 31 +++ Kconfig | 2 + arch/x86/Kconfig | 1 + arch/x86/include/asm/kvm-x86-ops.h| 1 + arch/x86/include/asm/kvm_host.h | 2 + arch/x86/include/asm/vmx.h| 11 +- arch/x86/include/asm/x86_init.h | 1 + arch/x86/include/uapi/asm/kvm_para.h | 2 + arch/x86/kernel/alternative.c | 5 + arch/x86/kernel/cpu/common.c | 4 +- arch/x86/kernel/cpu/hypervisor.c | 1 + arch/x86/kernel/kvm.c | 67 + arch/x86/kernel/setup.c | 2 + arch/x86/kvm/Kconfig | 2 + arch/x86/kvm/Makefile | 4 +- arch/x86/kvm/mmu.h| 3 +- arch/x86/kvm/mmu/mmu.c| 114 ++-- arch/x86/kvm/mmu/mmutrace.h | 11 +- arch/x86/kvm/mmu/paging_tmpl.h| 19 +- arch/x86/kvm/mmu/spte.c | 19 +- arch/x86/kvm/mmu
[RFC PATCH v2 13/19] heki: Implement a kernel page table walker
From: Madhavan T. Venkataraman

The Heki feature needs to do the following:

- Find kernel mappings.

- Determine the permissions associated with each mapping.

- Determine the collective permissions for a guest physical page across
  all of its mappings.

This way, a guest physical page can reflect only the required permissions in
the EPT thanks to the KVM_HC_PROTECT_MEMORY hypercall.

Implement a kernel page table walker that walks all of the kernel mappings
and calls a callback function for each mapping.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Mickaël Salaün
Signed-off-by: Mickaël Salaün
Signed-off-by: Madhavan T. Venkataraman
--- Change since v1: * New patch and new file: virt/heki/walk.c --- include/linux/heki.h | 16 + virt/heki/Makefile | 1 + virt/heki/walk.c | 140 +++ 3 files changed, 157 insertions(+) create mode 100644 virt/heki/walk.c diff --git a/include/linux/heki.h b/include/linux/heki.h index 9b0c966c50d1..a7ae0b387dfe 100644 --- a/include/linux/heki.h +++ b/include/linux/heki.h @@ -61,6 +61,22 @@ struct heki { struct heki_hypervisor *hypervisor; }; +/* + * The kernel page table is walked to locate kernel mappings. For each + * mapping, a callback function is called. The table walker passes information + * about the mapping to the callback using this structure. + */ +struct heki_args { + /* Information passed by the table walker to the callback. */ + unsigned long va; + phys_addr_t pa; + size_t size; + unsigned long flags; +}; + +/* Callback function called by the table walker. 
*/ +typedef void (*heki_func_t)(struct heki_args *args); + extern struct heki heki; extern bool heki_enabled; diff --git a/virt/heki/Makefile b/virt/heki/Makefile index 354e567df71c..a5daa4ff7a4f 100644 --- a/virt/heki/Makefile +++ b/virt/heki/Makefile @@ -1,3 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only obj-y += main.o +obj-y += walk.o diff --git a/virt/heki/walk.c b/virt/heki/walk.c new file mode 100644 index ..e10b54226fcc --- /dev/null +++ b/virt/heki/walk.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Hypervisor Enforced Kernel Integrity (Heki) - Kernel page table walker. + * + * Copyright © 2023 Microsoft Corporation + * + * Cf. arch/x86/mm/init_64.c + */ + +#include +#include + +static void heki_walk_pte(pmd_t *pmd, unsigned long va, unsigned long va_end, + heki_func_t func, struct heki_args *args) +{ + pte_t *pte; + unsigned long next_va; + + for (pte = pte_offset_kernel(pmd, va); va < va_end; +va = next_va, pte++) { + next_va = (va + PAGE_SIZE) & PAGE_MASK; + + if (next_va > va_end) + next_va = va_end; + + if (!pte_present(*pte)) + continue; + + args->va = va; + args->pa = pte_pfn(*pte) << PAGE_SHIFT; + args->size = PAGE_SIZE; + args->flags = pte_flags(*pte); + + func(args); + } +} + +static void heki_walk_pmd(pud_t *pud, unsigned long va, unsigned long va_end, + heki_func_t func, struct heki_args *args) +{ + pmd_t *pmd; + unsigned long next_va; + + for (pmd = pmd_offset(pud, va); va < va_end; va = next_va, pmd++) { + next_va = pmd_addr_end(va, va_end); + + if (!pmd_present(*pmd)) + continue; + + if (pmd_large(*pmd)) { + args->va = va; + args->pa = pmd_pfn(*pmd) << PAGE_SHIFT; + args->pa += va & (PMD_SIZE - 1); + args->size = next_va - va; + args->flags = pmd_flags(*pmd); + + func(args); + } else { + heki_walk_pte(pmd, va, next_va, func, args); + } + } +} + +static void heki_walk_pud(p4d_t *p4d, unsigned long va, unsigned long va_end, + heki_func_t func, struct heki_args *args) +{ + pud_t *pud; + unsigned long next_va; + + for 
(pud = pud_offset(p4d, va); va < va_end; va = next_va, pud++) { + next_va = pud_addr_end(va, va_end); + + if (!pud_present(*pud)) + continue; + + if (pud_large(*pud)) { + args->va = va; + args->pa = pud_pfn(*pud) << PAGE_SHIFT; + args->pa += va & (PUD_SIZE - 1); + args->size = next_va - va; + args->flags = pud_flags(*pud); + +
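
The walker above follows a common shape: descend level by level, skip
non-present entries, and hand each leaf mapping to a callback through an
argument structure. A much reduced user-space sketch of that shape follows.
Everything prefixed with sk_ or SK_ is hypothetical and only stands in for
the kernel types; there are two levels instead of the real hierarchy, no
large pages, and a zero physical address means "not present".

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ENTRIES   4
#define SK_PAGE_SIZE 4096UL

/* Hypothetical stand-in for struct heki_args. */
struct sk_args {
	unsigned long va;
	unsigned long pa;
	size_t size;
	unsigned long count; /* extra field used by the example callback */
};

/* Callback type, mirroring the role of heki_func_t. */
typedef void (*sk_func_t)(struct sk_args *args);

/* Leaf table: 0 means "not present", otherwise a physical address. */
struct sk_leaf {
	unsigned long pa[SK_ENTRIES];
};

/* Top-level table: NULL means "not present". */
struct sk_top {
	struct sk_leaf *leaf[SK_ENTRIES];
};

/* Walk every present leaf entry and call @func for it. */
static void sk_walk(struct sk_top *top, sk_func_t func, struct sk_args *args)
{
	for (size_t i = 0; i < SK_ENTRIES; i++) {
		if (!top->leaf[i])
			continue;

		for (size_t j = 0; j < SK_ENTRIES; j++) {
			unsigned long pa = top->leaf[i]->pa[j];

			if (!pa)
				continue;

			args->va = (i * SK_ENTRIES + j) * SK_PAGE_SIZE;
			args->pa = pa;
			args->size = SK_PAGE_SIZE;
			func(args);
		}
	}
}

/* Example callback: count the mappings that were visited. */
static void sk_count(struct sk_args *args)
{
	args->count++;
}
```

Keeping the per-mapping state in the argument structure, as the patch does
with struct heki_args, lets one walker serve several consumers that differ
only in their callback.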
[RFC PATCH v2 12/19] x86: Implement the Memory Table feature to store arbitrary per-page data
From: Madhavan T. Venkataraman

This feature can be used by a consumer to associate any arbitrary pointer
with a physical page. The feature implements a page table format that
mirrors the hardware page table. A leaf entry in the table points to
consumer data for that page.

The page table format has these advantages:

- The format allows for a sparse representation. This is useful since the
  physical address space can be large and is typically sparsely populated
  in a system.

- A consumer of this feature can choose to populate data just for the pages
  it is interested in.

- Information can be stored for large pages, if a consumer wishes.

For instance, for Heki, the guest kernel uses this to create permissions
counters for each guest physical page. The permissions counters reflect the
collective permissions for a guest physical page across all mappings to that
page. This allows the guest to request the hypervisor to set only the
necessary permissions for a guest physical page in the EPT (instead of RWX).

This feature could also be used to improve KVM's memory attributes and
write page tracking.

We will support large page entries in mem_table in a future version thanks
to extra mem_table_ops's merge() and split() operations.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Mickaël Salaün
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Madhavan T. 
Venkataraman --- Changes since v1: * New patch and new file: kernel/mem_table.c --- arch/x86/kernel/setup.c | 2 + include/linux/heki.h | 1 + include/linux/mem_table.h | 55 ++ kernel/Makefile | 2 + kernel/mem_table.c| 219 ++ 5 files changed, 279 insertions(+) create mode 100644 include/linux/mem_table.h create mode 100644 kernel/mem_table.c diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index b098b1fa2470..e7ae46953ae4 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -25,6 +25,7 @@ #include #include #include +#include #include @@ -1315,6 +1316,7 @@ void __init setup_arch(char **cmdline_p) #endif unwind_init(); + mem_table_init(PG_LEVEL_4K); } #ifdef CONFIG_X86_32 diff --git a/include/linux/heki.h b/include/linux/heki.h index 89cc9273a968..9b0c966c50d1 100644 --- a/include/linux/heki.h +++ b/include/linux/heki.h @@ -15,6 +15,7 @@ #include #include #include +#include #ifdef CONFIG_HEKI diff --git a/include/linux/mem_table.h b/include/linux/mem_table.h new file mode 100644 index ..738bf12309f3 --- /dev/null +++ b/include/linux/mem_table.h @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Memory table feature - Definitions. + * + * Copyright © 2023 Microsoft Corporation. + */ + +#ifndef __MEM_TABLE_H__ +#define __MEM_TABLE_H__ + +/* clang-format off */ + +/* + * The MEM_TABLE bit is set on entries that point to an intermediate table. + * So, this bit is reserved. This means that pointers to consumer data must + * be at least two-byte aligned (so the MEM_TABLE bit is 0). + */ +#define MEM_TABLE BIT(0) +#define IS_LEAF(entry) !((uintptr_t)entry & MEM_TABLE) + +/* clang-format on */ + +/* + * A memory table is arranged exactly like a page table. The memory table + * configuration reflects the hardware page table configuration. + */ + +/* Parameters at each level of the memory table hierarchy. 
*/ +struct mem_table_level { + unsigned int number; + unsigned int nentries; + unsigned int shift; + unsigned int mask; +}; + +struct mem_table { + struct mem_table_level *level; + struct mem_table_ops *ops; + bool changed; + void *entries[]; +}; + +/* Operations that need to be supplied by a consumer of memory tables. */ +struct mem_table_ops { + void (*free)(void *buf); +}; + +void mem_table_init(unsigned int base_level); +struct mem_table *mem_table_alloc(struct mem_table_ops *ops); +void mem_table_free(struct mem_table *table); +void **mem_table_create(struct mem_table *table, phys_addr_t pa); +void **mem_table_find(struct mem_table *table, phys_addr_t pa, + unsigned int *level_num); + +#endif /* __MEM_TABLE_H__ */ diff --git a/kernel/Makefile b/kernel/Makefile index 3947122d618b..dcef03ec5c54 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -131,6 +131,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue.o obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o +obj-$(CONFIG_SPARSEMEM) += mem_table.o + CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN) obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o KASAN_SANITIZE_stackleak.o := n diff --git a/kernel/mem_table.c b/kernel/mem_table.c new file mode 100644 index ..280a1b5ddde0 --- /dev/null +++ b/kernel/mem_tab
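
The MEM_TABLE bit described in the header relies on pointer alignment: bit 0
of a suitably aligned pointer is always zero, so it can be borrowed to tell
an intermediate table from consumer data. A small user-space sketch of that
tagging follows. The sk_ helpers are hypothetical and are not the mem_table
API; they are written as inline functions rather than macros so that
operator precedence cannot surprise a caller passing an expression.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bit reserved to mark "this entry points to an intermediate table". */
#define SK_TABLE_BIT ((uintptr_t)1)

/* Tag a pointer to an intermediate table. */
static inline void *sk_tag_table(void *table)
{
	return (void *)((uintptr_t)table | SK_TABLE_BIT);
}

/* An entry is a leaf (consumer data) when the reserved bit is clear. */
static inline bool sk_is_leaf(const void *entry)
{
	return !((uintptr_t)entry & SK_TABLE_BIT);
}

/* Recover the real pointer from a possibly tagged entry. */
static inline void *sk_untag(void *entry)
{
	return (void *)((uintptr_t)entry & ~SK_TABLE_BIT);
}
```

This only works because consumer pointers are required to be at least
two-byte aligned, as the comment in mem_table.h states.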
[RFC PATCH v2 08/19] KVM: x86: Extend kvm_vm_set_mem_attributes() with a mask
Make it possible to update only a subset of attributes. This is needed to
be able to use the XArray for different use cases and make sure they don't
interfere (see a following commit).

Cc: Chao Peng
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Sean Christopherson
Cc: Yu Zhang
Signed-off-by: Mickaël Salaün
--- Changes since v1: * New patch --- arch/x86/kvm/mmu/mmu.c | 2 +- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 27 +++ 3 files changed, 21 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4d378d308762..d7010e09440d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7283,7 +7283,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot, for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) { if (hugepage_test_mixed(slot, gfn, level - 1) || - attrs != kvm_get_memory_attributes(kvm, gfn)) + !(attrs & kvm_get_memory_attributes(kvm, gfn))) return false; } return true; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 85b8648fd892..de68390ab0f2 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2397,7 +2397,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range); int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, - unsigned long attributes); + unsigned long attributes, unsigned long mask); static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 0096ccfbb609..e2c178db17d5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2436,7 +2436,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES /* * Returns true if _all_ gfns in the range [@start, @end) have attributes - * matching @attrs. + * matching the @attrs bitmask. 
*/ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end, unsigned long attrs) @@ -2459,7 +2459,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end, entry = xas_next(); } while (xas_retry(, entry)); - if (xas.xa_index != index || xa_to_value(entry) != attrs) { + if (xas.xa_index != index || + (xa_to_value(entry) & attrs) != attrs) { has_attrs = false; break; } @@ -2553,7 +2554,7 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm, /* Set @attributes for the gfn range [@start, @end). */ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, -unsigned long attributes) + unsigned long attributes, unsigned long mask) { struct kvm_mmu_notifier_range pre_set_range = { .start = start, @@ -2572,11 +2573,8 @@ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, .may_block = true, }; unsigned long i; - void *entry; int r = 0; - entry = attributes ? xa_mk_value(attributes) : NULL; - lockdep_assert_held(>slots_arch_lock); /* Nothing to do if the entire range as the desired attributes. */ @@ -2596,6 +2594,16 @@ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, kvm_handle_gfn_range(kvm, _set_range); for (i = start; i < end; i++) { + unsigned long value = 0; + void *entry; + + entry = xa_load(>mem_attr_array, i); + if (xa_is_value(entry)) + value = xa_to_value(entry) & ~mask; + + value |= attributes & mask; + entry = value ? xa_mk_value(value) : NULL; + r = xa_err(xa_store(>mem_attr_array, i, entry, GFP_KERNEL_ACCOUNT)); KVM_BUG_ON(r, kvm); @@ -2609,12 +2617,14 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm, struct kvm_memory_attributes *attrs) { int r; + unsigned long attrs_mask; gfn_t start, end; /* flags is currently not used. 
*/ if (attrs->flags) return -EINVAL; - if (attrs->attributes & ~kvm_supported_mem_attributes(kvm)) + attrs_mask = kvm_supported_mem_attributes(kvm); + if (attrs->attributes & ~attrs_mask) return -EINVAL; if (attrs->size == 0 || attrs->address + attrs->size < attrs->addres
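
The per-gfn update in the loop of kvm_vm_set_mem_attributes() is a classic
masked read-modify-write: bits outside @mask keep their old value, bits
inside @mask take the value from @attributes. Isolated from the XArray
handling, it reduces to the expression below. sk_update_attrs() is a
hypothetical helper written for illustration only; it is not part of the
patch.

```c
/*
 * Return the new attribute word: keep the bits of @old that are outside
 * @mask, and take the bits of @attributes that are inside @mask.
 */
static unsigned long sk_update_attrs(unsigned long old,
				     unsigned long attributes,
				     unsigned long mask)
{
	return (old & ~mask) | (attributes & mask);
}
```

With an empty mask nothing changes, which is what lets two users of the same
XArray entry update their own bits without disturbing each other.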
[RFC PATCH v2 09/19] KVM: x86: Extend kvm_range_has_memory_attributes() with match_all
This makes it possible to check if an attribute is tied to any memory page
in a range. This will be useful in a following commit to check for
KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE.

Cc: Chao Peng
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Sean Christopherson
Cc: Yu Zhang
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* New patch
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 27 ++++++++++++++++++---------
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d7010e09440d..2024ff21d036 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7279,7 +7279,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
 	if (level == PG_LEVEL_2M)
-		return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+		return kvm_range_has_memory_attributes(kvm, start, end, attrs, true);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index de68390ab0f2..9ecb016a336f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2391,7 +2391,7 @@ static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn
 }
 
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long attrs);
+				     unsigned long attrs, bool match_all);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e2c178db17d5..67dbaaf40c1c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2435,11 +2435,11 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 /*
- * Returns true if _all_ gfns in the range [@start, @end) have attributes
- * matching the @attrs bitmask.
+ * According to @match_all, returns true if _all_ (respectively _any_) gfns in
+ * the range [@start, @end) have attributes matching the @attrs bitmask.
  */
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long attrs)
+				     unsigned long attrs, bool match_all)
 {
 	XA_STATE(xas, &kvm->mem_attr_array, start);
 	unsigned long index;
@@ -2453,16 +2453,25 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		goto out;
 	}
 
-	has_attrs = true;
+	has_attrs = match_all;
 	for (index = start; index < end; index++) {
 		do {
 			entry = xas_next(&xas);
 		} while (xas_retry(&xas, entry));
 
-		if (xas.xa_index != index ||
-		    (xa_to_value(entry) & attrs) != attrs) {
-			has_attrs = false;
-			break;
+		if (match_all) {
+			if (xas.xa_index != index ||
+			    (xa_to_value(entry) & attrs) != attrs) {
+				has_attrs = false;
+				break;
+			}
+		} else {
+			index = xas.xa_index;
+			if (index < end &&
+			    (xa_to_value(entry) & attrs) == attrs) {
+				has_attrs = true;
+				break;
+			}
 		}
 	}
 
@@ -2578,7 +2587,7 @@ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	lockdep_assert_held(&kvm->slots_arch_lock);
 
 	/* Nothing to do if the entire range as the desired attributes. */
-	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
+	if (kvm_range_has_memory_attributes(kvm, start, end, attributes, true))
 		return r;
 
 	/*
-- 
2.42.1
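
The difference between the two modes of kvm_range_has_memory_attributes()
is easier to see on a flat array than on an XArray. The sketch below is a
user-space model only: sk_range_has_attrs() is a hypothetical name, every
index in the range is assumed to have an entry (the real code also has to
cope with missing XArray entries and retries), and the early-exit structure
is simplified.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the all/any check over attr[start..end).
 * An entry matches when it has every bit of @attrs set.
 * - match_all: true only if every entry matches.
 * - !match_all: true as soon as one entry matches.
 */
static bool sk_range_has_attrs(const unsigned long *attr, size_t start,
			       size_t end, unsigned long attrs, bool match_all)
{
	for (size_t i = start; i < end; i++) {
		bool match = (attr[i] & attrs) == attrs;

		if (match_all && !match)
			return false;
		if (!match_all && match)
			return true;
	}

	/* No counter-example (all) or no example (any) was found. */
	return match_all;
}
```

As in the patch, the result starts from match_all and a single entry is
enough to flip it: a non-matching entry in "all" mode, a matching entry in
"any" mode.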
[RFC PATCH v2 18/19] heki: x86: Protect guest kernel memory using the KVM hypervisor
From: Madhavan T. Venkataraman

Implement a hypervisor function, kvm_protect_memory(), that calls the
KVM_HC_PROTECT_MEMORY hypercall to request the KVM hypervisor to set
specified permissions on a list of guest pages.

Using the protect_memory() function, set proper EPT permissions for all
guest pages.

Use the MEM_ATTR_IMMUTABLE property to protect the kernel static sections
and the boot-time read-only sections. This makes sure a compromised guest
will not be able to change its main physical memory page permissions.
However, this also disables any feature that may change the kernel's text
section (e.g., ftrace, Kprobes), but they can still be used on kernel
modules.

Module loading/unloading and eBPF JIT are allowed without restrictions for
now, but we'll need a way to authenticate these code changes to really
improve the guests' security. We plan to use module signatures, but there is
no solution yet to authenticate eBPF programs.

Being able to use ftrace and Kprobes in a secure way is a challenge not
solved yet. We're looking for ideas to make this work.

Likewise, the JUMP_LABEL feature cannot work because the kernel's text
section is read-only.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Mickaël Salaün
Signed-off-by: Mickaël Salaün
Signed-off-by: Madhavan T. 
Venkataraman --- Changes since v1: * New patch --- arch/x86/kernel/kvm.c | 11 ++ arch/x86/kvm/mmu/mmu.c | 2 +- arch/x86/mm/heki.c | 21 ++ include/linux/heki.h | 26 virt/heki/Kconfig | 1 + virt/heki/counters.c | 90 -- virt/heki/main.c | 83 +- 7 files changed, 229 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8349f4ad3bbd..343615b0e3bf 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -1021,8 +1021,19 @@ static int kvm_lock_crs(void) return err; } +static int kvm_protect_memory(gpa_t pa) +{ + long err; + + WARN_ON_ONCE(in_interrupt()); + + err = kvm_hypercall1(KVM_HC_PROTECT_MEMORY, pa); + return err; +} + static struct heki_hypervisor kvm_heki_hypervisor = { .lock_crs = kvm_lock_crs, + .protect_memory = kvm_protect_memory, }; static void kvm_init_heki(void) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2d09bcc35462..13be05e9ccf1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7374,7 +7374,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, int level; lockdep_assert_held_write(>mmu_lock); - lockdep_assert_held(>slots_lock); + lockdep_assert_held(>slots_arch_lock); /* * The sequence matters here: upper levels consume the result of lower diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c index e4c60d8b4f2d..6c3fa9defada 100644 --- a/arch/x86/mm/heki.c +++ b/arch/x86/mm/heki.c @@ -45,6 +45,19 @@ __init void heki_arch_early_init(void) heki_map(direct_map_end, kernel_end); } +void heki_arch_late_init(void) +{ + /* +* The permission counters for all existing kernel mappings have +* already been updated. Now, walk all the pages, compute their +* permissions from the counters and apply the permissions in the +* host page table. To accomplish this, we walk the direct map +* range. 
+*/ + heki_protect(direct_map_va, direct_map_end); + pr_warn("Guest memory protected\n"); +} + unsigned long heki_flags_to_permissions(unsigned long flags) { unsigned long permissions; @@ -67,6 +80,11 @@ void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set, *clear |= MEM_ATTR_EXEC; } +unsigned long heki_default_permissions(void) +{ + return MEM_ATTR_READ | MEM_ATTR_WRITE; +} + static unsigned long heki_pgprot_to_flags(pgprot_t prot) { unsigned long flags = 0; @@ -100,6 +118,9 @@ static void heki_text_poke_common(struct page **pages, int npages, heki_callback(); } + if (args.head) + heki_apply_permissions(); + mutex_unlock(_lock); } diff --git a/include/linux/heki.h b/include/linux/heki.h index 6f2cfddc6dac..306bcec7ae92 100644 --- a/include/linux/heki.h +++ b/include/linux/heki.h @@ -15,6 +15,8 @@ #include #include #include +#include +#include #include #ifdef CONFIG_HEKI @@ -61,6 +63,7 @@ struct heki_page_list { */ struct heki_hypervisor { int (*lock_crs)(void); /* Lock control registers. */ + int (*protect_memory)(gpa_t pa); /* Protect guest memory */ }; /* @@ -74,16 +77,28 @@ struct heki_hypervisor { * - a page is mapped into the kernel address space * - a pa
[RFC PATCH v2 01/19] virt: Introduce Hypervisor Enforced Kernel Integrity (Heki)
From: Madhavan T. Venkataraman

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Implement minimal code to introduce Heki:

- Define the config variables.

- Define a kernel command line parameter "heki" to turn the feature on
  or off. By default, Heki is on.

- Define heki_early_init() and call it in start_kernel(). Currently,
  this function only prints the value of the "heki" command line
  parameter.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Mickaël Salaün
Signed-off-by: Mickaël Salaün
Signed-off-by: Madhavan T. Venkataraman
---

Changes since v1:
* Shrunk this patch to only contain the minimal common parts.
* Moved heki_early_init() to start_kernel().
---
 Kconfig              |  2 ++
 arch/x86/Kconfig     |  1 +
 include/linux/heki.h | 31 +++
 init/main.c          |  2 ++
 mm/mm_init.c         |  1 +
 virt/Makefile        |  1 +
 virt/heki/Kconfig    | 19 +++
 virt/heki/Makefile   |  3 +++
 virt/heki/common.h   | 16
 virt/heki/main.c     | 32
 10 files changed, 108 insertions(+)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/common.h
 create mode 100644 virt/heki/main.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..0c844d9bcb03 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"
 
 source "lib/Kconfig.debug"
 
+source "virt/heki/Kconfig"
+
 source "Documentation/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..424f949442bd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select ARCH_SUPPORTS_HEKI
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/include/linux/heki.h b/include/linux/heki.h
new file mode 100644
index 000000000000..4c18d2283392
--- /dev/null
+++ b/include/linux/heki.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Definitions
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#ifndef __HEKI_H__
+#define __HEKI_H__
+
+#include
+#include
+#include
+#include
+#include
+
+#ifdef CONFIG_HEKI
+
+extern bool heki_enabled;
+
+void heki_early_init(void);
+
+#else /* !CONFIG_HEKI */
+
+static inline void heki_early_init(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
+#endif /* __HEKI_H__ */
diff --git a/init/main.c b/init/main.c
index 436d73261810..0d28301c5402 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -1047,6 +1048,7 @@ void start_kernel(void)
 	uts_ns_init();
 	key_init();
 	security_init();
+	heki_early_init();
 	dbg_late_init();
 	net_ns_init();
 	vfs_caches_init();
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..896977383cc3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..4550dc624466 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y += lib/
+obj-$(CONFIG_HEKI) += heki/
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
new file mode 100644
index 000000000000..49695fff6d21
--- /dev/null
+++ b/virt/heki/Kconfig
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Hypervisor Enforced Kernel Integrity (Heki)
+
+config HEKI
+	bool "Hypervisor Enforced Kernel Integrity (Heki)"
+	depends on ARCH_SUPPORTS_HEKI
+	help
+	  This feature enhances guest virtual machine security by taking
+	  advantage of security features provided by the hypervisor for guests.
+	  This feature is helpful in maintaining guest virtual machine security
+	  even after the guest kernel has been compromised.
+
+config ARCH_SUPPORTS_HEKI
+	bool "Architecture support for Heki"
+	help
+	  An architecture should select this when it can successfully build
+	  and run with CONFIG_HEKI. That is, it should provide all of the
+	  architecture support required for the HEKI feature.
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
new file mode 100644
index 000000000000..354e567df71c
--- /dev/null
+++ b/virt/heki/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-y += main.o
diff --git a/virt/heki/common.h b/virt/heki/commo
[RFC PATCH v2 17/19] heki: x86: Update permissions counters during text patching
From: Madhavan T. Venkataraman

x86 uses a function called __text_poke() to modify executable code. This
patching function is used by many features such as Kprobes and ftrace.

Update the permissions counters for the text page so that write
permissions can be temporarily established in the EPT to modify the
instructions in that page.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Mickaël Salaün
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Madhavan T. Venkataraman
---

Changes since v1:
* New patch
---
 arch/x86/kernel/alternative.c |  5
 arch/x86/mm/heki.c            | 49 +++
 include/linux/heki.h          | 14 ++
 3 files changed, 68 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 517ee01503be..64fd8757ba5c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1801,6 +1802,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
 
+	heki_text_poke_start(pages, cross_page_boundary ? 2 : 1, pgprot);
 	/*
 	 * The lock is not really needed, but this allows to avoid open-coding.
 	 */
@@ -1865,7 +1867,10 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	}
 
 	local_irq_restore(flags);
+
 	pte_unmap_unlock(ptep, ptl);
+	heki_text_poke_end(pages, cross_page_boundary ? 2 : 1, pgprot);
+
 	return addr;
 }
 
diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c
index c0eace9e343f..e4c60d8b4f2d 100644
--- a/arch/x86/mm/heki.c
+++ b/arch/x86/mm/heki.c
@@ -5,8 +5,11 @@
  * Copyright © 2023 Microsoft Corporation
  */
 
+#include
+#include
 #include
 #include
+#include
 
 #ifdef pr_fmt
 #undef pr_fmt
@@ -63,3 +66,49 @@ void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set,
 	if (pgprot_val(prot) & _PAGE_NX)
 		*clear |= MEM_ATTR_EXEC;
 }
+
+static unsigned long heki_pgprot_to_flags(pgprot_t prot)
+{
+	unsigned long flags = 0;
+
+	if (pgprot_val(prot) & _PAGE_RW)
+		flags |= _PAGE_RW;
+	if (pgprot_val(prot) & _PAGE_NX)
+		flags |= _PAGE_NX;
+	return flags;
+}
+
+static void heki_text_poke_common(struct page **pages, int npages,
+				  pgprot_t prot, enum heki_cmd cmd)
+{
+	struct heki_args args = {
+		.cmd = cmd,
+	};
+	unsigned long va = poking_addr;
+	int i;
+
+	if (!heki.counters)
+		return;
+
+	mutex_lock(&heki_lock);
+
+	for (i = 0; i < npages; i++, va += PAGE_SIZE) {
+		args.va = va;
+		args.pa = page_to_pfn(pages[i]) << PAGE_SHIFT;
+		args.size = PAGE_SIZE;
+		args.flags = heki_pgprot_to_flags(prot);
+		heki_callback(&args);
+	}
+
+	mutex_unlock(&heki_lock);
+}
+
+void heki_text_poke_start(struct page **pages, int npages, pgprot_t prot)
+{
+	heki_text_poke_common(pages, npages, prot, HEKI_MAP);
+}
+
+void heki_text_poke_end(struct page **pages, int npages, pgprot_t prot)
+{
+	heki_text_poke_common(pages, npages, prot, HEKI_UNMAP);
+}
diff --git a/include/linux/heki.h b/include/linux/heki.h
index 079b34af07f0..6f2cfddc6dac 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -111,6 +111,7 @@ typedef void (*heki_func_t)(struct heki_args *args);
 
 extern struct heki heki;
 extern bool heki_enabled;
+extern struct mutex heki_lock;
 
 extern bool __read_mostly enable_mbec;
 
@@ -123,12 +124,15 @@ void heki_map(unsigned long va, unsigned long end);
 void heki_update(unsigned long va, unsigned long end, unsigned long set,
 		 unsigned long clear);
 void heki_unmap(unsigned long va, unsigned long end);
+void heki_callback(struct heki_args *args);
 
 /* Arch-specific functions. */
 void heki_arch_early_init(void);
 unsigned long heki_flags_to_permissions(unsigned long flags);
 void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set,
 				unsigned long *clear);
+void heki_text_poke_start(struct page **pages, int npages, pgprot_t prot);
+void heki_text_poke_end(struct page **pages, int npages, pgprot_t prot);
 
 #else /* !CONFIG_HEKI */
 
@@ -149,6 +153,16 @@ static inline void heki_unmap(unsigned long va, unsigned long end)
 {
 }
 
+/* Arch-specific functions. */
+static inline void heki_text_poke_start(struct page **pages, int npages,
+					pgprot_t prot)
+{
+}
+static inline void heki_text_poke_end(struct page **pages, int npages,
+				      pgprot_t prot)
+{
+}
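
For readers following the text-poke bracketing in this patch, here is a minimal user-space sketch of the idea, not the patch's implementation. It assumes that a present mapping is always readable, that _PAGE_RW adds write and _PAGE_NX removes execute, and that a temporary poking alias only adds write permission to the page for as long as it exists. Every sketch_* name and both SKETCH_PAGE_* constants are hypothetical stand-ins; only the MEM_ATTR_* bit values are taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Permission bits, same values as the MEM_ATTR_* defines in this series. */
#define MEM_ATTR_READ  (1UL << 0)
#define MEM_ATTR_WRITE (1UL << 1)
#define MEM_ATTR_EXEC  (1UL << 2)

/* Hypothetical stand-ins for the x86 _PAGE_RW and _PAGE_NX bits. */
#define SKETCH_PAGE_RW (1ULL << 1)
#define SKETCH_PAGE_NX (1ULL << 63)

/* Assumed translation: readable when mapped, RW adds W, NX removes X. */
unsigned long sketch_flags_to_permissions(uint64_t flags)
{
	unsigned long perm = MEM_ATTR_READ;

	if (flags & SKETCH_PAGE_RW)
		perm |= MEM_ATTR_WRITE;
	if (!(flags & SKETCH_PAGE_NX))
		perm |= MEM_ATTR_EXEC;
	return perm;
}

/* Hypothetical per-page state for a kernel text page. */
struct sketch_text_page {
	unsigned long base_perm;   /* from the long-lived text mapping */
	unsigned int poke_writers; /* temporary writable aliases in flight */
};

/* Mirrors the role of heki_text_poke_start(): a writable alias appears. */
void sketch_text_poke_start(struct sketch_text_page *pg)
{
	pg->poke_writers++;
}

/* Mirrors the role of heki_text_poke_end(): the alias is torn down. */
void sketch_text_poke_end(struct sketch_text_page *pg)
{
	assert(pg->poke_writers > 0);
	pg->poke_writers--;
}

/* Permissions that would be requested from the hypervisor right now. */
unsigned long sketch_effective(const struct sketch_text_page *pg)
{
	unsigned long perm = pg->base_perm;

	if (pg->poke_writers)
		perm |= MEM_ATTR_WRITE;
	return perm;
}
```

In this model a text page is R_X outside the start/end pair and RWX only inside it, which is the window the commit message describes as temporary.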
[RFC PATCH v2 15/19] heki: x86: Initialize permissions counters for pages in vmap()/vunmap()
From: Madhavan T. Venkataraman

When a page gets mapped, create permissions counters for it and
initialize them based on the specified permissions.

When a page gets unmapped, update the counters appropriately.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Mickaël Salaün
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Madhavan T. Venkataraman
---

Changes since v1:
* New patch
---
 include/linux/heki.h | 11 ++-
 mm/vmalloc.c         |  7 +++
 virt/heki/counters.c | 20
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 86c787d121e0..d660994d34d0 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -68,7 +68,11 @@ struct heki_hypervisor {
  * pointer into this heki structure.
  *
  * During guest kernel boot, permissions counters for each guest page are
- * initialized based on the page's current permissions.
+ * initialized based on the page's current permissions. Beyond this point,
+ * the counters are updated whenever:
+ *
+ * - a page is mapped into the kernel address space
+ * - a page is unmapped from the kernel address space
  */
 struct heki {
 	struct heki_hypervisor *hypervisor;
@@ -77,6 +81,7 @@ struct heki {
 
 enum heki_cmd {
 	HEKI_MAP,
+	HEKI_UNMAP,
 };
 
 /*
@@ -109,6 +114,7 @@ void heki_counters_init(void);
 void heki_walk(unsigned long va, unsigned long va_end, heki_func_t func,
 	       struct heki_args *args);
 void heki_map(unsigned long va, unsigned long end);
+void heki_unmap(unsigned long va, unsigned long end);
 
 /* Arch-specific functions. */
 void heki_arch_early_init(void);
@@ -125,6 +131,9 @@ static inline void heki_late_init(void)
 static inline void heki_map(unsigned long va, unsigned long end)
 {
 }
+static inline void heki_unmap(unsigned long va, unsigned long end)
+{
+}
 
 #endif /* CONFIG_HEKI */
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a3fedb3ee0db..d9096502e571 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -301,6 +302,8 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
 	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
 		arch_sync_kernel_mappings(start, end);
 
+	heki_map(start, end);
+
 	return err;
 }
 
@@ -419,6 +422,8 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
 	pgtbl_mod_mask mask = 0;
 
 	BUG_ON(addr >= end);
+	heki_unmap(start, end);
+
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -564,6 +569,8 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
 		arch_sync_kernel_mappings(start, end);
 
+	heki_map(start, end);
+
 	return 0;
 }
 
diff --git a/virt/heki/counters.c b/virt/heki/counters.c
index 7067449cabca..adc8d566b8a9 100644
--- a/virt/heki/counters.c
+++ b/virt/heki/counters.c
@@ -88,6 +88,13 @@ void heki_callback(struct heki_args *args)
 		heki_update_counters(counters, 0, permissions, 0);
 		break;
 
+	case HEKI_UNMAP:
+		if (WARN_ON_ONCE(!counters))
+			break;
+		heki_update_counters(counters, permissions, 0,
+				     permissions);
+		break;
+
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -124,6 +131,19 @@ void heki_map(unsigned long va, unsigned long end)
 	heki_func(va, end, &args);
 }
 
+/*
+ * Find the mappings in the given range and revert the permission counters for
+ * them.
+ */
+void heki_unmap(unsigned long va, unsigned long end)
+{
+	struct heki_args args = {
+		.cmd = HEKI_UNMAP,
+	};
+
+	heki_func(va, end, &args);
+}
+
 /*
  * Permissions counters are associated with each guest page using the
  * Memory Table feature. Initialize the permissions counters here.
-- 
2.42.1
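
The map/unmap accounting described in this patch can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions, not the code in virt/heki/counters.c: it assumes one counter per permission per page, that the enforced permissions are the union of what the live mappings ask for, and that a page with no kernel mapping falls back to read-write (the value heki_default_permissions() returns later in the series). The struct and the sketch_* functions are hypothetical names.

```c
#include <assert.h>

/* Permission bits, same values as the MEM_ATTR_* defines in this series. */
#define MEM_ATTR_READ  (1UL << 0)
#define MEM_ATTR_WRITE (1UL << 1)
#define MEM_ATTR_EXEC  (1UL << 2)

/* Hypothetical per-page state: live mappings granting each permission. */
struct sketch_counters {
	unsigned int read;
	unsigned int write;
	unsigned int exec;
};

/* A mapping of the page was created with these permissions. */
void sketch_map(struct sketch_counters *c, unsigned long perm)
{
	if (perm & MEM_ATTR_READ)
		c->read++;
	if (perm & MEM_ATTR_WRITE)
		c->write++;
	if (perm & MEM_ATTR_EXEC)
		c->exec++;
}

/* That mapping was removed; it must mirror an earlier sketch_map(). */
void sketch_unmap(struct sketch_counters *c, unsigned long perm)
{
	if (perm & MEM_ATTR_READ) {
		assert(c->read > 0);
		c->read--;
	}
	if (perm & MEM_ATTR_WRITE) {
		assert(c->write > 0);
		c->write--;
	}
	if (perm & MEM_ATTR_EXEC) {
		assert(c->exec > 0);
		c->exec--;
	}
}

/* Permissions that would be requested from the hypervisor for the page. */
unsigned long sketch_effective(const struct sketch_counters *c)
{
	unsigned long perm = 0;

	if (c->read)
		perm |= MEM_ATTR_READ;
	if (c->write)
		perm |= MEM_ATTR_WRITE;
	if (c->exec)
		perm |= MEM_ATTR_EXEC;

	/* Assumption: an unmapped page reverts to the read-write default. */
	if (!perm)
		perm = MEM_ATTR_READ | MEM_ATTR_WRITE;
	return perm;
}
```

Counting mappings rather than storing a single permission value is what lets two aliases of the same physical page coexist: removing one alias only drops the permissions that no remaining alias still needs.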
[RFC PATCH v2 19/19] virt: Add Heki KUnit tests
This adds a new CONFIG_HEKI_TEST option to run tests at boot. Because we
use some symbols not exported to modules (e.g., kernel_set_to_readonly),
these tests cannot be built as modules.

To run these tests, boot the kernel with the heki_test=N boot argument,
with N selecting a specific test:
1. heki_test_cr_disable_smep: Check CR pinning and try to disable SMEP.
2. heki_test_write_to_const: Check .rodata (const) protection.
3. heki_test_write_to_ro_after_init: Check __ro_after_init protection.
4. heki_test_exec: Check non-executable kernel memory.

This way of selecting tests should no longer be required once the kernel
properly handles the triggered synthetic page faults. For now, these
page faults make the kernel loop.

All these tests temporarily disable the related kernel self-protections
and should therefore fail if Heki doesn't protect the kernel. They are
verbose to make it easier to understand what is going on.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* Move all tests to virt/heki/tests.c
---
 include/linux/heki.h |   1 +
 virt/heki/Kconfig    |  12 +++
 virt/heki/Makefile   |   1 +
 virt/heki/main.c     |   6 +-
 virt/heki/tests.c    | 207 +++
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 virt/heki/tests.c

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 306bcec7ae92..9e2cf0051ab0 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -149,6 +149,7 @@ void heki_protect(unsigned long va, unsigned long end);
 void heki_add_pa(struct heki_args *args, phys_addr_t pa,
 		 unsigned long permissions);
 void heki_apply_permissions(struct heki_args *args);
+void heki_run_test(void);
 
 /* Arch-specific functions. */
 void heki_arch_early_init(void);
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 9bde84cd759e..fa814a921bb0 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -28,3 +28,15 @@ config HYPERVISOR_SUPPORTS_HEKI
 	  A hypervisor should select this when it can successfully build
 	  and run with CONFIG_HEKI. That is, it should provide all of the
 	  hypervisor support required for the Heki feature.
+
+config HEKI_TEST
+	bool "Tests for Heki" if !KUNIT_ALL_TESTS
+	depends on HEKI && KUNIT=y
+	default KUNIT_ALL_TESTS
+	help
+	  Run Heki tests at runtime according to the heki_test=N boot
+	  parameter, with N identifying the test to run (between 1 and 4).
+
+	  Before launching the init process, the system might not respond
+	  because of unhandled kernel page fault. This will be fixed in a
+	  next patch series.
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
index 564f92faa9d8..a66cd0ba140b 100644
--- a/virt/heki/Makefile
+++ b/virt/heki/Makefile
@@ -3,3 +3,4 @@
 obj-y += main.o
 obj-y += walk.o
 obj-y += counters.o
+obj-y += tests.o
diff --git a/virt/heki/main.c b/virt/heki/main.c
index 5629334112e7..ce9984231996 100644
--- a/virt/heki/main.c
+++ b/virt/heki/main.c
@@ -51,8 +51,10 @@ void heki_late_init(void)
 {
 	struct heki_hypervisor *hypervisor = heki.hypervisor;
 
-	if (!heki.counters)
+	if (!heki.counters) {
+		heki_run_test();
 		return;
+	}
 
 	/* Locks control registers so a compromised guest cannot change them. */
 	if (WARN_ON(hypervisor->lock_crs()))
@@ -61,6 +63,8 @@ void heki_late_init(void)
 	pr_warn("Control registers locked\n");
 
 	heki_arch_late_init();
+
+	heki_run_test();
 }
 
 /*
diff --git a/virt/heki/tests.c b/virt/heki/tests.c
new file mode 100644
index 000000000000..6e6542b257f1
--- /dev/null
+++ b/virt/heki/tests.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Common code
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "common.h"
+
+#ifdef CONFIG_HEKI_TEST
+
+/* Heki test data */
+
+/* Takes two pages to not change permission of other read-only pages. */
+const char heki_test_const_buf[PAGE_SIZE * 2] = {};
+char heki_test_ro_after_init_buf[PAGE_SIZE * 2] __ro_after_init = {};
+
+long heki_test_exec_data(long);
+void _test_exec_data_end(void);
+
+/* Used to test ROP execution against the .rodata section. */
+/* clang-format off */
+asm(
+".pushsection .rodata;" // NOT .text section
+".global heki_test_exec_data;"
+".type heki_test_exec_data, @function;"
+"heki_test_exec_data:"
+ASM_ENDBR
+"movq %rdi, %rax;"
+"inc %rax;"
+ASM_RET
+".size heki_test_exec_da
[RFC PATCH v2 10/19] KVM: x86: Implement per-guest-page permissions
Define memory attributes that can be associated with guest physical
pages in KVM. To begin with, define permissions as memory attributes
(READ, WRITE and EXECUTE), and the IMMUTABLE property. In the future,
other attributes could be defined.

Use the memory attribute feature to implement the following functions in
KVM:

- kvm_permissions_set(): Set the permissions for a guest page in the
  memory attribute XArray.

- kvm_permissions_get(): Retrieve the permissions associated with a
  guest page in the same XArray.

These functions will be called in a following commit to associate proper
permissions with guest pages instead of RWX for all the pages.

Add 4 new memory attributes, private to the KVM implementation:
- KVM_MEMORY_ATTRIBUTE_HEKI_READ
- KVM_MEMORY_ATTRIBUTE_HEKI_WRITE
- KVM_MEMORY_ATTRIBUTE_HEKI_EXEC
- KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Mickaël Salaün
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Madhavan T. Venkataraman
Signed-off-by: Madhavan T. Venkataraman
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* New patch replacing the deprecated page tracking mechanism.
* Add new files: virt/lib/kvm_permissions.c and
  include/linux/kvm_mem_attr.h
* Add new kvm_permissions_get() and kvm_permissions_set() leveraging
  the to-be-upstreamed memory attributes for KVM.
* Introduce the KVM_MEMORY_ATTRIBUTE_HEKI_* values.
---
 arch/x86/kvm/Kconfig         |   1 +
 arch/x86/kvm/Makefile        |   4 +-
 include/linux/kvm_mem_attr.h |  32 +++
 include/uapi/linux/kvm.h     |   5 ++
 virt/heki/Kconfig            |   1 +
 virt/lib/kvm_permissions.c   | 104 +++
 6 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kvm_mem_attr.h
 create mode 100644 virt/lib/kvm_permissions.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7a3b52b7e456..ea6d73241632 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,7 @@ config KVM
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select HYPERVISOR_SUPPORTS_HEKI
+	select SPARSEMEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 80e3fe184d17..aac51a5d2cae 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,10 +9,12 @@ endif
 
 include $(srctree)/virt/kvm/Makefile.kvm
 
+VIRT_LIB = ../../../virt/lib
+
 kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
 	 i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
 	 hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
-	 mmu/spte.o
+	 mmu/spte.o $(VIRT_LIB)/kvm_permissions.o
 
 ifdef CONFIG_HYPERV
 kvm-y += kvm_onhyperv.o
diff --git a/include/linux/kvm_mem_attr.h b/include/linux/kvm_mem_attr.h
new file mode 100644
index 000000000000..0a755025e553
--- /dev/null
+++ b/include/linux/kvm_mem_attr.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM guest page permissions - Definitions.
+ *
+ * Copyright © 2023 Microsoft Corporation.
+ */
+#ifndef __KVM_MEM_ATTR_H__
+#define __KVM_MEM_ATTR_H__
+
+#include
+#include
+
+/* clang-format off */
+
+#define MEM_ATTR_READ		BIT(0)
+#define MEM_ATTR_WRITE		BIT(1)
+#define MEM_ATTR_EXEC		BIT(2)
+#define MEM_ATTR_IMMUTABLE	BIT(3)
+
+#define MEM_ATTR_PROT ( \
+	MEM_ATTR_READ | \
+	MEM_ATTR_WRITE | \
+	MEM_ATTR_EXEC | \
+	MEM_ATTR_IMMUTABLE)
+
+/* clang-format on */
+
+int kvm_permissions_set(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end,
+			unsigned long heki_attr);
+unsigned long kvm_permissions_get(struct kvm *kvm, gfn_t gfn);
+
+#endif /* __KVM_MEM_ATTR_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2477b4a16126..2b5b90216565 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2319,6 +2319,11 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE		(1ULL << 3)
 
+#define KVM_MEMORY_ATTRIBUTE_HEKI_READ		(1ULL << 4)
+#define KVM_MEMORY_ATTRIBUTE_HEKI_WRITE		(1ULL << 5)
+#define KVM_MEMORY_ATTRIBUTE_HEKI_EXEC		(1ULL << 6)
+#define KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE	(1ULL << 7)
+
 #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
 
 struct kvm_create_guest_memfd {
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 5ea75b595667..75a784653e31 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -5,6 +5,7 @@ config HEKI
 	bool "H
[RFC PATCH v2 11/19] KVM: x86: Add new hypercall to set EPT permissions
From: Madhavan T. Venkataraman

Add a new KVM_HC_PROTECT_MEMORY hypercall that enables a guest to set
EPT permissions for guest pages.

Until now, all of the guest pages (except Page Tracked pages) are given
RWX permissions in the EPT. In Heki, we want to restrict the permissions
to what is strictly needed. For instance, a text page only needs R_X. A
read-only data page only needs R__. A normal data page only needs RW_.

The guest will pass a page list to the hypercall. The page list is a
list of one or more physical pages, each of which contains an array of
guest ranges and attributes. Currently, the attributes only contain
permissions. In the future, other attributes may be added.

The hypervisor will apply the specified permissions in the EPT.

When a guest tries to access its memory in a way that is not allowed,
KVM creates a synthetic kernel page fault. This fault should be handled
by the guest, which is not currently the case, so the guest retries the
access again and again. Handling it will be part of a follow-up patch
series.

When enabled, KASAN reveals a bug in the memory attributes patches. We
have not found the source of this issue yet.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Mickaël Salaün
Signed-off-by: Mickaël Salaün
Signed-off-by: Madhavan T. Venkataraman
---

Changes since v1:

The original hypercall contained support for statically defined sections
(text, rodata, etc). It has been redesigned like this:

- The previous version accepted an array of physically contiguous
  ranges. This is appropriate for statically defined sections which are
  loaded in contiguous memory. But, for other cases like module loading,
  the pages would be discontiguous. The current version of the hypercall
  accepts a page list to fix this.

- The previous version passed permission combinations. E.g.,
  HEKI_MEM_ATTR_EXEC would imply R_X. The current version passes
  permissions as memory attributes and each of the permissions must be
  separately specified. E.g., for text, (MEM_ATTR_READ | MEM_ATTR_EXEC)
  must be passed.

- The previous version locked down the permissions for guest pages so
  that once the permissions are set, they cannot be changed. In this
  version, permissions can be changed dynamically, except when
  MEM_ATTR_IMMUTABLE is set. So, the hypercall has been renamed from
  KVM_HC_LOCK_MEM_PAGE_RANGES to KVM_HC_PROTECT_MEMORY.

The dynamic setting of permissions is needed by the following features
(probably not a complete list):
- Kprobes and Optprobes
- Static call optimization
- Jump Label optimization
- Ftrace and Livepatch
- Module loading and unloading
- eBPF JIT
- Kexec
- Kgdb

Examples:
- A text page can be made writable very briefly to install a probe or a
  trace.
- eBPF JIT can populate a writable page with code and make it
  read-execute.
- Module load can load read-only data into a writable page and make the
  page read-only.
- When pages are unmapped, their permissions in the EPT must revert to
  read-write.
---
 Documentation/virt/kvm/x86/hypercalls.rst |  14 +++
 arch/x86/kvm/mmu/mmu.c                    |  77 +
 arch/x86/kvm/mmu/paging_tmpl.h            |   3 +
 arch/x86/kvm/mmu/spte.c                   |  15 ++-
 arch/x86/kvm/x86.c                        | 130 ++
 include/linux/heki.h                      |  29 +
 include/uapi/linux/kvm_para.h             |   1 +
 7 files changed, 267 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 3178576f4c47..28865d111773 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -207,3 +207,17 @@ The hypercall lets a guest request control register flags to be pinned for
 itself.
 
 Returns 0 on success or a KVM error code otherwise.
+
+10. KVM_HC_PROTECT_MEMORY
+-------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Request permissions to be set in EPT
+
+- a0: physical address of a struct heki_page_list
+
+The hypercall lets a guest request memory permissions to be set for a list
+of physical pages.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2024ff21d036..2d09bcc35462 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -47,9 +47,11 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -4446,6 +4448,75 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
+static bool mem_attr_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	unsigned long perm;
+	bool noexec, nowrite;
+
+	if (unlikely(fa
[RFC PATCH v2 04/19] heki: Lock guest control registers at the end of guest kernel init
The hypervisor needs to provide some functions to support Heki. These
form the Heki-Hypervisor API.

Define a heki_hypervisor structure to house the API functions. A
hypervisor that supports Heki must instantiate a heki_hypervisor
structure and pass it to the Heki common code. This allows the common
code to access these functions in a hypervisor-agnostic way.

The first function that is implemented is lock_crs() (lock control
registers). That is, certain flags in the control registers are pinned
so that they can never be changed for the lifetime of the guest.

Implement Heki support in the guest:

- Each supported hypervisor in x86 implements a set of functions for the
  guest kernel. Add an init_heki() function to that set. This function
  initializes Heki-related state. Call init_heki() for the detected
  hypervisor in init_hypervisor_platform().

- Implement init_heki() for the guest.

- Implement kvm_lock_crs() in the guest to lock down control registers.
  This function calls a KVM hypercall to do the job.

- Instantiate a heki_hypervisor structure that contains a pointer to
  kvm_lock_crs().

- Pass the heki_hypervisor structure to Heki common code in init_heki().

Implement a heki_late_init() function and call it at the end of kernel
init. This function calls lock_crs(). In other words, control registers
of a guest are locked down at the end of guest kernel init.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Co-developed-by: Madhavan T. Venkataraman
Signed-off-by: Madhavan T. Venkataraman
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* Shrunk the patch to only manage the CR pinning.
---
 arch/x86/include/asm/x86_init.h  |  1 +
 arch/x86/kernel/cpu/hypervisor.c |  1 +
 arch/x86/kernel/kvm.c            | 56
 arch/x86/kvm/Kconfig             |  1 +
 include/linux/heki.h             | 22 +
 init/main.c                      |  1 +
 virt/heki/Kconfig                |  9 -
 virt/heki/main.c                 | 25 ++
 8 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5240d88db52a..ff4dfd2f615e 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -127,6 +127,7 @@ struct x86_hyper_init {
 	bool (*msi_ext_dest_id)(void);
 	void (*init_mem_mapping)(void);
 	void (*init_after_bootmem)(void);
+	void (*init_heki)(void);
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 553bfbfc3a1b..6085c8129e0c 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void)
 
 	x86_hyper_type = h->type;
 	x86_init.hyper.init_platform();
+	x86_init.hyper.init_heki();
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ab9ee5896c..8349f4ad3bbd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -997,6 +998,60 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_HEKI
+
+extern unsigned long cr4_pinned_mask;
+
+/*
+ * TODO: Check SMP policy consistency, e.g. with
+ * this_cpu_read(cpu_tlbstate.cr4)
+ */
+static int kvm_lock_crs(void)
+{
+	unsigned long cr4;
+	int err;
+
+	err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP, 0);
+	if (err)
+		return err;
+
+	cr4 = __read_cr4();
+	err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask,
+			     0);
+	return err;
+}
+
+static struct heki_hypervisor kvm_heki_hypervisor = {
+	.lock_crs = kvm_lock_crs,
+};
+
+static void kvm_init_heki(void)
+{
+	long err;
+
+	if (!kvm_para_available()) {
+		/* Cannot make KVM hypercalls. */
+		return;
+	}
+
+	err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, 0,
+			     KVM_LOCK_CR_UPDATE_VERSION);
+	if (err < 1) {
+		/* Ignores host not supporting at least the first version. */
+		return;
+	}
+
+	heki.hypervisor = &kvm_heki_hypervisor;
+}
+
+#else /* CONFIG_HEKI */
+
+static void kvm_init_heki(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.name = "KVM",
 	.detect = kvm_detect,
@@ -1005,6 +1060,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.init.x2apic_available = kvm_para_available,
 	.init.msi_ext_dest_id = kvm_msi_ext_dest_id,
 	.init.init_platform
[RFC PATCH v2 03/19] KVM: x86: Add notifications for Heki policy configuration and violation
Add an interface for user space to be notified about guests' Heki policy and related violations.

Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as its first argument, which can contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The returned value is the bitmask of known Heki exit reasons, for now: KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.

If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each KVM_HC_LOCK_CR_UPDATE hypercall according to the requested control register. This makes the VMM aware of the restrictions the guest imposes on itself.

If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each pinned CR violation. This enables the VMM to react to a policy violation.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* New patch. Making user space aware of Heki properties was requested by Sean Christopherson.
--- arch/x86/kvm/vmx/vmx.c | 5 +- arch/x86/kvm/x86.c | 114 +++ arch/x86/kvm/x86.h | 7 +-- include/linux/kvm_host.h | 2 + include/uapi/linux/kvm.h | 22 5 files changed, 136 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f487bf16dd96..b631b1d7ba30 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5444,6 +5444,7 @@ static int handle_cr(struct kvm_vcpu *vcpu) int reg; int err; int ret; + bool exit = false; exit_qualification = vmx_get_exit_qual(vcpu); cr = exit_qualification & 15; @@ -5453,8 +5454,8 @@ static int handle_cr(struct kvm_vcpu *vcpu) val = kvm_register_read(vcpu, reg); trace_kvm_cr_write(cr, val); - ret = heki_check_cr(vcpu, cr, val); - if (ret) + ret = heki_check_cr(vcpu, cr, val, ); + if (exit) return ret; switch (cr) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4e6c4c21f12c..43c28a6953bf 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -119,6 +119,10 @@ static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS; #define KVM_CAP_PMU_VALID_MASK KVM_PMU_CAP_DISABLE +#define KVM_HEKI_EXIT_REASON_VALID_MASK ( \ + KVM_HEKI_EXIT_REASON_CR0 | \ + KVM_HEKI_EXIT_REASON_CR4) + #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \ KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK) @@ -4644,6 +4648,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM)) r |= BIT(KVM_X86_SW_PROTECTED_VM); break; + case KVM_CAP_HEKI_CONFIGURE: + case KVM_CAP_HEKI_DENIAL: + r = KVM_HEKI_EXIT_REASON_VALID_MASK; + break; default: break; } @@ -6518,6 +6526,22 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } mutex_unlock(>lock); break; +#ifdef CONFIG_HEKI + case KVM_CAP_HEKI_CONFIGURE: + r = -EINVAL; + if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK) + break; + kvm->heki_configure_exit_reason = cap->args[0]; + r = 0; + break; + case KVM_CAP_HEKI_DENIAL: + r = -EINVAL; + if (cap->args[0] & 
~KVM_HEKI_EXIT_REASON_VALID_MASK) + break; + kvm->heki_denial_exit_reason = cap->args[0]; + r = 0; + break; +#endif default: r = -EINVAL; break; @@ -8056,11 +8080,60 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr) #ifdef CONFIG_HEKI +static int complete_heki_configure_exit(struct kvm_vcpu *const vcpu) +{ + kvm_rax_write(vcpu, 0); + ++vcpu->stat.hypercalls; + return kvm_skip_emulated_instruction(vcpu); +} + +static int complete_heki_denial_exit(struct kvm_vcpu *const vcpu) +{ + kvm_inject_gp(vcpu, 0); + return 1; +} + +/* Returns true if the @exit_reason is handled by @vcpu->kvm. */ +static bool heki_exit_cr(struct kvm_vcpu *const vcpu, const __u32 exit_reason, +const u64 heki_reason, unsigned long value) +{ + switch (exit_reason) { + case KVM_EXIT_HEKI_CONFIGURE: + if (!(vcpu->kvm->heki_configure_exit_reason & heki_reason)) + return false; + + vcpu->run->heki_configure.reason = heki_reason; +
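The capability handling in the patch above boils down to a mask check followed by a per-VM store. The following is a user-space sketch of that logic only. The bit positions of the two exit reasons are assumptions (the quoted diff names the flags but does not show their values), and struct vm_model is a hypothetical stand-in for the two fields added to struct kvm.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values; the uapi definitions are not visible in this excerpt. */
#define KVM_HEKI_EXIT_REASON_CR0 (1ULL << 0)
#define KVM_HEKI_EXIT_REASON_CR4 (1ULL << 1)

#define KVM_HEKI_EXIT_REASON_VALID_MASK ( \
	KVM_HEKI_EXIT_REASON_CR0 | \
	KVM_HEKI_EXIT_REASON_CR4)

/* Hypothetical model of the two per-VM fields added by the patch. */
struct vm_model {
	uint64_t heki_configure_exit_reason;
	uint64_t heki_denial_exit_reason;
};

/* Reject unknown bits, otherwise record which reasons user space wants. */
static int enable_heki_cap(uint64_t *slot, uint64_t requested)
{
	if (requested & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
		return -EINVAL;

	*slot = requested;
	return 0;
}

/* True if an event with @reason should be forwarded to user space. */
static bool heki_wants_exit(uint64_t enabled, uint64_t reason)
{
	return (enabled & reason) != 0;
}
```

A VMM that only enables KVM_HEKI_EXIT_REASON_CR4 for the configure capability would then see exits for CR4 pinning requests but not for CR0 ones, which matches the early "return false" in heki_exit_cr().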
[RFC PATCH v2 05/19] KVM: VMX: Add MBEC support
This change adds support for VMX_FEATURE_MODE_BASED_EPT_EXEC (named ept_mode_based_exec in /proc/cpuinfo and MBEC elsewhere), which makes it possible to separate EPT execution bits for supervisor vs. user. It changes the semantics of VMX_EPT_EXECUTABLE_MASK from global execution to kernel execution, and uses the VMX_EPT_USER_EXECUTABLE_MASK bit to identify user execution.

The main use case is to be able to restrict kernel execution while ignoring user space execution from the hypervisor point of view. Indeed, user space execution can already be restricted by the guest kernel.

This change enables MBEC but doesn't change the default configuration, which is to allow execution for all guest memory. However, the next commit leverages MBEC to restrict kernel memory pages.

MBEC can be configured with the new "enable_mbec" module parameter, set to true by default. However, MBEC is disabled for L1 and L2 for now.

The MMU tracepoints are updated to reflect the difference between kernel and user space executions, see is_executable_pte().

Replace EPT_VIOLATION_RWX_MASK (3 bits) with 4 dedicated EPT_VIOLATION_READ, EPT_VIOLATION_WRITE, EPT_VIOLATION_KERNEL_INSTR, and EPT_VIOLATION_USER_INSTR bits.

From the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3C (System Programming Guide), Part 3:

SECONDARY_EXEC_MODE_BASED_EPT_EXEC (bit 22):
If either the "unrestricted guest" VM-execution control or the "mode-based execute control for EPT" VM-execution control is 1, the "enable EPT" VM-execution control must also be 1.

EPT_VIOLATION_KERNEL_INSTR_BIT (bit 5):
The logical-AND of bit 2 in the EPT paging-structure entries used to translate the guest-physical address of the access causing the EPT violation. If the "mode-based execute control for EPT" VM-execution control is 0, this indicates whether the guest-physical address was executable. If that control is 1, this indicates whether the guest-physical address was executable for supervisor-mode linear addresses.

EPT_VIOLATION_USER_INSTR_BIT (bit 6):
If the "mode-based execute control" VM-execution control is 0, the value of this bit is undefined. If that control is 1, this bit is the logical-AND of bit 10 in the EPT paging-structures entries used to translate the guest-physical address of the access causing the EPT violation. In this case, it indicates whether the guest-physical address was executable for user-mode linear addresses.

PT_USER_EXEC_MASK (bit 10):
Execute access for user-mode linear addresses. If the "mode-based execute control for EPT" VM-execution control is 1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 512-GByte region controlled by this entry. If that control is 0, this bit is ignored.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* Import the MMU tracepoint changes from the v1's "Enable guests to lock themselves thanks to MBEC" patch.
--- arch/x86/include/asm/vmx.h | 11 +-- arch/x86/kvm/mmu.h | 3 ++- arch/x86/kvm/mmu/mmu.c | 8 ++-- arch/x86/kvm/mmu/mmutrace.h | 11 +++ arch/x86/kvm/mmu/paging_tmpl.h | 16 ++-- arch/x86/kvm/mmu/spte.c | 4 +++- arch/x86/kvm/mmu/spte.h | 15 +-- arch/x86/kvm/vmx/capabilities.h | 7 +++ arch/x86/kvm/vmx/nested.c | 7 +++ arch/x86/kvm/vmx/vmx.c | 29 ++--- arch/x86/kvm/vmx/vmx.h | 1 + 11 files changed, 95 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 0e73616b82f3..7fd390484b36 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -513,6 +513,7 @@ enum vmcs_field { #define VMX_EPT_IPAT_BIT (1ull << 6) #define VMX_EPT_ACCESS_BIT (1ull << 8) #define VMX_EPT_DIRTY_BIT (1ull << 9) +#define VMX_EPT_USER_EXECUTABLE_MASK (1ull << 10) #define VMX_EPT_RWX_MASK(VMX_EPT_READABLE_MASK | \ VMX_EPT_WRITABLE_MASK | \ VMX_EPT_EXECUTABLE_MASK) @@ -558,13 +559,19 @@ enum vm_entry_failure_code { #define EPT_VIOLATION_ACC_READ_BIT 0 #define EPT_VIOLATION_ACC_WRITE_BIT1 #define EPT_VIOLATION_ACC_INSTR_BIT2 -#define EPT_VIOLATION_RWX_SHIFT3 +#define EPT_VIOLATION_READ_BIT 3 +#define EPT_VIOLATION_WRITE_BIT4 +#define EPT_VIOLATION_KERNEL_INSTR_BIT 5 +#define EPT_VIOLATION_USER_INSTR_BIT 6 #define EPT_VIOLATION_GVA_IS_VALID_BIT 7 #define
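To make the MBEC semantics quoted from the SDM concrete, here is a small user-space sketch of how the two execute bits of a single EPT entry decide an instruction fetch. It only models one entry, not a full walk (the hardware ANDs the bits across all levels), and the helper name ept_fetch_allowed() is hypothetical. The bit positions (2 and 10) are the ones given in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 2: execute (supervisor-mode execute when MBEC is enabled). */
#define VMX_EPT_EXECUTABLE_MASK      (1ULL << 2)
/* Bit 10: execute for user-mode linear addresses, only used with MBEC. */
#define VMX_EPT_USER_EXECUTABLE_MASK (1ULL << 10)

/*
 * Hypothetical helper: may an instruction fetch proceed for this entry?
 * @user_mode: the fetch comes from a user-mode linear address.
 * @mbec:      the "mode-based execute control for EPT" control is 1.
 */
static bool ept_fetch_allowed(uint64_t entry, bool user_mode, bool mbec)
{
	if (mbec && user_mode)
		return (entry & VMX_EPT_USER_EXECUTABLE_MASK) != 0;

	/* Without MBEC, bit 10 is ignored and bit 2 covers every fetch. */
	return (entry & VMX_EPT_EXECUTABLE_MASK) != 0;
}
```

An entry with only bit 2 set is therefore executable by the guest kernel but not by guest user space once MBEC is on, while the same entry is executable by everyone when MBEC is off.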
[RFC PATCH v2 16/19] heki: x86: Update permissions counters when guest page permissions change
From: Madhavan T. Venkataraman When permissions are changed on an existing mapping, update the permissions counters. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Mickaël Salaün Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Madhavan T. Venkataraman --- Changes since v1: * New patch --- arch/x86/mm/heki.c | 9 +++ arch/x86/mm/pat/set_memory.c | 51 include/linux/heki.h | 14 ++ virt/heki/counters.c | 23 4 files changed, 97 insertions(+) diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c index c495df0d8772..c0eace9e343f 100644 --- a/arch/x86/mm/heki.c +++ b/arch/x86/mm/heki.c @@ -54,3 +54,12 @@ unsigned long heki_flags_to_permissions(unsigned long flags) return permissions; } + +void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set, + unsigned long *clear) +{ + if (pgprot_val(prot) & _PAGE_RW) + *set |= MEM_ATTR_WRITE; + if (pgprot_val(prot) & _PAGE_NX) + *clear |= MEM_ATTR_EXEC; +} diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index bda9f129835e..6aaa1ce5692c 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -2056,11 +2057,56 @@ int clear_mce_nospec(unsigned long pfn) EXPORT_SYMBOL_GPL(clear_mce_nospec); #endif /* CONFIG_X86_64 */ +#ifdef CONFIG_HEKI + +static void heki_change_page_attr_set(unsigned long va, int numpages, + pgprot_t set) +{ + unsigned long va_end; + unsigned long set_permissions = 0, clear_permissions = 0; + + heki_pgprot_to_permissions(set, _permissions, _permissions); + if (!(set_permissions | clear_permissions)) + return; + + va_end = va + (numpages << PAGE_SHIFT); + heki_update(va, va_end, set_permissions, clear_permissions); +} + +static void heki_change_page_attr_clear(unsigned long va, int numpages, + pgprot_t clear) +{ + unsigned long va_end; + unsigned long 
set_permissions = 0, clear_permissions = 0; + + heki_pgprot_to_permissions(clear, _permissions, _permissions); + if (!(set_permissions | clear_permissions)) + return; + + va_end = va + (numpages << PAGE_SHIFT); + heki_update(va, va_end, set_permissions, clear_permissions); +} + +#else /* !CONFIG_HEKI */ + +static void heki_change_page_attr_set(unsigned long va, int numpages, + pgprot_t set) +{ +} + +static void heki_change_page_attr_clear(unsigned long va, int numpages, + pgprot_t clear) +{ +} + +#endif /* CONFIG_HEKI */ + int set_memory_x(unsigned long addr, int numpages) { if (!(__supported_pte_mask & _PAGE_NX)) return 0; + heki_change_page_attr_clear(addr, numpages, __pgprot(_PAGE_NX)); return change_page_attr_clear(, numpages, __pgprot(_PAGE_NX), 0); } @@ -2069,11 +2115,14 @@ int set_memory_nx(unsigned long addr, int numpages) if (!(__supported_pte_mask & _PAGE_NX)) return 0; + heki_change_page_attr_set(addr, numpages, __pgprot(_PAGE_NX)); return change_page_attr_set(, numpages, __pgprot(_PAGE_NX), 0); } int set_memory_ro(unsigned long addr, int numpages) { + // TODO: What about _PAGE_DIRTY? 
+ heki_change_page_attr_clear(addr, numpages, __pgprot(_PAGE_RW)); return change_page_attr_clear(, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0); } @@ -2084,11 +2133,13 @@ int set_memory_rox(unsigned long addr, int numpages) if (__supported_pte_mask & _PAGE_NX) clr.pgprot |= _PAGE_NX; + heki_change_page_attr_clear(addr, numpages, clr); return change_page_attr_clear(, numpages, clr, 0); } int set_memory_rw(unsigned long addr, int numpages) { + heki_change_page_attr_set(addr, numpages, __pgprot(_PAGE_RW)); return change_page_attr_set(, numpages, __pgprot(_PAGE_RW), 0); } diff --git a/include/linux/heki.h b/include/linux/heki.h index d660994d34d0..079b34af07f0 100644 --- a/include/linux/heki.h +++ b/include/linux/heki.h @@ -73,6 +73,7 @@ struct heki_hypervisor { * * - a page is mapped into the kernel address space * - a page is unmapped from the kernel address space + * - permissions are changed for a mapped page */ struct heki { struct heki_hypervisor *hypervisor; @@ -81,6 +82,7 @@ struct heki { enum heki_cmd { HEKI_MAP, + HEKI_UPDATE, HEKI_UNMAP, }; @@ -98,6 +100,10 @@ struct heki_args { /* Command passed by caller. */ enum heki_c
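As a reading aid for the set_memory_*() hooks above, the sketch below models how a pgprot change could be turned into a "set" and a "clear" permission delta. The body of pgprot_to_permissions() follows heki_pgprot_to_permissions() from the patch. The MEM_ATTR_* values are assumptions, and so is the swap of the two output pointers on the attribute-clear path: the quoted diff lost the argument names, so this is one plausible reading rather than the patch's confirmed behavior.

```c
#include <stdint.h>

/* x86 PTE bits (architectural positions). */
#define PAGE_RW (1ULL << 1)
#define PAGE_NX (1ULL << 63)

/* Assumed values; the series defines MEM_ATTR_* outside this excerpt. */
#define MEM_ATTR_READ  (1UL << 0)
#define MEM_ATTR_WRITE (1UL << 1)
#define MEM_ATTR_EXEC  (1UL << 2)

/* Same logic as heki_pgprot_to_permissions() in the patch. */
static void pgprot_to_permissions(uint64_t prot, unsigned long *set,
				  unsigned long *clear)
{
	if (prot & PAGE_RW)
		*set |= MEM_ATTR_WRITE;
	if (prot & PAGE_NX)
		*clear |= MEM_ATTR_EXEC;
}

/* PTE bits are being set: RW grants write, NX revokes execute. */
static void attr_set_delta(uint64_t prot, unsigned long *set,
			   unsigned long *clear)
{
	pgprot_to_permissions(prot, set, clear);
}

/* PTE bits are being cleared: the roles are reversed (assumption). */
static void attr_clear_delta(uint64_t prot, unsigned long *set,
			     unsigned long *clear)
{
	pgprot_to_permissions(prot, clear, set);
}
```

Under this reading, set_memory_ro() (clearing _PAGE_RW) yields a delta that revokes write, and set_memory_x() (clearing _PAGE_NX) yields a delta that grants execute.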
[RFC PATCH v2 14/19] heki: x86: Initialize permissions counters for pages mapped into KVA
From: Madhavan T. Venkataraman Define a permissions counters structure that contains a counter for read, write and execute. Each mapped guest page will be allocated a permissions counters structure. During kernel boot, walk the kernel address space, locate all the mappings, create permissions counters for each mapped guest page and update the counters to reflect the collective permissions for each page across all of its mappings. The collective permissions will be applied in the EPT in a following commit. We might want to move these counters to a safer place (e.g., KVM) to protect them from tampering by the guest kernel itself. We should note that walking through all mappings might be slow if KASAN is enabled. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Mickaël Salaün Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Suggested-by: Mickaël Salaün Signed-off-by: Madhavan T. 
Venkataraman --- Changes since v1: * New patch and new files: arch/x86/mm/heki.c and virt/heki/counters.c --- arch/x86/mm/Makefile | 2 + arch/x86/mm/heki.c | 56 + include/linux/heki.h | 32 ++ virt/heki/Kconfig| 2 + virt/heki/Makefile | 1 + virt/heki/counters.c | 147 +++ virt/heki/main.c | 13 7 files changed, 253 insertions(+) create mode 100644 arch/x86/mm/heki.c create mode 100644 virt/heki/counters.c diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index c80febc44cd2..2998eaac0dbb 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -67,3 +67,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o + +obj-$(CONFIG_HEKI) += heki.o diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c new file mode 100644 index ..c495df0d8772 --- /dev/null +++ b/arch/x86/mm/heki.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Hypervisor Enforced Kernel Integrity (Heki) - Arch specific. + * + * Copyright © 2023 Microsoft Corporation + */ + +#include +#include + +#ifdef pr_fmt +#undef pr_fmt +#endif + +#define pr_fmt(fmt) "heki-guest: " fmt + +static unsigned long kernel_va; +static unsigned long kernel_end; +static unsigned long direct_map_va; +static unsigned long direct_map_end; + +__init void heki_arch_early_init(void) +{ + /* Kernel virtual address space range, not yet compatible with KASLR. */ + if (pgtable_l5_enabled()) { + kernel_va = 0xff00UL; + kernel_end = 0xffe0UL; + direct_map_va = 0xff11UL; + direct_map_end = 0xff91UL; + } else { + kernel_va = 0x8000UL; + kernel_end = 0xffe0UL; + direct_map_va = 0x8880UL; + direct_map_end = 0xc880UL; + } + + /* +* Initialize the counters for all existing kernel mappings except +* for direct map. 
+*/ + heki_map(kernel_va, direct_map_va); + heki_map(direct_map_end, kernel_end); +} + +unsigned long heki_flags_to_permissions(unsigned long flags) +{ + unsigned long permissions; + + permissions = MEM_ATTR_READ | MEM_ATTR_EXEC; + if (flags & _PAGE_RW) + permissions |= MEM_ATTR_WRITE; + if (flags & _PAGE_NX) + permissions &= ~MEM_ATTR_EXEC; + + return permissions; +} diff --git a/include/linux/heki.h b/include/linux/heki.h index a7ae0b387dfe..86c787d121e0 100644 --- a/include/linux/heki.h +++ b/include/linux/heki.h @@ -19,6 +19,16 @@ #ifdef CONFIG_HEKI +/* + * This structure keeps track of the collective permissions for a guest page + * across all of its mappings. + */ +struct heki_counters { + int read; + int write; + int execute; +}; + /* * This structure contains a guest physical range and its permissions (RWX). */ @@ -56,9 +66,17 @@ struct heki_hypervisor { /* * If the active hypervisor supports Heki, it will plug its heki_hypervisor * pointer into this heki structure. + * + * During guest kernel boot, permissions counters for each guest page are + * initialized based on the page's current permissions. */ struct heki { struct heki_hypervisor *hypervisor; + struct mem_table *counters; +}; + +enum heki_cmd { + HEKI_MAP, }; /* @@ -72,6 +90,9 @@ struct heki_args { phys_addr_t pa; size_t size; unsigned long flags; + + /* Command passed by caller. */ + enum heki_cmd cmd; }; /* Callback function called by the table walker. */ @@ -84,6 +105,14 @@ extern bool __read_mostly enable_mbec; void heki_early_init(void); void heki_late_init(void); +void heki_counters_init(
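The idea of "collective permissions across all mappings" from the commit message above can be sketched in user space as follows. flags_to_permissions() follows heki_flags_to_permissions() from the patch and struct heki_counters has the same three fields. The MEM_ATTR_* values and the two helpers counters_apply() and counters_to_permissions() are hypothetical, and the rule "a permission is held while its counter is positive" is an assumption about how the counters are meant to be read.

```c
#include <stdint.h>

/* x86 PTE bits (architectural positions). */
#define PAGE_RW (1ULL << 1)
#define PAGE_NX (1ULL << 63)

/* Assumed values; the series defines MEM_ATTR_* outside this excerpt. */
#define MEM_ATTR_READ  (1UL << 0)
#define MEM_ATTR_WRITE (1UL << 1)
#define MEM_ATTR_EXEC  (1UL << 2)

/* Same fields as struct heki_counters in the patch. */
struct heki_counters {
	int read;
	int write;
	int execute;
};

/* Same logic as heki_flags_to_permissions() in the patch. */
static unsigned long flags_to_permissions(uint64_t flags)
{
	unsigned long permissions = MEM_ATTR_READ | MEM_ATTR_EXEC;

	if (flags & PAGE_RW)
		permissions |= MEM_ATTR_WRITE;
	if (flags & PAGE_NX)
		permissions &= ~MEM_ATTR_EXEC;

	return permissions;
}

/* Hypothetical: account one mapping (delta = 1) or unmapping (delta = -1). */
static void counters_apply(struct heki_counters *c, unsigned long perms,
			   int delta)
{
	if (perms & MEM_ATTR_READ)
		c->read += delta;
	if (perms & MEM_ATTR_WRITE)
		c->write += delta;
	if (perms & MEM_ATTR_EXEC)
		c->execute += delta;
}

/* Hypothetical: a permission is held while at least one mapping grants it. */
static unsigned long counters_to_permissions(const struct heki_counters *c)
{
	unsigned long perms = 0;

	if (c->read > 0)
		perms |= MEM_ATTR_READ;
	if (c->write > 0)
		perms |= MEM_ATTR_WRITE;
	if (c->execute > 0)
		perms |= MEM_ATTR_EXEC;

	return perms;
}
```

For a page that has a read-only alias and a writable alias, the collective result is read plus write until the writable alias goes away, at which point only read remains.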
[RFC PATCH v2 06/19] KVM: x86: Add kvm_x86_ops.fault_gva()
This function is needed for kvm_mmu_page_fault() to create synthetic page faults. Code originally written by Mihai Donțu and Nicușor Cîțu: https://lore.kernel.org/r/20211006173113.26445-18-ala...@bitdefender.com Renamed fault_gla() to fault_gva() and use the new EPT_VIOLATION_GVA_IS_VALID. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Co-developed-by: Mihai Donțu Signed-off-by: Mihai Donțu Co-developed-by: Nicușor Cîțu Signed-off-by: Nicușor Cîțu Signed-off-by: Mickaël Salaün --- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/kvm/svm/svm.c | 9 + arch/x86/kvm/vmx/vmx.c | 10 ++ 4 files changed, 22 insertions(+) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index e3054e3e46d5..ba3db679db2b 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed) KVM_X86_OP(complete_emulated_msr) KVM_X86_OP(vcpu_deliver_sipi_vector) KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons); +KVM_X86_OP(fault_gva) #undef KVM_X86_OP #undef KVM_X86_OP_OPTIONAL diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dff10051e9b6..0415dacd4b28 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1750,6 +1750,8 @@ struct kvm_x86_ops { * Returns vCPU specific APICv inhibit reasons */ unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu); + + u64 (*fault_gva)(struct kvm_vcpu *vcpu); }; struct kvm_x86_nested_ops { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index beea99c8e8e0..d32517a2cf9c 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4906,6 +4906,13 @@ static int svm_vm_init(struct kvm *kvm) return 0; } +static u64 svm_fault_gva(struct kvm_vcpu *vcpu) +{ 
+ const struct vcpu_svm *svm = to_svm(vcpu); + + return svm->vcpu.arch.cr2 ? svm->vcpu.arch.cr2 : ~0ull; +} + static struct kvm_x86_ops svm_x86_ops __initdata = { .name = KBUILD_MODNAME, @@ -5037,6 +5044,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector, .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons, + + .fault_gva = svm_fault_gva, }; /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 1b1581f578b0..a8158bc1dda9 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8233,6 +8233,14 @@ static void vmx_vm_destroy(struct kvm *kvm) free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm)); } +static u64 vmx_fault_gva(struct kvm_vcpu *vcpu) +{ + if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_IS_VALID) + return vmcs_readl(GUEST_LINEAR_ADDRESS); + + return ~0ull; +} + static struct kvm_x86_ops vmx_x86_ops __initdata = { .name = KBUILD_MODNAME, @@ -8373,6 +8381,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .complete_emulated_msr = kvm_complete_insn_gp, .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector, + + .fault_gva = vmx_fault_gva, }; static unsigned int vmx_handle_intel_pt_intr(void) -- 2.42.1
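Both fault_gva() implementations above share one convention: ~0ull means "no guest virtual address is known". A minimal user-space sketch of that convention follows. The functions take plain integers instead of a vCPU, so the parameter names are hypothetical. The bit position of the validity flag is taken from the EPT_VIOLATION_GVA_IS_VALID_BIT (7) define quoted earlier in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 7 of the EPT violation exit qualification. */
#define EPT_VIOLATION_GVA_IS_VALID (1ULL << 7)
/* Sentinel used by both backends when no GVA is available. */
#define INVALID_GVA (~0ULL)

/* VMX flavor: trust the guest linear address only if the CPU says so. */
static uint64_t fault_gva_vmx(uint64_t exit_qualification,
			      uint64_t guest_linear_address)
{
	if (exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
		return guest_linear_address;

	return INVALID_GVA;
}

/* SVM flavor: fall back to the sentinel when CR2 is zero. */
static uint64_t fault_gva_svm(uint64_t cr2)
{
	return cr2 ? cr2 : INVALID_GVA;
}

/* Callers are expected to test for the sentinel before using the value. */
static bool gva_is_known(uint64_t gva)
{
	return gva != INVALID_GVA;
}
```

One consequence worth noting is that the SVM flavor cannot distinguish a genuine fault at address 0 from "unknown", since both map to the sentinel.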
[RFC PATCH v2 07/19] KVM: x86: Make memory attribute helpers more generic
To make it useful for other use cases such as Heki, remove the private memory optimizations. I guess we could try to infer the applied attributes to get back these optimizations when it makes sense, but let's keep this simple for now. Main changes: - Replace slots_lock with slots_arch_lock to make it callable from a KVM hypercall. - Move this mutex lock into kvm_vm_ioctl_set_mem_attributes() to make it easier to use with other locks. - Export kvm_vm_set_mem_attributes(). - Remove the kvm_arch_pre_set_memory_attributes() and kvm_arch_post_set_memory_attributes() KVM_MEMORY_ATTRIBUTE_PRIVATE optimizations. Cc: Chao Peng Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Sean Christopherson Cc: Yu Zhang Signed-off-by: Mickaël Salaün --- Changes since v1: * New patch --- arch/x86/kvm/mmu/mmu.c | 23 --- include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 19 ++- 3 files changed, 12 insertions(+), 32 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 7e053973125c..4d378d308762 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7251,20 +7251,6 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm) bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range) { - /* -* Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only -* supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM -* can simply ignore such slots. But if userspace is making memory -* PRIVATE, then KVM must prevent the guest from accessing the memory -* as shared. And if userspace is making memory SHARED and this point -* is reached, then at least one page within the range was previously -* PRIVATE, i.e. the slot's possible hugepage ranges are changing. -* Zapping SPTEs in this case ensures KVM will reassess whether or not -* a hugepage can be used for affected ranges. 
-*/ - if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) - return false; - return kvm_unmap_gfn_range(kvm, range); } @@ -7313,15 +7299,6 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, lockdep_assert_held_write(>mmu_lock); lockdep_assert_held(>slots_lock); - /* -* Calculate which ranges can be mapped with hugepages even if the slot -* can't map memory PRIVATE. KVM mustn't create a SHARED hugepage over -* a range that has PRIVATE GFNs, and conversely converting a range to -* SHARED may now allow hugepages. -*/ - if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) - return false; - /* * The sequence matters here: upper levels consume the result of lower * level's scanning. diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ec32af17add8..85b8648fd892 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2396,6 +2396,8 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range); +int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, + unsigned long attributes); static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 23633984142f..0096ccfbb609 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2552,7 +2552,7 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm, } /* Set @attributes for the gfn range [@start, @end). */ -static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, +int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, unsigned long attributes) { struct kvm_mmu_notifier_range pre_set_range = { @@ -2577,11 +2577,11 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, entry = attributes ? 
xa_mk_value(attributes) : NULL; - mutex_lock(>slots_lock); + lockdep_assert_held(>slots_arch_lock); /* Nothing to do if the entire range as the desired attributes. */ if (kvm_range_has_memory_attributes(kvm, start, end, attributes)) - goto out_unlock; + return r; /* * Reserve memory ahead of time to avoid having to deal with failures @@ -2590,7 +2590,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, for (i = start; i < end; i++) { r = xa_reserve(>mem_attr_array, i, GFP_KERNEL_ACCOUNT); if (r) - goto out_unlock; + return r; } kvm_handle_gfn_ran
[RFC PATCH v2 02/19] KVM: x86: Add new hypercall to lock control registers
This enables guests to lock their CR0 and CR4 registers with a subset of X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes three arguments. The first identifies the control register, the second is a bit mask to pin (i.e. mark as read-only), and the third is for optional flags.

These register flags should already be pinned by Linux guests, but once a guest is compromised, this self-protection mechanism could be disabled, which is not the case with this dedicated hypercall.

Once the CRs are pinned by the guest, if it attempts to change them, then a general protection fault is sent to the guest.

This hypercall may evolve and support new kinds of registers or pinning. The optional KVM_LOCK_CR_UPDATE_VERSION flag enables guests to know the supported abilities by mapping the returned version with the related features.

Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Kees Cook
Cc: Madhavan T. Venkataraman
Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Wanpeng Li
Signed-off-by: Mickaël Salaün
---

Changes since v1:
* Guard KVM_HC_LOCK_CR_UPDATE hypercall with CONFIG_HEKI.
* Move extern cr4_pinned_mask to x86.h (suggested by Kees Cook).
* Move VMX CR checks from vmx_set_cr*() to handle_cr() to make it possible to return to user space (see next commit).
* Change the heki_check_cr()'s first argument to vcpu.
* Don't use -KVM_EPERM in heki_check_cr().
* Generate a fault when the guest requests a denied CR update.
* Add a flags argument to get the version of this hypercall. Being able to do a proper version check was suggested by Wei Liu.
--- Documentation/virt/kvm/x86/hypercalls.rst | 17 + arch/x86/include/uapi/asm/kvm_para.h | 2 + arch/x86/kernel/cpu/common.c | 4 +- arch/x86/kvm/vmx/vmx.c| 5 ++ arch/x86/kvm/x86.c| 84 +++ arch/x86/kvm/x86.h| 22 ++ include/linux/kvm_host.h | 5 ++ include/uapi/linux/kvm_para.h | 1 + 8 files changed, 139 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst index 10db7924720f..3178576f4c47 100644 --- a/Documentation/virt/kvm/x86/hypercalls.rst +++ b/Documentation/virt/kvm/x86/hypercalls.rst @@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL. + +9. KVM_HC_LOCK_CR_UPDATE + + +:Architecture: x86 +:Status: active +:Purpose: Request some control registers to be restricted. + +- a0: identify a control register +- a1: bit mask to make some flags read-only +- a2: optional KVM_LOCK_CR_UPDATE_VERSION flag that will return the version of + this hypercall. Version 1 supports CR0 and CR4 pinning. + +The hypercall lets a guest request control register flags to be pinned for +itself. + +Returns 0 on success or a KVM error code otherwise. 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 6e64b27b2c1e..efc5ccc0060f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -150,4 +150,6 @@ struct kvm_vcpu_pv_apf_data { #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK #define KVM_PV_EOI_DISABLED 0x0 +#define KVM_LOCK_CR_UPDATE_VERSION (1 << 0) + #endif /* _UAPI_ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 4e5ffc8b0e46..f18ee7ce0496 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -400,9 +400,11 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c) } /* These bits should not change their value after CPU init is finished. */ -static const unsigned long cr4_pinned_mask = +const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP | X86_CR4_FSGSBASE | X86_CR4_CET; +EXPORT_SYMBOL_GPL(cr4_pinned_mask); + static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning); static unsigned long cr4_pinned_bits __ro_after_init; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6e502ba93141..f487bf16dd96 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5452,6 +5452,11 @@ static int handle_cr(struct kvm_vcpu *vcpu) case 0: /* mov to cr */ val = kvm_register_read(vcpu, reg); trace_kvm_cr_write(cr, val); + + ret = heki_check_cr(vcpu, cr, val); + if (ret) + return ret; + switch (cr) { case 0: err = handle_set_cr0(vcpu, val); diff --git
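The pinning rule described in the commit message can be modeled in a few lines of user-space C. This is a sketch under assumptions: the body of heki_check_cr() is not part of this excerpt, so struct cr_pins, lock_cr() and cr_write_denied() are hypothetical, -EINVAL stands in for whatever KVM error code the host returns, and "pinned" is taken to mean that a pinned bit must stay set. The CR bit positions are architectural.

```c
#include <errno.h>
#include <stdbool.h>

/* Architectural control register bits. */
#define X86_CR0_WP   (1UL << 16)
#define X86_CR4_UMIP (1UL << 11)
#define X86_CR4_SMEP (1UL << 20)
#define X86_CR4_SMAP (1UL << 21)

/* Hypothetical per-VM record of the bits the guest asked to pin. */
struct cr_pins {
	unsigned long cr0;
	unsigned long cr4;
};

/* Model of the hypercall: pins accumulate and can never be removed. */
static int lock_cr(struct cr_pins *pins, unsigned int cr, unsigned long mask)
{
	switch (cr) {
	case 0:
		pins->cr0 |= mask;
		return 0;
	case 4:
		pins->cr4 |= mask;
		return 0;
	default:
		return -EINVAL;	/* stand-in for a KVM error code */
	}
}

/* True if writing @new_val would drop a pinned bit (the guest gets a #GP). */
static bool cr_write_denied(const struct cr_pins *pins, unsigned int cr,
			    unsigned long new_val)
{
	unsigned long pinned = 0;

	if (cr == 0)
		pinned = pins->cr0;
	else if (cr == 4)
		pinned = pins->cr4;

	return (new_val & pinned) != pinned;
}
```

After pinning SMEP and SMAP in CR4, a write that keeps both bits (and toggles anything else) is allowed, while a write that clears SMAP is denied. Because the pins live on the host side, a compromised guest kernel cannot undo them the way it could with its own cr4_pinned_bits.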
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 31/05/2023 22:24, Sean Christopherson wrote:
On Tue, May 30, 2023, Rick P Edgecombe wrote:
On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:

Can the guest kernel ask the host VMM's emulated devices to DMA into the protected data? It should go through the host userspace mappings I think, which don't care about EPT permissions. Or did I miss where you are protecting that another way? There are a lot of easy ways to ask the host to write to guest memory that don't involve the EPT. You probably need to protect the host userspace mappings, and also the places in KVM that kmap a GPA provided by the guest.

Good point, I'll check this confused deputy attack. Extended KVM protections should indeed handle all ways to map guests' memory. I'm wondering if current VMMs would gracefully handle such new restrictions though.

I guess the host could map arbitrary data to the guest, so that needs to be handled, but how could the VMM (not the host kernel) bypass/update EPT initially used for the guest (and potentially later mapped to the host)?

Well traditionally both QEMU and KVM accessed guest memory via host mappings instead of the EPT. So I'm wondering what is stopping the guest from passing a protected gfn when setting up the DMA, and QEMU being enticed to write to it? The emulator as well would use these host userspace mappings and not consult the EPT IIRC.

I think Sean was suggesting host userspace should be more involved in this process, so perhaps it could protect its own alias of the protected memory, for example mprotect() it as read-only.

Ya, though "suggesting" is really "demanding, unless someone provides super strong justification for handling this directly in KVM". It's basically the same argument that led to Linux Security Modules: I'm all for KVM providing the framework and plumbing, but I don't want KVM to get involved in defining policy, threat models, etc.

I agree that KVM should not provide its own policy but only the building blocks to enforce one. There are two complementary points:
- policy definition by the guest, provided to KVM and the host;
- policy enforcement by KVM and the host.

A potential extension of this framework could be to enable the host to define its own policy for guests, but this would be a different threat model.

To avoid too much latency because of the host being involved in policy enforcement, I'd like to explore an asynchronous approach that would especially fit well for dynamic restrictions.
Re: [ANNOUNCE] KVM Microconference at LPC 2023
Hi, What is the status of this microconference proposal? We'd be happy to talk about Heki [1] and potentially other hypervisor supports. Regards, Mickaël [1] https://lore.kernel.org/all/20230505152046.6575-1-...@digikod.net/ On 26/05/2023 18:09, Mickaël Salaün wrote: See James Morris's proposal here: https://lore.kernel.org/all/17f62cb1-a5de-2020-2041-359b8e96b...@linux.microsoft.com/ On 26/05/2023 04:36, James Morris wrote: > [Side topic] > > Would folks be interested in a Linux Plumbers Conference MC on this > topic generally, across different hypervisors, VMMs, and architectures? > > If so, please let me know who the key folk would be and we can try writing > up an MC proposal. The fine-grain memory management proposal from James Gowans looks interesting, especially the "side-car" virtual machines: https://lore.kernel.org/all/88db2d9cb42e471692ff1feb0b9ca855906a9d95.ca...@amazon.com/ On 09/05/2023 11:55, Paolo Bonzini wrote: Hi all! We are planning on submitting a CFP to host a KVM Microconference at Linux Plumbers Conference 2023. To help justify the proposal, we would like to gather a list of folks that would likely attend, and crowdsource a list of topics to include in the proposal. For both this year and future years, the intent is that a KVM Microconference will complement KVM Forum, *NOT* supplant it. As you probably noticed, KVM Forum is going through a somewhat radical change in how it's organized; the conference is now free and (with some help from Red Hat) organized directly by the KVM and QEMU communities. Despite the unexpected changes and some teething pains, community response to KVM Forum continues to be overwhelmingly positive! KVM Forum will remain the venue of choice for KVM/userspace collaboration, for educational content covering both KVM and userspace, and to discuss new features in QEMU and other userspace projects. At least on the x86 side, however, the success of KVM Forum led us virtualization folks to operate in relative isolation. 
KVM depends on and impacts multiple subsystems (MM, scheduler, perf) in profound ways, and recently we’ve seen more and more ideas/features that require non-trivial changes outside KVM and buy-in from stakeholders that (typically) do not attend KVM Forum. Linux Plumbers Conference is a natural place to establish such collaboration within the kernel. Therefore, the aim of the KVM Microconference will be: * to provide a setting in which to discuss KVM and kernel internals * to increase collaboration and reduce friction with other subsystems * to discuss system virtualization issues that require coordination with other subsystems (such as VFIO, or guest support in arch/) Below is a rough draft of the planned CFP submission. Thanks! Paolo Bonzini (KVM Maintainer) Sean Christopherson (KVM x86 Co-Maintainer) Marc Zyngier (KVM ARM Co-Maintainer) === KVM Microconference === KVM (Kernel-based Virtual Machine) enables the use of hardware features to improve the efficiency, performance, and security of virtual machines created and managed by userspace. KVM was originally developed to host and accelerate "full" virtual machines running a traditional kernel and operating system, but has long since expanded to cover a wide array of use cases, e.g. hosting real time workloads, sandboxing untrusted workloads, deprivileging third party code, reducing the trusted computed base of security sensitive workloads, etc. As KVM's use cases have grown, so too have the requirements placed on KVM and the interactions between it and other kernel subsystems. The KVM Microconference will focus on how to evolve KVM and adjacent subsystems in order to satisfy new and upcoming requirements: serving guest memory that cannot be accessed by host userspace[1], providing accurate, feature-rich PMU/perf virtualization in cloud VMs[2], etc. Potential Topics: - Serving inaccessible/unmappable memory for KVM guests (protected VMs) - Optimizing mmu_notifiers, e.g. 
reducing TLB flushes and spurious zapping - Supporting multiple KVM modules (for non-disruptive upgrades) - Improving and hardening KVM+perf interactions - Implementing arch-agnostic abstractions in KVM (e.g. MMU) - Defining KVM requirements for hardware vendors - Utilizing "fault" injection to increase test coverage of edge cases - KVM vs VFIO (e.g. memory types, a rather hot topic on the ARM side) Key Attendees: - Paolo Bonzini (KVM Maintainer) - Sean Christopherson (KVM x86 Co-Maintainer) - Your name could be here! [1] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com [2] https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5q093k3rmueasf3vel39mgv5br9w...@mail.gmail.com
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 25/05/2023 20:34, Trilok Soni wrote: On 5/25/2023 6:25 AM, Mickaël Salaün wrote: On 24/05/2023 23:04, Trilok Soni wrote: On 5/5/2023 8:20 AM, Mickaël Salaün wrote: Hi, This patch series is a proof-of-concept that implements new KVM features (extended page tracking, MBEC support, CR pinning) and defines a new API to protect guest VMs. No VMM (e.g., Qemu) modification is required. The main idea being that kernel self-protection mechanisms should be delegated to a more privileged part of the system, hence the hypervisor. It is still the role of the guest kernel to request such restrictions according to its Only for the guest kernel images here? Why not for the host OS kernel? As explained in the Future work section, protecting the host would be useful, but that doesn't really fit with the KVM model. The Protected KVM project is a first step to help in this direction [11]. In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel is also part of the hypervisor. Embedded devices w/ Android you have mentioned below supports the host OS as well it seems, right? What do you mean? I think you have answered this above w/ pKVM and I was referring the host protection as well w/ Heki. The link/references below refers to the Android OS it seems and not guest VM. Do we suggest that all the functionalities should be implemented in the Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM). KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means that we may not control the related code. This patch series is dedicated to hypervisor-enforced kernel integrity, then KVM. I am hoping that whatever we suggest the interface here from the Guest to the Hypervisor becomes the ABI right? Yes, hypercalls are part of the KVM ABI. Sure. I just hope that they are extensible enough to support for other Hypervisors too. I am not sure if they are on this list like ACRN / Xen and see if it fits their need too. KVM, Hyper-V and Xen mailing lists are CCed. 
The KVM hypercalls are specific to KVM, but this patch series also includes a common guest API intended to be used with all hypervisors. Is there any other Hypervisor you plan to test this feature as well? We're also working on Hyper-V. # Current limitations The main limitation of this patch series is the statically enforced permissions. This is not an issue for kernels without module but this needs to be addressed. Mechanisms that dynamically impact kernel executable memory are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such code will need to be authenticated. Because the hypervisor is highly privileged and critical to the security of all the VMs, we don't want to implement a code authentication mechanism in the hypervisor itself but delegate this verification to something much less privileged. We are thinking of two ways to solve this: implement this verification in the VMM or spawn a dedicated special VM (similar to Windows's VBS). There are pros and cons to each approach: complexity, verification code ownership (guest's or VMM's), access to guest memory (i.e., confidential computing). Do you foresee performance regressions due to the amount of tracking here? The performance impact of execution prevention should be negligible because once configured the hypervisor does nothing except catch illegitimate access attempts. Yes, if you are using the static kernel only and not considering the other dynamic patching features like explained. They need to be thought upon differently to reduce the likely impact. What do you mean? We plan to support dynamic code, and performance is of course part of the requirement. Production kernels do have a lot of tracepoints and we use them as a feature in the GKI kernel for the vendor hooks implementation, and in those cases every vendor driver is a module. As explained in this section, dynamic kernel modifications such as tracepoints or modules are not currently supported by this patch series.
Handling tracepoints is possible but requires more work to define and check legitimate changes. This proposal is still useful for static kernels though. Separate VM further fragments this design and delegates more of it to proprietary solutions? What do you mean? KVM is not a proprietary solution. Ah, I was referring to the VBS Windows VM mentioned in the above text. Is it open-source? The reference to the VM (or dedicated VM) didn't mention that the VM itself will be open-source and run a Linux kernel. This patch series is dedicated to KVM. Windows VBS was only mentioned as a comparable (but much more advanced) set of features. Everything required to use these new KVM features is and will be open-source. There is nothing to worry about regarding licensing; the goal is to make it widely and freely available to protect users. For dynamic checks, this would require code not run by KVM itself, but either the VMM or a dedicated VM
Re: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers
On 08/05/2023 23:11, Wei Liu wrote: On Fri, May 05, 2023 at 05:20:42PM +0200, Mickaël Salaün wrote: This enables guests to lock their CR0 and CR4 registers with a subset of X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE and X86_CR4_CET flags. The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first is to identify the control register, and the second is a bit mask to pin (i.e. mark as read-only). These register flags should already be pinned by Linux guests, but once compromised, this self-protection mechanism could be disabled, which is not the case with this dedicated hypercall. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-6-...@digikod.net [...] hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE); if (is_unrestricted_guest(vcpu)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ffab64d08de3..a529455359ac 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr) return value; } +#ifdef CONFIG_HEKI + +extern unsigned long cr4_pinned_mask; + Can this be moved to a header file? Yep, but I'm not sure which one. Any preference Kees? +static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr, + unsigned long pin) +{ + if (!pin) + return -KVM_EINVAL; + + switch (cr) { + case 0: + /* Cf. arch/x86/kernel/cpu/common.c */ + if (!(pin & X86_CR0_WP)) + return -KVM_EINVAL; + + if ((read_cr0() & pin) != pin) + return -KVM_EINVAL; + + atomic_long_or(pin, &kvm->heki_pinned_cr0); + return 0; + case 4: + /* Checks for irrelevant bits. */ + if ((pin & cr4_pinned_mask) != pin) + return -KVM_EINVAL; + It is enforcing the host mask on the guest, right?
If the guest's set is a superset of the host's then it will get rejected. + /* Ignores bits not present in host. */ + pin &= __read_cr4(); + atomic_long_or(pin, &kvm->heki_pinned_cr4); We assume that the host's mask is a superset of the guest's mask. I guess we should check the absolute supported bits instead, even if it would be weird for the host to not support these bits. + return 0; + } + return -KVM_EINVAL; +} + +int heki_check_cr(const struct kvm *const kvm, const unsigned long cr, + const unsigned long val) +{ + unsigned long pinned; + + switch (cr) { + case 0: + pinned = atomic_long_read(&kvm->heki_pinned_cr0); + if ((val & pinned) != pinned) { + pr_warn_ratelimited( + "heki-kvm: Blocked CR0 update: 0x%lx\n", val); I think if the message contains the VM and VCPU identifier it will become more useful. Indeed, and this should be the case for all log messages, but I'd left that for future work. ;) I'll update the logs for the next series with a new kvm_warn_ratelimited() helper using VCPU's PID.
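The pin-then-check logic discussed in this review can be reduced to a small standalone sketch. It assumes a plain (non-atomic) per-VM mask and hypothetical names (struct pinned_crs, pin_bits, cr_write_allowed). It is not the patch's code, which uses atomics and also validates the requested bits against the host.

```c
#include <stdbool.h>

/* Architectural bit positions: CR0.WP is bit 16, CR4.SMEP 20, CR4.SMAP 21. */
#define X86_CR0_WP	(1UL << 16)
#define X86_CR4_SMEP	(1UL << 20)
#define X86_CR4_SMAP	(1UL << 21)

/* Hypothetical per-VM state: the bits the guest asked to pin. */
struct pinned_crs {
	unsigned long cr0;
	unsigned long cr4;
};

/* Pinning only ever adds bits; this sketch offers no way to unpin. */
static void pin_bits(unsigned long *pinned, unsigned long mask)
{
	*pinned |= mask;
}

/* A CR write is allowed only if every pinned bit stays set in the new value. */
static bool cr_write_allowed(unsigned long pinned, unsigned long val)
{
	return (val & pinned) == pinned;
}
```

With SMEP and SMAP pinned, a write that keeps both bits set passes the check, while a write that clears SMAP is rejected. An empty mask allows any value.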
Re: [PATCH v1 3/9] virt: Implement Heki common code
On 17/05/2023 14:47, Madhavan T. Venkataraman wrote: Sorry for the delay. See inline... On 5/8/23 12:29, Wei Liu wrote: On Fri, May 05, 2023 at 05:20:40PM +0200, Mickaël Salaün wrote: From: Madhavan T. Venkataraman Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use the hypervisor to enhance guest virtual machine security. Configuration: Define the config variables for the feature. This feature depends on support from the architecture as well as the hypervisor. Enabling HEKI: Define a kernel command line parameter "heki" to turn the feature on or off. By default, Heki is on. For such a newfangled feature can we have it off by default? Especially when there are unsolved issues around dynamically loaded code. Yes. We can certainly do that. By default the Kconfig option should definitely be off. We also need to change the Kconfig option to only be set if kernel modules, JIT, kprobes and other dynamic text change features are disabled at build time (see discussion with Sean). With this new Kconfig option for the static case, I think the boot option should be on by default because otherwise it would not really be possible to switch back to on later without taking the risk of silently breaking users' machines. However, we should rename this option to something like "heki_static" to be in line with the new Kconfig option. The goal of Heki is to improve and complement kernel self-protection mechanisms (which don't have boot time options), and to make it available to everyone, see https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings In practice, it would then be kind of useless to be required to set a boot option to enable Heki (rather than to disable it). [...]
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3604074a878b..5cf5a7a97811 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -297,6 +297,7 @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE + select ARCH_SUPPORTS_HEKI if X86_64 Why is there a restriction on X86_64? We want to get the PoC working and reviewed on X64 first. We have tested this only on X64 so far. X86_64 includes Intel CPUs, which can support EPT and MBEC, which are a requirement for Heki. ARM might have similar features but we're focused on x86 for now. As a side note, I only have access to an Intel machine, which means that I cannot work on AMD support. However, I'll be pleased to implement such support if I get access to a machine with a recent AMD CPU. config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h index a6e8373a5170..42ef1e33b8a5 100644 --- a/arch/x86/include/asm/sections.h +++ b/arch/x86/include/asm/sections.h [...] +#ifdef CONFIG_HEKI + +/* + * Gather all of the statically defined sections so heki_late_init() can + * protect these sections in the host page table. + * + * The sections are defined under "SECTIONS" in vmlinux.lds.S + * Keep this array in sync with SECTIONS. + */ This seems a bit fragile, because it requires constant attention from people who care about this functionality. Can this table be automatically generated? We realize that. But I don't know of a way this can be automatically generated. Also, the permissions for each section are specific to the use of that section. The developer who introduces a new section is the one who will know what the permissions should be. If anyone has any ideas of how we can generate this table automatically or even just add a build time check of some sort, please let us know.
One clean solution might be to parse the vmlinux.lds.S file, extract the sections and their permissions, and fill them into an automatically generated header file. Another way to do it would be to extract sections and associated permissions with objdump, but that could be an issue because of longer build time. A better solution would be to extract such sections and associated permissions at boot time. I guess the kernel already has such helpers used in early boot.
Re: [PATCH v1 6/9] KVM: x86: Add Heki hypervisor support
On 08/05/2023 23:18, Wei Liu wrote: On Fri, May 05, 2023 at 05:20:43PM +0200, Mickaël Salaün wrote: From: Madhavan T. Venkataraman Each supported hypervisor in x86 implements a struct x86_hyper_init to define the init functions for the hypervisor. Define a new init_heki() entry point in struct x86_hyper_init. Hypervisors that support Heki must define this init_heki() function. Call init_heki() of the chosen hypervisor in init_hypervisor_platform(). Create a heki_hypervisor structure that each hypervisor can fill with its data and functions. This will allow the Heki feature to work in a hypervisor agnostic way. Declare and initialize a "heki_hypervisor" structure for KVM so KVM can support Heki. Define the init_heki() function for KVM. In init_heki(), set the hypervisor field in the generic "heki" structure to the KVM "heki_hypervisor". After this point, generic Heki code can access the KVM Heki data and functions. [...] +static void kvm_init_heki(void) +{ + long err; + + if (!kvm_para_available()) + /* Cannot make KVM hypercalls. */ + return; + + err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1); Why not do a proper version check or capability check here? If the ABI or supported features ever change then we have something to rely on? The attributes will indeed get extended, but I wanted to have a simple proposal for now. Do you mean to get the version of this hypercall e.g., with a dedicated flag, like with the landlock_create_ruleset/LANDLOCK_CREATE_RULESET_VERSION syscall? Thanks, Wei.
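As a rough illustration of the version-probe idea raised here (in the spirit of Landlock's LANDLOCK_CREATE_RULESET_VERSION flag), the sketch below mocks both sides in userspace. Every name in it (HEKI_HC_FLAG_VERSION, HEKI_ABI_VERSION, fake_lock_mem_hypercall, guest_probe_heki) is hypothetical and not part of the KVM ABI or of this series.

```c
#include <errno.h>

/* Hypothetical flag: ask for the ABI version instead of locking anything. */
#define HEKI_HC_FLAG_VERSION	1UL
/* Hypothetical ABI version reported by this mock hypervisor. */
#define HEKI_ABI_VERSION	1L

/* Stand-in for the hypervisor side of the hypercall. */
static long fake_lock_mem_hypercall(unsigned long gpa, unsigned long len,
				    unsigned long flags)
{
	if (flags & HEKI_HC_FLAG_VERSION) {
		/* Version query: the other arguments must be zero. */
		if (gpa || len)
			return -EINVAL;
		return HEKI_ABI_VERSION;
	}
	if (!len)
		return -EINVAL;
	/* A real implementation would lock [gpa, gpa + len) here. */
	return 0;
}

/*
 * Guest side: enable the feature only if a known ABI version is reported.
 * Returns the version (>= 1) on success, -1 if the feature is unusable.
 */
static int guest_probe_heki(void)
{
	long version = fake_lock_mem_hypercall(0, 0, HEKI_HC_FLAG_VERSION);

	if (version < 1)
		return -1;
	return (int)version;
}
```

Compared with probing through invalid arguments (the -1, -1, -1 call in the quoted patch), an explicit version query gives the guest a value it can compare against as attributes are added later.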
Re: [ANNOUNCE] KVM Microconference at LPC 2023
See James Morris's proposal here: https://lore.kernel.org/all/17f62cb1-a5de-2020-2041-359b8e96b...@linux.microsoft.com/ On 26/05/2023 04:36, James Morris wrote: > [Side topic] > > Would folks be interested in a Linux Plumbers Conference MC on this > topic generally, across different hypervisors, VMMs, and architectures? > > If so, please let me know who the key folk would be and we can try writing > up an MC proposal. The fine-grain memory management proposal from James Gowans looks interesting, especially the "side-car" virtual machines: https://lore.kernel.org/all/88db2d9cb42e471692ff1feb0b9ca855906a9d95.ca...@amazon.com/ On 09/05/2023 11:55, Paolo Bonzini wrote: Hi all! We are planning on submitting a CFP to host a KVM Microconference at Linux Plumbers Conference 2023. To help justify the proposal, we would like to gather a list of folks that would likely attend, and crowdsource a list of topics to include in the proposal. For both this year and future years, the intent is that a KVM Microconference will complement KVM Forum, *NOT* supplant it. As you probably noticed, KVM Forum is going through a somewhat radical change in how it's organized; the conference is now free and (with some help from Red Hat) organized directly by the KVM and QEMU communities. Despite the unexpected changes and some teething pains, community response to KVM Forum continues to be overwhelmingly positive! KVM Forum will remain the venue of choice for KVM/userspace collaboration, for educational content covering both KVM and userspace, and to discuss new features in QEMU and other userspace projects. At least on the x86 side, however, the success of KVM Forum led us virtualization folks to operate in relative isolation. KVM depends on and impacts multiple subsystems (MM, scheduler, perf) in profound ways, and recently we’ve seen more and more ideas/features that require non-trivial changes outside KVM and buy-in from stakeholders that (typically) do not attend KVM Forum. 
Linux Plumbers Conference is a natural place to establish such collaboration within the kernel. Therefore, the aim of the KVM Microconference will be: * to provide a setting in which to discuss KVM and kernel internals * to increase collaboration and reduce friction with other subsystems * to discuss system virtualization issues that require coordination with other subsystems (such as VFIO, or guest support in arch/) Below is a rough draft of the planned CFP submission. Thanks! Paolo Bonzini (KVM Maintainer) Sean Christopherson (KVM x86 Co-Maintainer) Marc Zyngier (KVM ARM Co-Maintainer) === KVM Microconference === KVM (Kernel-based Virtual Machine) enables the use of hardware features to improve the efficiency, performance, and security of virtual machines created and managed by userspace. KVM was originally developed to host and accelerate "full" virtual machines running a traditional kernel and operating system, but has long since expanded to cover a wide array of use cases, e.g. hosting real time workloads, sandboxing untrusted workloads, deprivileging third party code, reducing the trusted computed base of security sensitive workloads, etc. As KVM's use cases have grown, so too have the requirements placed on KVM and the interactions between it and other kernel subsystems. The KVM Microconference will focus on how to evolve KVM and adjacent subsystems in order to satisfy new and upcoming requirements: serving guest memory that cannot be accessed by host userspace[1], providing accurate, feature-rich PMU/perf virtualization in cloud VMs[2], etc. Potential Topics: - Serving inaccessible/unmappable memory for KVM guests (protected VMs) - Optimizing mmu_notifiers, e.g. reducing TLB flushes and spurious zapping - Supporting multiple KVM modules (for non-disruptive upgrades) - Improving and hardening KVM+perf interactions - Implementing arch-agnostic abstractions in KVM (e.g. 
MMU) - Defining KVM requirements for hardware vendors - Utilizing "fault" injection to increase test coverage of edge cases - KVM vs VFIO (e.g. memory types, a rather hot topic on the ARM side) Key Attendees: - Paolo Bonzini (KVM Maintainer) - Sean Christopherson (KVM x86 Co-Maintainer) - Your name could be here! [1] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com [2] https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5q093k3rmueasf3vel39mgv5br9w...@mail.gmail.com
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 25/05/2023 17:52, Edgecombe, Rick P wrote: On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote: [ snip ] The kernel often creates writable aliases in order to write to protected data (kernel text, etc). Some of this is done right as text is being first written out (alternatives for example), and some happens way later (jump labels, etc). So for verification, I wonder what stage you would be verifying? If you want to verify the end state, you would have to maintain knowledge in the verifier of all the touch-ups the kernel does. I think it would get very tricky. For now, in the static kernel case, all rodata and text GPA is restricted, so aliasing such memory in a writable way before or after the KVM enforcement would still restrict write access to this memory, which could be an issue but not a security one. Do you have such examples in mind? On x86, look at all the callers of the text_poke() family. In arch/x86/include/asm/text-patching.h. OK, thanks! It also seems it will be a decent ask for the guest kernel to keep track of GPA permissions as well as normal virtual memory permissions, if this thing is not widely used. This would indeed be required to properly handle the dynamic cases. So I'm wondering if you could go in two directions with this: 1. Make this a feature only for super locked down kernels (no modules, etc). Forbid any configurations that might modify text. But eBPF is used for seccomp, so you might be turning off some security protections to get this. Good idea. For "super locked down kernels" :) , we should disable all kernel executable changes with the related kernel build configuration (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such legitimate access. This looks like an acceptable initial feature. How many users do you think will want this protection but not protections that would have to be disabled? The main one that came to mind for me is cBPF seccomp stuff.
But also, the alternative to JITing cBPF is the eBPF interpreter which AFAIU is considered a juicy enough target for speculative attacks that they created an option to compile it out. And leaving an interpreter in the kernel means any data could be "executed" in the normal non-speculative scenario, kind of working around the hypervisor executable protections. Dropping e/cBPF entirely would be an option, but then I wonder how many users you have left. Hopefully that is all correct, it's hard to keep track with the pace of BPF development. seccomp-bpf doesn't rely on JIT, so it is not an issue. For eBPF, JIT is optional, but other text changes may be required according to the eBPF program type (e.g. using kprobes). I wonder if it might be a good idea to POC the guest side before settling on the KVM interface. Then you can also look at the whole thing and judge how much usage it would get for the different options of restrictions. The next step is to handle dynamic permissions, but it will be easier to first implement that in KVM itself (which already has the required authentication code). The current interface may be flexible enough though, only new attribute flags should be required (and potentially an async mode). Anyway, this will make it possible to look at the whole thing. 2. Loosen the rules to allow the protections to not be so one-way enable. Get less security, but used more widely. This is our goal. I think both static and dynamic cases are legitimate and have value according to the level of security sought. This should be a build-time configuration. Yea, the proper way to do this is probably to move all text handling stuff into a separate domain of some sort, like you mentioned elsewhere. It would be quite a job. Not necessarily to move this code, but to make sure that the changes are legitimate (e.g. text signatures, legitimate addresses). This doesn't need to be perfect but it should improve the current state by increasing the cost of attacks.
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 25/05/2023 15:59, Mickaël Salaün wrote: On 25/05/2023 00:20, Edgecombe, Rick P wrote: On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote: # How does it work? This implementation mainly leverages KVM capabilities to control the Second Layer Address Translation (or the Two Dimensional Paging e.g., Intel's EPT or AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) introduced with the Kaby Lake (7th generation) architecture. This makes it possible to set permissions on memory pages in a complementary way to the guest kernel's managed memory permissions. Once these permissions are set, they are locked and there is no way back. A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest kernel to lock a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a specific set of pages (allow-list approach), and the second only allows kernel execution for a set of pages (deny-list approach). The current implementation sets the whole kernel's .rodata (i.e., any const or __ro_after_init variables, which includes critical security data such as LSM parameters) and .text sections as non-writable, and the .text section is the only one where kernel execution is allowed. This is possible thanks to the new MBEC support also brought by this series (otherwise the vDSO would have to be executable). Thanks to this hardware support (VT-x, EPT and MBEC), the performance impact of such guest protection is negligible. The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some of their CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP), which is another complementary hardening mechanism. Heki can be enabled with the heki=1 boot command argument. Can the guest kernel ask the host VMM's emulated devices to DMA into the protected data? It should go through the host userspace mappings I think, which don't care about EPT permissions.
Or did I miss where you are protecting that another way? There are a lot of easy ways to ask the host to write to guest memory that don't involve the EPT. You probably need to protect the host userspace mappings, and also the places in KVM that kmap a GPA provided by the guest. Good point, I'll check this confused deputy attack. Extended KVM protections should indeed handle all ways to map guests' memory. I'm wondering if current VMMs would gracefully handle such new restrictions though. I guess the host could map arbitrary data to the guest, so that needs to be handled, but how could the VMM (not the host kernel) bypass/update EPT initially used for the guest (and potentially later mapped to the host)?
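The one-way page-range locking described in the quoted cover letter can be modelled with a toy userspace sketch. The attribute values, the eight-page "guest", and all function names below are assumptions for illustration only. The real enforcement happens in EPT/MBEC, not in a table like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Attribute names follow the cover letter; the bit values are assumptions. */
#define HEKI_ATTR_MEM_NOWRITE	(1UL << 0)
#define HEKI_ATTR_MEM_EXEC	(1UL << 1)

#define NR_PAGES 8	/* toy guest with eight pages */

static unsigned long page_attrs[NR_PAGES];
static bool exec_policy_set;	/* true once any page was marked EXEC */

/* Lock attributes on pages [first, first + count). Bits are never cleared. */
static int lock_range(size_t first, size_t count, unsigned long attrs)
{
	if (count == 0 || first >= NR_PAGES || count > NR_PAGES - first)
		return -1;
	for (size_t i = first; i < first + count; i++)
		page_attrs[i] |= attrs;
	if (attrs & HEKI_ATTR_MEM_EXEC)
		exec_policy_set = true;
	return 0;
}

/* Writes are denied only on pages explicitly marked NOWRITE. */
static bool write_allowed(size_t page)
{
	return (page_attrs[page] & HEKI_ATTR_MEM_NOWRITE) == 0;
}

/* Once an EXEC set exists, kernel execution is allowed only inside it. */
static bool kernel_exec_allowed(size_t page)
{
	if (!exec_policy_set)
		return true;
	return (page_attrs[page] & HEKI_ATTR_MEM_EXEC) != 0;
}
```

Marking pages 0-1 as NOWRITE|EXEC (standing in for .text) and page 2 as NOWRITE (standing in for .rodata) leaves page 3 writable but no longer executable in kernel mode, which mirrors the two attribute semantics above.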
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 25/05/2023 00:20, Edgecombe, Rick P wrote: On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote: # How does it work? This implementation mainly leverages KVM capabilities to control the Second Layer Address Translation (or the Two Dimensional Paging e.g., Intel's EPT or AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) introduced with the Kaby Lake (7th generation) architecture. This makes it possible to set permissions on memory pages in a complementary way to the guest kernel's managed memory permissions. Once these permissions are set, they are locked and there is no way back. A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest kernel to lock a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a specific set of pages (allow-list approach), and the second only allows kernel execution for a set of pages (deny-list approach). The current implementation sets the whole kernel's .rodata (i.e., any const or __ro_after_init variables, which includes critical security data such as LSM parameters) and .text sections as non-writable, and the .text section is the only one where kernel execution is allowed. This is possible thanks to the new MBEC support also brought by this series (otherwise the vDSO would have to be executable). Thanks to this hardware support (VT-x, EPT and MBEC), the performance impact of such guest protection is negligible. The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some of their CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP), which is another complementary hardening mechanism. Heki can be enabled with the heki=1 boot command argument. Can the guest kernel ask the host VMM's emulated devices to DMA into the protected data? It should go through the host userspace mappings I think, which don't care about EPT permissions. Or did I miss where you are protecting that another way?
There are a lot of easy ways to ask the host to write to guest memory that don't involve the EPT. You probably need to protect the host userspace mappings, and also the places in KVM that kmap a GPA provided by the guest. Good point, I'll check this confused deputy attack. Extended KVM protections should indeed handle all ways to map guests' memory. I'm wondering if current VMMs would gracefully handle such new restrictions though. [ snip ] # Current limitations The main limitation of this patch series is the statically enforced permissions. This is not an issue for kernels without module but this needs to be addressed. Mechanisms that dynamically impact kernel executable memory are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such code will need to be authenticated. Because the hypervisor is highly privileged and critical to the security of all the VMs, we don't want to implement a code authentication mechanism in the hypervisor itself but delegate this verification to something much less privileged. We are thinking of two ways to solve this: implement this verification in the VMM or spawn a dedicated special VM (similar to Windows's VBS). There are pros and cons to each approach: complexity, verification code ownership (guest's or VMM's), access to guest memory (i.e., confidential computing). The kernel often creates writable aliases in order to write to protected data (kernel text, etc). Some of this is done right as text is being first written out (alternatives for example), and some happens way later (jump labels, etc). So for verification, I wonder what stage you would be verifying? If you want to verify the end state, you would have to maintain knowledge in the verifier of all the touch-ups the kernel does. I think it would get very tricky.
For now, in the static kernel case, all rodata and text GPA is restricted, so aliasing such memory in a writable way before or after the KVM enforcement would still restrict write access to this memory, which could be an issue but not a security one. Do you have such examples in mind? It also seems it will be a decent ask for the guest kernel to keep track of GPA permissions as well as normal virtual memory permissions, if this thing is not widely used. This would indeed be required to properly handle the dynamic cases. So I'm wondering if you could go in two directions with this: 1. Make this a feature only for super locked down kernels (no modules, etc). Forbid any configurations that might modify text. But eBPF is used for seccomp, so you might be turning off some security protections to get this. Good idea. For "super locked down kernels" :) , we should disable all kernel executable changes with the related kernel build configuration (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such legitimate access. This looks like an acceptable initial feature. 2. Loosen the rules to allow the protections to not be so one-
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 24/05/2023 23:04, Trilok Soni wrote:
> On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
>> Hi,
>>
>> This patch series is a proof-of-concept that implements new KVM features (extended page tracking, MBEC support, CR pinning) and defines a new API to protect guest VMs. No VMM (e.g., Qemu) modification is required.
>>
>> The main idea being that kernel self-protection mechanisms should be delegated to a more privileged part of the system, hence the hypervisor. It is still the role of the guest kernel to request such restrictions according to its
>
> Only for the guest kernel images here? Why not for the host OS kernel?

As explained in the Future work section, protecting the host would be useful, but that doesn't really fit with the KVM model. The Protected KVM project is a first step to help in this direction [11].

In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel is also part of the hypervisor.

> Embedded devices w/ Android that you mention below support the host OS as well, it seems, right?

What do you mean?

> Do we suggest that all the functionalities should be implemented in the Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).

KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means that we may not control the related code.

This patch series is dedicated to hypervisor-enforced kernel integrity, hence KVM.

> I am hoping that whatever interface we suggest here from the Guest to the Hypervisor becomes the ABI, right?

Yes, hypercalls are part of the KVM ABI.

>> # Current limitations
>>
>> The main limitation of this patch series is the statically enforced permissions. This is not an issue for kernels without modules but this needs to be addressed. Mechanisms that dynamically impact kernel executable memory are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such code will need to be authenticated. Because the hypervisor is highly privileged and critical to the security of all the VMs, we don't want to implement a code authentication mechanism in the hypervisor itself but delegate this verification to something much less privileged. We are thinking of two ways to solve this: implement this verification in the VMM or spawn a dedicated special VM (similar to Windows's VBS). There are pros and cons to each approach: complexity, verification code ownership (guest's or VMM's), access to guest memory (i.e., confidential computing).
>
> Do you foresee performance regressions due to a lot of tracking here?

The performance impact of execution prevention should be negligible because once configured the hypervisor does nothing except catch illegitimate access attempts.

> Production kernels do have a lot of tracepoints and we use them as a feature in the GKI kernel for the vendor hooks implementation and in those cases every vendor driver is a module.

As explained in this section, dynamic kernel modifications such as tracepoints or modules are not currently supported by this patch series. Handling tracepoints is possible but requires more work to define and check legitimate changes. This proposal is still useful for static kernels though.

> Separate VM further fragments this design and delegates more of it to proprietary solutions?

What do you mean? KVM is not a proprietary solution.

For dynamic checks, this would require code not run by KVM itself, but either the VMM or a dedicated VM. In this case, the dynamic authentication code could come from the guest VM or from the VMM itself. In the former case, it is more challenging from a security point of view but doesn't rely on an external (proprietary) solution. In the latter case, open-source VMMs should implement the specification to provide the required service (e.g. check kernel module signature).

The goal of the common API layer provided by this RFC is to share code as much as possible between different hypervisor backends.

> Do you have any performance numbers w/ current RFC?

No, but the only hypervisor performance impact is at boot time and should be negligible. I'll try to get some numbers for the hardware-enforcement impact, but it should be negligible too.
Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions
On 05/05/2023 18:44, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
>> Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to set EPT permissions on a set of page ranges.
>
> IMO, manipulation of protections, both for memory (this patch) and CPU state (control registers in the next patch) should come from userspace. I have no objection to KVM providing plumbing if necessary, but I think userspace needs to have full control over the actual state.

By user space, do you mean the host user space or the guest user space?

About the guest user space, I see several issues to delegate this kind of control:
- These are restrictions only relevant to the kernel.
- The threat model is to protect against user space as early as possible.
- It would be more complex for no obvious gain.

This patch series is an extension of the kernel self-protection mechanisms, and they are not configured by user space.

> One of the things that caused Intel's control register pinning series to stall out was how to handle edge cases like kexec() and reboot. Deferring to userspace means the kernel doesn't need to define policy, e.g. when to unprotect memory, and avoids questions like "should userspace be able to overwrite pinned control registers".

The idea is to authenticate every change. For kexec, the VMM (or something else) would have to authenticate the new kernel. Do you have something else in mind that could legitimately require such memory or CR changes?

> And like the confidential VM use case, keeping userspace in the loop is a big benefit, e.g. the guest can't circumvent protections by coercing userspace into writing to protected memory.

I don't understand this part. Are you talking about the host user space? How could the guest circumvent protections?
Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
On 05/05/2023 18:28, Sean Christopherson wrote:
> On Fri, May 05, 2023, Mickaël Salaün wrote:
>> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
>> index eb186bc57f6a..a7fb4ff888e6 100644
>> --- a/arch/x86/include/asm/kvm_page_track.h
>> +++ b/arch/x86/include/asm/kvm_page_track.h
>> @@ -3,6 +3,7 @@
>>  #define _ASM_X86_KVM_PAGE_TRACK_H
>>
>>  enum kvm_page_track_mode {
>> +	KVM_PAGE_TRACK_PREWRITE,
>
> Heh, just when I decide to finally kill off support for multiple modes[1] :-)
>
> My assessment from that changelog still holds true for this case:
>
>   Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general.
>
>   E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework.
>
>   A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access.
>
> Of particular relevance:
>
> - Using the page-track machinery is inefficient because the guest is likely going to write-protect a minority of its memory. And this
>
>     select KVM_EXTERNAL_WRITE_TRACKING if KVM
>
>   is particularly nasty because simply enabling HEKI in the Kconfig will cause KVM to allocate rmaps and gfn tracking.
>
> - There's no need to reference count the protection, i.e. 15 of the 16 bits of gfn_track are dead weight.
>
> - As proposed, adding a second "mode" would double the cost of gfn tracking.
>
> - Tying the protections to the memslots will create an impossible-to-maintain ABI because the protections will be lost if the owning memslot is deleted and recreated.
>
> - The page-track framework provides incomplete protection and will lead to an ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map().
>
> - The scaling and maintenance issues will only get worse if/when someone tries to support dropping read and/or execute permissions, e.g. for execute-only.
>
> - The code is x86-only, and is likely to stay that way for the foreseeable future.
>
> The proposed alternative is to piggyback the memory attributes implementation[2] that is being added (if all goes according to plan) for confidential VMs. This use case (dropping permissions) came up not too long ago[3], which is why I have a ready-made answer.
>
> I have no doubt that we'll need to solve performance and scaling issues with the memory attributes implementation, e.g. to utilize xarray multi-range support instead of storing information on a per-4KiB-page basis, but AFAICT, the core idea is sound. And a very big positive from a maintenance perspective is that any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also benefit the other use case.
>
> [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
> [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
> [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com

I agree, I used this mechanism because it was easier at first to rely on a previous work, but while I was working on the MBEC support, I realized that it's not the optimal way to do it.

I was thinking about using a new special EPT bit similar to EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you think?
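
To put rough numbers on the gfn tracking cost raised in the review above, here is a back-of-the-envelope sketch. It assumes one 16-bit counter per 4 KiB guest page per tracking mode (the "16 bits of gfn_track" mentioned by Sean) and ignores any per-memslot overhead. The function name and the constants are illustrative and are not KVM code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB pages, one 2-byte counter per page. */
#define MODEL_PAGE_SIZE     4096ULL
#define MODEL_COUNTER_BYTES 2ULL

/*
 * Bytes of tracking metadata for a guest of guest_bytes with the given
 * number of page-track modes. The cost depends only on guest size and
 * mode count, not on how many pages are actually protected.
 */
uint64_t model_gfn_track_bytes(uint64_t guest_bytes, unsigned int modes)
{
	uint64_t pages = guest_bytes / MODEL_PAGE_SIZE;

	return pages * MODEL_COUNTER_BYTES * modes;
}
```

Under these assumptions a 4 GiB guest needs 2 MiB of counters with one mode and 4 MiB with two, even if only the kernel's .text and .rodata end up protected.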
[PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions
Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to set EPT permissions on a set of page ranges.

This hypercall takes three arguments. The first contains the GPA pointing to an array of struct heki_pa_range. The second argument is the size of the array, not the number of elements. The third argument is for future-proofing and is designed to contain optional flags (e.g. to change the array type), but must be zero for now.

The struct heki_pa_range contains a GFN that starts the range and another that indicates the last (inclusive) page. A bit field of attributes is tied to this range. The HEKI_ATTR_MEM_NOWRITE attribute is interpreted as a removal of the EPT write permission to deny any write access from the guest through its lifetime. We choose "nowrite" because "read-only" excludes execution, it follows a deny-list approach, and most importantly because it is an incremental addition to the status quo (i.e., everything is allowed from the TDP point of view). This is implemented thanks to the KVM_PAGE_TRACK_PREWRITE mode previously introduced.

The page ranges recording is currently implemented with a static array of 16 elements to make it simple, but this mechanism will be dynamic in a follow-up.

Define a kernel command line parameter "heki" to turn the feature on or off. By default, Heki is turned on.

Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T.
Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-5-...@digikod.net --- Documentation/virt/kvm/x86/hypercalls.rst | 17 +++ arch/x86/kvm/x86.c| 169 ++ include/linux/kvm_host.h | 13 ++ include/uapi/linux/kvm_para.h | 1 + virt/kvm/kvm_main.c | 4 + 5 files changed, 204 insertions(+) diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst index 10db7924720f..0ec79cc77f53 100644 --- a/Documentation/virt/kvm/x86/hypercalls.rst +++ b/Documentation/virt/kvm/x86/hypercalls.rst @@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL. + +9. KVM_HC_LOCK_MEM_PAGE_RANGES +-- + +:Architecture: x86 +:Status: active +:Purpose: Request memory page ranges to be restricted. + +- a0: physical address of a struct heki_pa_range array +- a1: size of the array +- a2: optional flags, must be 0 for now + +The hypercall lets a guest request memory permissions to be removed for itself, +identified with set of physical page ranges (GFNs). The HEKI_ATTR_MEM_NOWRITE +memory page range attribute forbids related modification to the guest. + +Returns 0 on success or a KVM error code otherwise. 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd05f42c9913..ffab64d08de3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -59,6 +59,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -9596,6 +9597,161 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
 	return;
 }
 
+#ifdef CONFIG_HEKI
+
+static int heki_page_track_add(struct kvm *const kvm, const gfn_t gfn,
+			       const enum kvm_page_track_mode mode)
+{
+	struct kvm_memory_slot *slot;
+	int idx;
+
+	BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING));
+
+	idx = srcu_read_lock(&kvm->srcu);
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot) {
+		srcu_read_unlock(&kvm->srcu, idx);
+		return -EINVAL;
+	}
+
+	write_lock(&kvm->mmu_lock);
+	kvm_slot_page_track_add_page(kvm, slot, gfn, mode);
+	write_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+	return 0;
+}
+
+static bool
+heki_page_track_prewrite(struct kvm_vcpu *const vcpu, const gpa_t gpa,
+			 struct kvm_page_track_notifier_node *const node)
+{
+	const gfn_t gfn = gpa_to_gfn(gpa);
+	const struct kvm *const kvm = vcpu->kvm;
+	size_t i;
+
+	/* Checks if it is our own tracked pages, or those of someone else. */
+	for (i = 0; i < HEKI_GFN_MAX; i++) {
+		if (gfn >= kvm->heki_gfn_no_write[i].start &&
+		    gfn <= kvm->heki_gfn_no_write[i].end)
+			return false;
+	}
+
+	return true;
+}
+
+static int kvm_heki_init_vm(struct kvm *const kvm)
+{
+	struct kvm_page_track_notifier_node *const node =
+		kzalloc(sizeof(*node), GFP_KERNEL);
+
+	if
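
As a reader's aid for the hypercall arguments described in the commit message of this patch, here is a minimal userspace sketch (not the patch's code). The struct field names, the model_ prefix and the attribute bit values are assumptions made for illustration, as is the requirement that the size be a whole number of elements. Only the inclusive range semantics, the "size in bytes, not element count" rule for a1, and the "a2 must be zero" rule come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bit values; the real ones are defined by the series. */
#define MODEL_ATTR_MEM_NOWRITE (1ULL << 0)
#define MODEL_ATTR_MEM_EXEC    (1ULL << 1)

/* Hypothetical layout: first GFN, last GFN (inclusive), attribute bit field. */
struct model_pa_range {
	uint64_t gfn_start;
	uint64_t gfn_end;
	uint64_t attributes;
};

/* a1 is the size of the array in bytes, not the number of elements. */
size_t model_hc_size(size_t num_ranges)
{
	return sizeof(struct model_pa_range) * num_ranges;
}

/*
 * a2 must be zero for now (from the commit message). Requiring a non-empty
 * array made of whole elements is an assumption of this sketch.
 */
bool model_args_valid(size_t size, uint64_t flags)
{
	if (flags != 0)
		return false;
	if (size == 0 || size % sizeof(struct model_pa_range) != 0)
		return false;
	return true;
}

/* True if a write to gfn must be denied: gfn lies inside a NOWRITE range. */
bool model_write_denied(const struct model_pa_range *ranges,
			size_t num_ranges, uint64_t gfn)
{
	size_t i;

	for (i = 0; i < num_ranges; i++) {
		if (!(ranges[i].attributes & MODEL_ATTR_MEM_NOWRITE))
			continue;
		if (gfn >= ranges[i].gfn_start && gfn <= ranges[i].gfn_end)
			return true;
	}
	return false;
}
```

The range test mirrors the start/end comparison in heki_page_track_prewrite() above: both bounds are inclusive, so a single-page range has gfn_start equal to gfn_end.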
[PATCH v1 6/9] KVM: x86: Add Heki hypervisor support
From: Madhavan T. Venkataraman Each supported hypervisor in x86 implements a struct x86_hyper_init to define the init functions for the hypervisor. Define a new init_heki() entry point in struct x86_hyper_init. Hypervisors that support Heki must define this init_heki() function. Call init_heki() of the chosen hypervisor in init_hypervisor_platform(). Create a heki_hypervisor structure that each hypervisor can fill with its data and functions. This will allow the Heki feature to work in a hypervisor agnostic way. Declare and initialize a "heki_hypervisor" structure for KVM so KVM can support Heki. Define the init_heki() function for KVM. In init_heki(), set the hypervisor field in the generic "heki" structure to the KVM "heki_hypervisor". After this point, generic Heki code can access the KVM Heki data and functions. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Co-developed-by: Mickaël Salaün Signed-off-by: Mickaël Salaün Signed-off-by: Madhavan T. 
Venkataraman Link: https://lore.kernel.org/r/20230505152046.6575-7-...@digikod.net --- arch/x86/include/asm/x86_init.h | 2 + arch/x86/kernel/cpu/hypervisor.c | 1 + arch/x86/kernel/kvm.c| 72 arch/x86/kernel/x86_init.c | 1 + arch/x86/kvm/Kconfig | 1 + virt/heki/Kconfig| 9 +++- virt/heki/heki.c | 6 --- 7 files changed, 85 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index c1c8c581759d..0fc5041a66c6 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -119,6 +119,7 @@ struct x86_init_pci { * @msi_ext_dest_id: MSI supports 15-bit APIC IDs * @init_mem_mapping: setup early mappings during init_mem_mapping() * @init_after_bootmem:guest init after boot allocator is finished + * @init_heki: Hypervisor enforced kernel integrity */ struct x86_hyper_init { void (*init_platform)(void); @@ -127,6 +128,7 @@ struct x86_hyper_init { bool (*msi_ext_dest_id)(void); void (*init_mem_mapping)(void); void (*init_after_bootmem)(void); + void (*init_heki)(void); }; /** diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c index 553bfbfc3a1b..6085c8129e0c 100644 --- a/arch/x86/kernel/cpu/hypervisor.c +++ b/arch/x86/kernel/cpu/hypervisor.c @@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void) x86_hyper_type = h->type; x86_init.hyper.init_platform(); + x86_init.hyper.init_heki(); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 1cceac5984da..e53cebdcf3ac 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -866,6 +867,45 @@ static void __init kvm_guest_init(void) hardlockup_detector_disable(); } +#ifdef CONFIG_HEKI + +static int kvm_protect_ranges(struct heki_pa_range *ranges, int num_ranges) +{ + size_t size; + long err; + + WARN_ON(in_interrupt()); + + size = sizeof(ranges[0]) * num_ranges; + err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, 
__pa(ranges), size, 0);
+	if (WARN(err, "Failed to enforce memory protection: %ld\n", err))
+		return err;
+
+	return 0;
+}
+
+extern unsigned long cr4_pinned_mask;
+
+/*
+ * TODO: Check SMP policy consistency, e.g. with
+ * this_cpu_read(cpu_tlbstate.cr4)
+ */
+static int kvm_lock_crs(void)
+{
+	unsigned long cr4;
+	int err;
+
+	err = kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP);
+	if (err)
+		return err;
+
+	cr4 = __read_cr4();
+	err = kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask);
+	return err;
+}
+
+#endif /* CONFIG_HEKI */
+
 static noinline uint32_t __kvm_cpuid_base(void)
 {
 	if (boot_cpu_data.cpuid_level < 0)
@@ -999,6 +1039,37 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_HEKI
+
+static struct heki_hypervisor kvm_heki_hypervisor = {
+	.protect_ranges = kvm_protect_ranges,
+	.lock_crs = kvm_lock_crs,
+};
+
+static void kvm_init_heki(void)
+{
+	long err;
+
+	if (!kvm_para_available())
+		/* Cannot make KVM hypercalls. */
+		return;
+
+	err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1);
+	if (err == -KVM_ENOSYS)
+		/* Ignores host. */
+		return;
+
+	heki.hypervisor = &kvm_heki_hypervisor;
+}
+
+#else /* CONFIG_HEKI */
+
+static void kvm_init_heki(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
 const __initconst str
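
The backend registration described in this patch's commit message can be sketched in plain userspace C as follows. This is an illustration under assumptions, not the series' code: every model_ name is hypothetical, the stand-in backend only counts calls instead of issuing hypercalls, and the order in which the generic code invokes the two callbacks is a guess. What is taken from the patch is the shape of the ops table (protect_ranges and lock_crs) and the fact that the generic "heki" structure only gets a backend when the probe succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct heki_pa_range. */
struct model_pa_range {
	unsigned long gfn_start;
	unsigned long gfn_end;
	unsigned long attributes;
};

/* Ops table each hypervisor backend fills in (same two hooks as the patch). */
struct model_heki_hypervisor {
	int (*protect_ranges)(struct model_pa_range *ranges, int num_ranges);
	int (*lock_crs)(void);
};

/* Generic state: stays NULL until a backend registers itself. */
struct model_heki {
	struct model_heki_hypervisor *hypervisor;
};

struct model_heki model_heki_state;
int model_calls; /* counts backend invocations, for demonstration only */

int model_kvm_protect_ranges(struct model_pa_range *ranges, int num_ranges)
{
	(void)ranges;
	model_calls += num_ranges;
	return 0;
}

int model_kvm_lock_crs(void)
{
	model_calls++;
	return 0;
}

struct model_heki_hypervisor model_kvm_backend = {
	.protect_ranges = model_kvm_protect_ranges,
	.lock_crs = model_kvm_lock_crs,
};

/* Mirrors kvm_init_heki(): register only if the host knows the hypercall. */
void model_init_heki(int hypercall_supported)
{
	if (!hypercall_supported)
		return;
	model_heki_state.hypervisor = &model_kvm_backend;
}

/* Generic caller: hypervisor-agnostic, returns -1 when no backend is set. */
int model_heki_late_init(struct model_pa_range *ranges, int num_ranges)
{
	struct model_heki_hypervisor *hv = model_heki_state.hypervisor;
	int err;

	if (!hv)
		return -1;
	err = hv->protect_ranges(ranges, num_ranges);
	if (err)
		return err;
	return hv->lock_crs();
}
```

The point of the indirection is that the generic caller never names KVM; another hypervisor would provide its own table and registration function.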
[PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
Add a new page tracking mode to deny a page update and throw a page fault to the guest. This is useful for KVM to be able to make some pages non-writable (not read-only because it doesn't imply execution restrictions), see the next Heki commits. This kind of synthetic kernel page fault needs to be handled by the guest, which is not currently the case, making it try again and again. This will be part of a follow-up patch series. Update emulator_read_write_onepage() to handle X86EMUL_CONTINUE and X86EMUL_PROPAGATE_FAULT. Update page_fault_handle_page_track() to call kvm_slot_page_track_is_active() whenever this is required for KVM_PAGE_TRACK_PREWRITE and KVM_PAGE_TRACK_WRITE, even if one tracker already returned true. Invert the return code semantic for read_emulate() and write_emulate(): - from 1=Ok 0=Error - to X86EMUL_* return codes (e.g. X86EMUL_CONTINUE == 0) Imported the prewrite page tracking support part originally written by Mihai Donțu, Marian Rotariu, and Ștefan Șicleru: https://lore.kernel.org/r/20211006173113.26445-27-ala...@bitdefender.com https://lore.kernel.org/r/20211006173113.26445-28-ala...@bitdefender.com Removed the GVA changes for page tracking, removed the X86EMUL_RETRY_INSTR case, and some emulation part for now. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. 
Venkataraman Cc: Marian Rotariu Cc: Mihai Donțu Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Cc: Ștefan Șicleru Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-3-...@digikod.net --- arch/x86/include/asm/kvm_page_track.h | 12 + arch/x86/kvm/mmu/mmu.c| 64 ++- arch/x86/kvm/mmu/page_track.c | 33 +- arch/x86/kvm/mmu/spte.c | 6 +++ arch/x86/kvm/x86.c| 27 +++ 5 files changed, 122 insertions(+), 20 deletions(-) diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h index eb186bc57f6a..a7fb4ff888e6 100644 --- a/arch/x86/include/asm/kvm_page_track.h +++ b/arch/x86/include/asm/kvm_page_track.h @@ -3,6 +3,7 @@ #define _ASM_X86_KVM_PAGE_TRACK_H enum kvm_page_track_mode { + KVM_PAGE_TRACK_PREWRITE, KVM_PAGE_TRACK_WRITE, KVM_PAGE_TRACK_MAX, }; @@ -22,6 +23,16 @@ struct kvm_page_track_notifier_head { struct kvm_page_track_notifier_node { struct hlist_node node; + /* +* It is called when guest is writing the write-tracked page +* and the write emulation didn't happened yet. +* +* @vcpu: the vcpu where the write access happened +* @gpa: the physical address written by guest +* @node: this nodet +*/ + bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa, + struct kvm_page_track_notifier_node *node); /* * It is called when guest is writing the write-tracked page * and write emulation is finished at that time. 
@@ -73,6 +84,7 @@ kvm_page_track_register_notifier(struct kvm *kvm, void kvm_page_track_unregister_notifier(struct kvm *kvm, struct kvm_page_track_notifier_node *n); +bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa); void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes); void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 835426254e76..e5d1e241ff0f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -793,9 +793,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) slot = __gfn_to_memslot(slots, gfn); /* the non-leaf shadow pages are keeping readonly. */ - if (sp->role.level > PG_LEVEL_4K) - return kvm_slot_page_track_add_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); + if (sp->role.level > PG_LEVEL_4K) { + kvm_slot_page_track_add_page(kvm, slot, gfn, +KVM_PAGE_TRACK_PREWRITE); + kvm_slot_page_track_add_page(kvm, slot, gfn, +KVM_PAGE_TRACK_WRITE); + return; + } kvm_mmu_gfn_disallow_lpage(slot, gfn); @@ -840,9 +844,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) gfn = sp->gfn; slots = kvm_memslots_for_spte_role(kvm, sp->role); slot = __gfn_to_memslot(slots, gfn); - if (sp->role.level > PG_LEVEL_4K) - return kvm_slot_page_track_remove_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); + if (sp->role.level > PG_LEVEL_4K) { + kvm_s
[PATCH v1 3/9] virt: Implement Heki common code
From: Madhavan T. Venkataraman

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use the hypervisor to enhance guest virtual machine security.

Configuration
=============

Define the config variables for the feature. This feature depends on support from the architecture as well as the hypervisor.

Enabling HEKI
=============

Define a kernel command line parameter "heki" to turn the feature on or off. By default, Heki is on.

Feature initialization
======================

The linker script, vmlinux.lds.S, defines a number of sections that are loaded in kernel memory. Each of these sections has its own permissions. For instance, .text has HEKI_ATTR_MEM_EXEC | HEKI_ATTR_MEM_NOWRITE, and .rodata has HEKI_ATTR_MEM_NOWRITE.

Define an architecture specific init function, heki_arch_init(). In this function, collect the ranges of all of the sections. These sections will be protected in the host page table with their respective permissions so that even if the guest kernel is compromised, their permissions cannot be changed.

Define heki_early_init() to initialize the feature. For now, this function just checks if the feature is enabled and calls heki_arch_init().

Define heki_late_init() that protects the sections in the host page table. This needs hypervisor support which will be introduced in the future. This function is called at the end of kernel init.

Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Mickaël Salaün Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Madhavan T.
Venkataraman Link: https://lore.kernel.org/r/20230505152046.6575-4-...@digikod.net --- Kconfig | 2 + arch/x86/Kconfig| 1 + arch/x86/include/asm/sections.h | 4 + arch/x86/kernel/setup.c | 49 include/linux/heki.h| 90 + init/main.c | 3 + virt/Makefile | 1 + virt/heki/Kconfig | 22 ++ virt/heki/Makefile | 3 + virt/heki/heki.c| 135 10 files changed, 310 insertions(+) create mode 100644 include/linux/heki.h create mode 100644 virt/heki/Kconfig create mode 100644 virt/heki/Makefile create mode 100644 virt/heki/heki.c diff --git a/Kconfig b/Kconfig index 745bc773f567..0c844d9bcb03 100644 --- a/Kconfig +++ b/Kconfig @@ -29,4 +29,6 @@ source "lib/Kconfig" source "lib/Kconfig.debug" +source "virt/heki/Kconfig" + source "Documentation/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3604074a878b..5cf5a7a97811 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -297,6 +297,7 @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOTif EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE + select ARCH_SUPPORTS_HEKI if X86_64 config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h index a6e8373a5170..42ef1e33b8a5 100644 --- a/arch/x86/include/asm/sections.h +++ b/arch/x86/include/asm/sections.h @@ -18,6 +18,10 @@ extern char __end_of_kernel_reserve[]; extern unsigned long _brk_start, _brk_end; +extern int __start_orc_unwind_ip[], __stop_orc_unwind_ip[]; +extern struct orc_entry __start_orc_unwind[], __stop_orc_unwind[]; +extern unsigned int orc_lookup[], orc_lookup_end[]; + static inline bool arch_is_kernel_initmem_freed(unsigned long addr) { /* diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 88188549647c..f0ddaf24ab63 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -850,6 +851,54 @@ static void __init x86_report_nx(void) } } +#ifdef CONFIG_HEKI + +/* + * Gather all 
of the statically defined sections so heki_late_init() can + * protect these sections in the host page table. + * + * The sections are defined under "SECTIONS" in vmlinux.lds.S + * Keep this array in sync with SECTIONS. + */ +struct heki_va_range __initdata heki_va_ranges[] = { + { + .va_start = _stext, + .va_end = _etext, + .attributes = HEKI_ATTR_MEM_NOWRITE | HEKI_ATTR_MEM_EXEC, + }, + { + .va_start = __start_rodata, + .va_end = __end_rodata, + .attributes = HEKI_ATTR_MEM_NOWRITE, + }, +#ifdef CONFIG_UNWINDER_ORC + { + .va_start = __start_orc_unwind_ip, + .va_end = __stop_orc_unwind_ip, + .attributes = HEKI_ATTR_MEM_NOWRITE, + }, + { + .va_start = __start_orc_unwind, + .va_end = __stop_orc
[PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers
This enables guests to lock their CR0 and CR4 registers with a subset of X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE and X86_CR4_CET flags. The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first is to identify the control register, and the second is a bit mask to pin (i.e. mark as read-only). These register flags should already be pinned by Linux guests, but once compromised, this self-protection mechanism could be disabled, which is not the case with this dedicated hypercall. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-6-...@digikod.net --- Documentation/virt/kvm/x86/hypercalls.rst | 15 + arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kvm/vmx/vmx.c| 10 arch/x86/kvm/x86.c| 72 +++ arch/x86/kvm/x86.h| 16 + include/linux/kvm_host.h | 3 + include/uapi/linux/kvm_para.h | 1 + 7 files changed, 118 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst index 0ec79cc77f53..8aa5d28986e3 100644 --- a/Documentation/virt/kvm/x86/hypercalls.rst +++ b/Documentation/virt/kvm/x86/hypercalls.rst @@ -207,3 +207,18 @@ identified with set of physical page ranges (GFNs). The HEKI_ATTR_MEM_NOWRITE memory page range attribute forbids related modification to the guest. Returns 0 on success or a KVM error code otherwise. + +10. KVM_HC_LOCK_CR_UPDATE +- + +:Architecture: x86 +:Status: active +:Purpose: Request some control registers to be restricted. + +- a0: identify a control register +- a1: bit mask to make some flags read-only + +The hypercall lets a guest request control register flags to be pinned for +itself. + +Returns 0 on success or a KVM error code otherwise. 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index f3cc7699e1e1..dd89379fe5ac 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -413,7 +413,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c) } /* These bits should not change their value after CPU init is finished. */ -static const unsigned long cr4_pinned_mask = +const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP | X86_CR4_FSGSBASE | X86_CR4_CET; static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9870db887a62..931688edc8eb 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3162,6 +3162,11 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long hw_cr0, old_cr0_pg; u32 tmp; + int res; + + res = heki_check_cr(vcpu->kvm, 0, cr0); + if (res) + return; old_cr0_pg = kvm_read_cr0_bits(vcpu, X86_CR0_PG); @@ -3323,6 +3328,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) * this bit, even if host CR4.MCE == 0. */ unsigned long hw_cr4; + int res; + + res = heki_check_cr(vcpu->kvm, 4, cr4); + if (res) + return; hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE); if (is_unrestricted_guest(vcpu)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ffab64d08de3..a529455359ac 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr) return value; } +#ifdef CONFIG_HEKI + +extern unsigned long cr4_pinned_mask; + +static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr, + unsigned long pin) +{ + if (!pin) + return -KVM_EINVAL; + + switch (cr) { + case 0: + /* Cf. 
arch/x86/kernel/cpu/common.c */
+		if (!(pin & X86_CR0_WP))
+			return -KVM_EINVAL;
+
+		if ((read_cr0() & pin) != pin)
+			return -KVM_EINVAL;
+
+		atomic_long_or(pin, &kvm->heki_pinned_cr0);
+		return 0;
+	case 4:
+		/* Checks for irrelevant bits. */
+		if ((pin & cr4_pinned_mask) != pin)
+			return -KVM_EINVAL;
+
+		/* Ignores bits not present in host. */
+		pin &= __read_cr4();
+		atomic_long_or(pin, &kvm->heki_pinned_cr4);
+		return 0;
+	}
+	return -KVM_EINVAL;
+}
+
+int heki_check_cr(const struct kvm *const kvm, const unsigned long cr,
+
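For readers who want to reason about the pin-then-check semantics without
a kernel tree, here is a small user-space model. It is only a sketch:
struct pin_state, pin_lock() and pin_check() are hypothetical names, and
the check rule (a write is rejected when it clears a pinned bit) is an
assumption, because the heki_check_cr() body is cut off in this archive.

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural bit positions of the flags named in the commit message. */
#define CR0_WP   (1UL << 16)
#define CR4_UMIP (1UL << 11)
#define CR4_SMEP (1UL << 20)
#define CR4_SMAP (1UL << 21)

/* Hypothetical stand-in for the per-VM pinned masks (not struct kvm). */
struct pin_state {
	unsigned long pinned_cr0;
	unsigned long pinned_cr4;
};

/*
 * Accumulates bits to pin. Pins can only grow, which mirrors the
 * atomic OR used by heki_lock_cr(). Returns 0 on success, -1 otherwise.
 */
static int pin_lock(struct pin_state *s, unsigned int cr, unsigned long pin)
{
	assert(s);
	if (!pin)
		return -1;

	switch (cr) {
	case 0:
		s->pinned_cr0 |= pin;
		return 0;
	case 4:
		s->pinned_cr4 |= pin;
		return 0;
	}
	return -1;
}

/* Assumed rule: a new value is accepted only if every pinned bit stays set. */
static bool pin_check(const struct pin_state *s, unsigned int cr,
		      unsigned long val)
{
	unsigned long pinned;

	assert(s);
	switch (cr) {
	case 0:
		pinned = s->pinned_cr0;
		break;
	case 4:
		pinned = s->pinned_cr4;
		break;
	default:
		return true; /* registers other than CR0/CR4 are not tracked */
	}
	return (val & pinned) == pinned;
}
```

In this model, once SMEP and SMAP are pinned, a CR4 value that drops SMEP
fails pin_check() while a value that only adds UMIP passes. The host-side
validation done by heki_lock_cr() (WP required for CR0, bits filtered
against the host CR4) is deliberately left out.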
[PATCH v1 1/9] KVM: x86: Add kvm_x86_ops.fault_gva()
This function is needed for kvm_mmu_page_fault() to create synthetic page faults. Code originally written by Mihai Donțu and Nicușor Cîțu: https://lore.kernel.org/r/20211006173113.26445-18-ala...@bitdefender.com Renamed fault_gla() to fault_gva() and use the new EPT_VIOLATION_GVA_IS_VALID. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Co-developed-by: Mihai Donțu Signed-off-by: Mihai Donțu Co-developed-by: Nicușor Cîțu Signed-off-by: Nicușor Cîțu Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-2-...@digikod.net --- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/kvm/svm/svm.c | 9 + arch/x86/kvm/vmx/vmx.c | 10 ++ 4 files changed, 22 insertions(+) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index abccd51dcfca..b761182a9444 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed) KVM_X86_OP(complete_emulated_msr) KVM_X86_OP(vcpu_deliver_sipi_vector) KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons); +KVM_X86_OP(fault_gva) #undef KVM_X86_OP #undef KVM_X86_OP_OPTIONAL diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6aaae18f1854..f319bcdeb8bd 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1706,6 +1706,8 @@ struct kvm_x86_ops { * Returns vCPU specific APICv inhibit reasons */ unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu); + + u64 (*fault_gva)(struct kvm_vcpu *vcpu); }; struct kvm_x86_nested_ops { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9a194aa1a75a..8b47b38aaf7f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4700,6 +4700,13 @@ static int svm_vm_init(struct kvm 
*kvm) return 0; } +static u64 svm_fault_gva(struct kvm_vcpu *vcpu) +{ + const struct vcpu_svm *svm = to_svm(vcpu); + + return svm->vcpu.arch.cr2 ? svm->vcpu.arch.cr2 : ~0ull; +} + static struct kvm_x86_ops svm_x86_ops __initdata = { .name = "kvm_amd", @@ -4826,6 +4833,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector, .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons, + + .fault_gva = svm_fault_gva, }; /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 7eec0226d56a..9870db887a62 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -8067,6 +8067,14 @@ static void vmx_vm_destroy(struct kvm *kvm) free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm)); } +static u64 vmx_fault_gva(struct kvm_vcpu *vcpu) +{ + if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_IS_VALID) + return vmcs_readl(GUEST_LINEAR_ADDRESS); + + return ~0ull; +} + static struct kvm_x86_ops vmx_x86_ops __initdata = { .name = "kvm_intel", @@ -8204,6 +8212,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .complete_emulated_msr = kvm_complete_insn_gp, .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector, + + .fault_gva = vmx_fault_gva, }; static unsigned int vmx_handle_intel_pt_intr(void) -- 2.40.1
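As an illustration of the contract both callbacks follow (return the
faulting guest linear address, or ~0ull when none is available), here is
a user-space sketch of the VMX variant. struct exit_info,
model_fault_gva(), gva_is_usable() and INVALID_GVA are hypothetical
names; the bit number comes from EPT_VIOLATION_GVA_IS_VALID_BIT in
arch/x86/include/asm/vmx.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EPT_VIOLATION_GVA_IS_VALID_BIT is 7 in arch/x86/include/asm/vmx.h. */
#define GVA_IS_VALID (UINT64_C(1) << 7)
/* Hypothetical name for the "no address" sentinel used by the patch. */
#define INVALID_GVA  (~UINT64_C(0))

/* Hypothetical snapshot of the two values the VMX callback reads. */
struct exit_info {
	uint64_t exit_qualification;
	uint64_t guest_linear_address;
};

/* Mirrors vmx_fault_gva(): trust the address only if the valid bit is set. */
static uint64_t model_fault_gva(const struct exit_info *e)
{
	assert(e);
	if (e->exit_qualification & GVA_IS_VALID)
		return e->guest_linear_address;

	return INVALID_GVA;
}

/* A caller has to test for the sentinel before using the address. */
static bool gva_is_usable(uint64_t gva)
{
	return gva != INVALID_GVA;
}
```

Note that the SVM variant above tests cr2 for zero, so a fault at linear
address 0 is also reported as ~0ull there.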
[RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
]). Extending register pinning (e.g., MSRs). Being able to protect
nested guests might be possible but we need to figure out the potential
security implications.

Protecting the host would be useful, but that doesn't really fit with
the KVM model. The Protected KVM project is a first step to help in this
direction [11].

We only tested this with an Intel CPU, but this approach should work the
same with an AMD CPU starting with the Zen 2 generation and their Guest
Mode Execute Trap (GMET) capability.

We also kept some TODOs to highlight missing checks and code sharing
issues, and some pr_warn() calls to help understand how it works. Tests
need to be improved (e.g., invalid hypercall arguments).

We'll present this work at the Linux Security Summit North America next
week.

[1] https://lore.kernel.org/all/20211006173113.26445-1-ala...@bitdefender.com/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.ander...@intel.com/
[4] https://github.com/intel/vbh
[5] https://sched.co/TmwN
[6] https://sched.co/eE3f
[7] https://lore.kernel.org/all/20200501185147.208192-1-yua...@google.com/
[8] https://sched.co/eE4F
[9] https://lore.kernel.org/kvm/20191003212400.31130-1-rick.p.edgeco...@intel.com/
[10] https://lpc.events/event/4/contributions/283/
[11] https://sched.co/eE24

Please reach out to us by replying to this thread, we're looking for
people to join and collaborate on this project!

Regards,

Madhavan T.
Venkataraman (2):
  virt: Implement Heki common code
  KVM: x86: Add Heki hypervisor support

Mickaël Salaün (7):
  KVM: x86: Add kvm_x86_ops.fault_gva()
  KVM: x86/mmu: Add support for prewrite page tracking
  KVM: x86: Add new hypercall to set EPT permissions
  KVM: x86: Add new hypercall to lock control registers
  KVM: VMX: Add MBEC support
  KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC
  virt: Add Heki KUnit tests

 Documentation/virt/kvm/x86/hypercalls.rst |  34 +++
 Kconfig                                   |   2 +
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/asm/kvm-x86-ops.h        |   1 +
 arch/x86/include/asm/kvm_host.h           |   2 +
 arch/x86/include/asm/kvm_page_track.h     |  12 +
 arch/x86/include/asm/sections.h           |   4 +
 arch/x86/include/asm/vmx.h                |  11 +-
 arch/x86/include/asm/x86_init.h           |   2 +
 arch/x86/kernel/cpu/common.c              |   2 +-
 arch/x86/kernel/cpu/hypervisor.c          |   1 +
 arch/x86/kernel/kvm.c                     |  72 +
 arch/x86/kernel/setup.c                   |  49 +++
 arch/x86/kernel/x86_init.c                |   1 +
 arch/x86/kvm/Kconfig                      |   1 +
 arch/x86/kvm/mmu.h                        |   3 +-
 arch/x86/kvm/mmu/mmu.c                    | 105 ++-
 arch/x86/kvm/mmu/mmutrace.h               |  11 +-
 arch/x86/kvm/mmu/page_track.c             |  33 +-
 arch/x86/kvm/mmu/paging_tmpl.h            |  16 +-
 arch/x86/kvm/mmu/spte.c                   |  29 +-
 arch/x86/kvm/mmu/spte.h                   |  15 +-
 arch/x86/kvm/mmu/tdp_mmu.c                |  73 +
 arch/x86/kvm/mmu/tdp_mmu.h                |   4 +
 arch/x86/kvm/svm/svm.c                    |   9 +
 arch/x86/kvm/vmx/capabilities.h           |   7 +
 arch/x86/kvm/vmx/nested.c                 |   7 +
 arch/x86/kvm/vmx/vmx.c                    |  48 ++-
 arch/x86/kvm/vmx/vmx.h                    |   1 +
 arch/x86/kvm/x86.c                        | 352 +-
 arch/x86/kvm/x86.h                        |  23 ++
 include/linux/heki.h                      |  90 ++
 include/linux/kvm_host.h                  |  20 ++
 include/uapi/linux/kvm_para.h             |   2 +
 init/main.c                               |   3 +
 virt/Makefile                             |   1 +
 virt/heki/Kconfig                         |  41 +++
 virt/heki/Makefile                        |   3 +
 virt/heki/heki.c                          | 321
 virt/kvm/kvm_main.c                       |   5 +
 40 files changed, 1377 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/heki.c

base-commit: c9c3395d5e3dcc6daee66c6908354d47bf98cb0c
-- 
2.40.1
[PATCH v1 7/9] KVM: VMX: Add MBEC support
This change adds support for VMX_FEATURE_MODE_BASED_EPT_EXEC (named
ept_mode_based_exec in /proc/cpuinfo and MBEC elsewhere), which makes it
possible to separate EPT execution bits for supervisor vs. user. It
changes the semantics of VMX_EPT_EXECUTABLE_MASK from global execution
to kernel execution, and uses the VMX_EPT_USER_EXECUTABLE_MASK bit to
identify user execution.

The main use case is to be able to restrict kernel execution while
ignoring user space execution from the hypervisor point of view. Indeed,
user space execution can already be restricted by the guest kernel.

This change enables MBEC but doesn't change the default configuration,
which is to allow execution for all guest memory. However, the next
commit leverages MBEC to restrict kernel memory pages.

MBEC can be configured with the new "enable_mbec" module parameter, set
to true by default. However, MBEC is disabled for L1 and L2 for now.

Replace EPT_VIOLATION_RWX_MASK (3 bits) with 4 dedicated
EPT_VIOLATION_READ, EPT_VIOLATION_WRITE, EPT_VIOLATION_KERNEL_INSTR, and
EPT_VIOLATION_USER_INSTR bits.

From the Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3C (System Programming Guide), Part 3:

SECONDARY_EXEC_MODE_BASED_EPT_EXEC (bit 22):
If either the "unrestricted guest" VM-execution control or the
"mode-based execute control for EPT" VM-execution control is 1, the
"enable EPT" VM-execution control must also be 1.

EPT_VIOLATION_KERNEL_INSTR_BIT (bit 5):
The logical-AND of bit 2 in the EPT paging-structure entries used to
translate the guest-physical address of the access causing the EPT
violation. If the "mode-based execute control for EPT" VM-execution
control is 0, this indicates whether the guest-physical address was
executable. If that control is 1, this indicates whether the
guest-physical address was executable for supervisor-mode linear
addresses.
EPT_VIOLATION_USER_INSTR_BIT (bit 6): If the "mode-based execute control" VM-execution control is 0, the value of this bit is undefined. If that control is 1, this bit is the logical-AND of bit 10 in the EPT paging-structures entries used to translate the guest-physical address of the access causing the EPT violation. In this case, it indicates whether the guest-physical address was executable for user-mode linear addresses. PT_USER_EXEC_MASK (bit 10): Execute access for user-mode linear addresses. If the "mode-based execute control for EPT" VM-execution control is 1, indicates whether instruction fetches are allowed from user-mode linear addresses in the 512-GByte region controlled by this entry. If that control is 0, this bit is ignored. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-8-...@digikod.net --- arch/x86/include/asm/vmx.h | 11 +-- arch/x86/kvm/mmu.h | 3 ++- arch/x86/kvm/mmu/mmu.c | 6 +- arch/x86/kvm/mmu/paging_tmpl.h | 16 ++-- arch/x86/kvm/mmu/spte.c | 4 +++- arch/x86/kvm/vmx/capabilities.h | 7 +++ arch/x86/kvm/vmx/nested.c | 7 +++ arch/x86/kvm/vmx/vmx.c | 28 +--- arch/x86/kvm/vmx/vmx.h | 1 + 9 files changed, 73 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 498dc600bd5c..452e7d153832 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -511,6 +511,7 @@ enum vmcs_field { #define VMX_EPT_IPAT_BIT (1ull << 6) #define VMX_EPT_ACCESS_BIT (1ull << 8) #define VMX_EPT_DIRTY_BIT (1ull << 9) +#define VMX_EPT_USER_EXECUTABLE_MASK (1ull << 10) #define VMX_EPT_RWX_MASK(VMX_EPT_READABLE_MASK | \ VMX_EPT_WRITABLE_MASK | \ VMX_EPT_EXECUTABLE_MASK) @@ -556,13 +557,19 @@ enum vm_entry_failure_code { #define EPT_VIOLATION_ACC_READ_BIT 0 #define 
EPT_VIOLATION_ACC_WRITE_BIT1 #define EPT_VIOLATION_ACC_INSTR_BIT2 -#define EPT_VIOLATION_RWX_SHIFT3 +#define EPT_VIOLATION_READ_BIT 3 +#define EPT_VIOLATION_WRITE_BIT4 +#define EPT_VIOLATION_KERNEL_INSTR_BIT 5 +#define EPT_VIOLATION_USER_INSTR_BIT 6 #define EPT_VIOLATION_GVA_IS_VALID_BIT 7 #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8 #define EPT_VIOLATION_ACC_READ (1 << EPT_VIOLATION_ACC_READ_BIT) #define EPT_VIOLATION_ACC_WRITE(1 << EPT_VIOLATION_ACC_WRITE_BIT) #define EPT_VIOLATION_ACC_INSTR(1 << EPT_VIOLATION_ACC_INSTR_BIT) -#define EPT_V
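To show how the four dedicated permission bits quoted above could be
consumed, here is a user-space sketch that decodes them from an exit
qualification value. The bit numbers are the ones added by the vmx.h
hunk; struct ept_perms, decode_ept_perms() and QUAL_BIT() are
hypothetical helpers, and treating user execution as equal to the single
execute bit when MBEC is off is an assumption made for the model (the
SDM text quoted in the commit message says bit 6 is undefined in that
case).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers taken from the vmx.h hunk of this patch. */
#define EPT_VIOLATION_READ_BIT         3
#define EPT_VIOLATION_WRITE_BIT        4
#define EPT_VIOLATION_KERNEL_INSTR_BIT 5
#define EPT_VIOLATION_USER_INSTR_BIT   6

/* Hypothetical helper: build a 64-bit mask from a bit number. */
#define QUAL_BIT(n) (UINT64_C(1) << (n))

static_assert(EPT_VIOLATION_USER_INSTR_BIT < 64, "bit must fit in 64 bits");

/* Hypothetical view of the permissions the EPT entries granted. */
struct ept_perms {
	bool read;
	bool write;
	bool kernel_exec;
	bool user_exec;
};

static struct ept_perms decode_ept_perms(uint64_t qual, bool mbec_enabled)
{
	struct ept_perms p;

	p.read = (qual & QUAL_BIT(EPT_VIOLATION_READ_BIT)) != 0;
	p.write = (qual & QUAL_BIT(EPT_VIOLATION_WRITE_BIT)) != 0;
	p.kernel_exec = (qual & QUAL_BIT(EPT_VIOLATION_KERNEL_INSTR_BIT)) != 0;
	/*
	 * Without MBEC, bit 6 is undefined and bit 5 covers all execution,
	 * so the model (by assumption) reuses bit 5 for user mode.
	 */
	p.user_exec = mbec_enabled ?
		(qual & QUAL_BIT(EPT_VIOLATION_USER_INSTR_BIT)) != 0 :
		p.kernel_exec;
	return p;
}
```

With MBEC enabled, a qualification value with bits 3 and 5 set decodes
to "readable, kernel-executable, not user-executable", which is the kind
of split the next patch relies on.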
[PATCH v1 8/9] KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC
This change makes it possible to enforce a deny-by-default execution
security policy for guest kernels, leveraged by the Heki implementation.

Create synthetic page faults when an access is denied by Heki. This kind
of kernel page fault needs to be handled by guests, which is not
currently the case, so such a guest retries the access again and again,
but we are working to calm down such guests by teaching them how to
handle such page faults.

The MMU tracepoints are updated to reflect the difference between kernel
and user space executions.

kvm_heki_fix_all_ept_exec_perm() walks through all guest memory pages to
set the configured default execution permissions (i.e. only allow
configured executable memory pages).

The struct heki_mem_range's attribute field now understands
HEKI_ATTR_MEM_EXEC, which allows the related kernel sections to be
executable, and denies any other kernel memory from being executable for
the whole lifetime of the guest. This obviously can only work with
static kernels and we are exploring ways to handle authenticated and
dynamic kernel memory permission updates.

If the host doesn't have MBEC enabled, the KVM_HC_LOCK_MEM_PAGE_RANGES
hypercall will return -KVM_EOPNOTSUPP and might only apply the previous
ranges, if any. This is useful to develop this RFC and make sure
execution restrictions are enforced (and not silently ignored), but this
behavior might change in a future patch series. Guest kernels could
check for MBEC support to not use the HEKI_ATTR_MEM_EXEC attribute.

The number of configurable memory ranges per guest is 16 for now. This
will change with a follow-up.

There are currently some pr_warn() calls to make it easy to test this
code.

Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T.
Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-9-...@digikod.net --- Documentation/virt/kvm/x86/hypercalls.rst | 4 +- arch/x86/kvm/mmu/mmu.c| 35 - arch/x86/kvm/mmu/mmutrace.h | 11 ++- arch/x86/kvm/mmu/spte.c | 19 - arch/x86/kvm/mmu/spte.h | 15 +++- arch/x86/kvm/mmu/tdp_mmu.c| 73 ++ arch/x86/kvm/mmu/tdp_mmu.h| 4 + arch/x86/kvm/x86.c| 90 ++- arch/x86/kvm/x86.h| 7 ++ include/linux/kvm_host.h | 4 + virt/kvm/kvm_main.c | 1 + 11 files changed, 250 insertions(+), 13 deletions(-) diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst index 8aa5d28986e3..5accf5f6de13 100644 --- a/Documentation/virt/kvm/x86/hypercalls.rst +++ b/Documentation/virt/kvm/x86/hypercalls.rst @@ -204,7 +204,9 @@ must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL. The hypercall lets a guest request memory permissions to be removed for itself, identified with set of physical page ranges (GFNs). The HEKI_ATTR_MEM_NOWRITE -memory page range attribute forbids related modification to the guest. +memory page range attribute forbids related modification to the guest. The +HEKI_ATTR_MEM_EXEC attribute allows execution on the specified pages while +removing it for all the others. Returns 0 on success or a KVM error code otherwise. 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a47e63217eb8..56a8bcac1b82 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3313,7 +3313,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) { if (fault->exec) - return is_executable_pte(spte); + return is_executable_pte(spte, !fault->user); if (fault->write) return is_writable_pte(spte); @@ -5602,6 +5602,39 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa))) return RET_PF_RETRY; + /* Skips real page faults if not needed. */ + if ((error_code & PFERR_FETCH_MASK) && + !kvm_heki_is_exec_allowed(vcpu, cr2_or_gpa)) { + /* +* TODO: To avoid kvm_heki_is_exec_allowed() call, check +* enable_mbec and EPT_VIOLATION_KERNEL_INSTR, see +* handle_ept_violation(). +*/ + struct x86_exception fault = { + .vector = PF_VECTOR, + .error_code_valid = true, + .error_code = error_code, + .nested_page_fault = false, + /* +* TODO: This kind of kernel page fault needs to be handled by +
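The is_access_allowed() change above passes !fault->user to
is_executable_pte(), whose updated body lives in spte.h and is not shown
in this excerpt. As a rough user-space model of what a mode-aware check
could look like when MBEC is active: bit 2 is the EPT execute bit and
bit 10 is VMX_EPT_USER_EXECUTABLE_MASK from patch 7/9, while
model_is_executable() and the MODEL_* names are hypothetical and not the
kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EPT entry bit 2: execute (supervisor-mode execute when MBEC is on). */
#define MODEL_EPT_EXEC      (UINT64_C(1) << 2)
/* EPT entry bit 10: user-mode execute, as defined in patch 7/9. */
#define MODEL_EPT_USER_EXEC (UINT64_C(1) << 10)

static_assert(MODEL_EPT_EXEC != MODEL_EPT_USER_EXEC,
	      "kernel and user execute bits must be distinct");

/*
 * Assumed behaviour with MBEC active: pick the execute bit that matches
 * the privilege level of the faulting fetch.
 */
static bool model_is_executable(uint64_t spte, bool for_kernel_mode)
{
	const uint64_t mask = for_kernel_mode ? MODEL_EPT_EXEC :
						MODEL_EPT_USER_EXEC;

	return (spte & mask) != 0;
}
```

Under this model, an entry with only bit 10 set lets user space execute
the page while a kernel-mode fetch from the same page is refused, which
matches the stated goal of restricting kernel execution while leaving
user space execution to the guest kernel.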
[PATCH v1 9/9] virt: Add Heki KUnit tests
This adds a new CONFIG_HEKI_TEST option to run tests at boot. Because
this patch series forbids the loading of kernel modules after boot, the
tests need to be built in. Furthermore, because we use some symbols not
exported to modules (e.g., kernel_set_to_readonly), they could not work
as modules.

To run these tests, boot the kernel with the heki_test=N boot argument,
with N selecting a specific test:
1. heki_test_cr_disable_smep: Check CR pinning and try to disable SMEP.
2. heki_test_write_to_const: Check .rodata (const) protection.
3. heki_test_write_to_ro_after_init: Check __ro_after_init protection.
4. heki_test_exec: Check non-executable kernel memory.

This way of selecting tests should no longer be required once the kernel
properly handles the triggered synthetic page faults. For now, these
page faults make the kernel loop.

All these tests temporarily disable the related kernel self-protections
and should then fail if Heki doesn't protect the kernel. They are
verbose to make it easier to understand what is going on.

Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-10-...@digikod.net --- virt/heki/Kconfig | 12 +++ virt/heki/heki.c | 194 +- 2 files changed, 205 insertions(+), 1 deletion(-) diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig index 96f18ce03013..806981f2b22d 100644 --- a/virt/heki/Kconfig +++ b/virt/heki/Kconfig @@ -27,3 +27,15 @@ config HYPERVISOR_SUPPORTS_HEKI A hypervisor should select this when it can successfully build and run with CONFIG_HEKI. That is, it should provide all of the hypervisor support required for the Heki feature.
+ +config HEKI_TEST + bool "Tests for Heki" if !KUNIT_ALL_TESTS + depends on HEKI && KUNIT=y + default KUNIT_ALL_TESTS + help + Run Heki tests at runtime according to the heki_test=N boot + parameter, with N identifying the test to run (between 1 and 4). + + Before launching the init process, the system might not respond + because of unhandled kernel page fault. This will be fixed in a + next patch series. diff --git a/virt/heki/heki.c b/virt/heki/heki.c index 142b5dc98a2f..361e7734e950 100644 --- a/virt/heki/heki.c +++ b/virt/heki/heki.c @@ -5,11 +5,13 @@ * Copyright © 2023 Microsoft Corporation */ +#include #include #include #include #include #include +#include #include #include @@ -78,13 +80,201 @@ void __init heki_early_init(void) heki_arch_init(); } +#ifdef CONFIG_HEKI_TEST + +/* Heki test data */ + +/* Takes two pages to not change permission of other read-only pages. */ +const char heki_test_const_buf[PAGE_SIZE * 2] = {}; +char heki_test_ro_after_init_buf[PAGE_SIZE * 2] __ro_after_init = {}; + +long heki_test_exec_data(long); +void _test_exec_data_end(void); + +/* Used to test ROP execution against the .rodata section. */ +/* clang-format off */ +asm( +".pushsection .rodata;" // NOT .text section +".global heki_test_exec_data;" +".type heki_test_exec_data, @function;" +"heki_test_exec_data:" +ASM_ENDBR +"movq %rdi, %rax;" +"inc %rax;" +ASM_RET +".size heki_test_exec_data, .-heki_test_exec_data;" +"_test_exec_data_end:" +".popsection"); +/* clang-format on */ + +static void heki_test_cr_disable_smep(struct kunit *test) +{ + unsigned long cr4; + + /* SMEP should be initially enabled. */ + KUNIT_ASSERT_TRUE(test, __read_cr4() & X86_CR4_SMEP); + + kunit_warn(test, + "Starting control register pinning tests with SMEP check\n"); + + /* +* Trying to disable SMEP, bypassing kernel self-protection by not +* using cr4_clear_bits(X86_CR4_SMEP). 
+	 */
+	cr4 = __read_cr4() & ~X86_CR4_SMEP;
+	asm volatile("mov %0,%%cr4" : "+r"(cr4) : : "memory");
+
+	/* SMEP should still be enabled. */
+	KUNIT_ASSERT_TRUE(test, __read_cr4() & X86_CR4_SMEP);
+}
+
+static inline void print_addr(struct kunit *test, const char *const buf_name,
+			      void *const buf)
+{
+	const pte_t pte = *virt_to_kpte((unsigned long)buf);
+	const phys_addr_t paddr = slow_virt_to_phys(buf);
+	bool present = pte_flags(pte) & (_PAGE_PRESENT);
+	bool accessible = pte_accessible(&init_mm, pte);
+
+	kunit_warn(
+		test,
+		"%s vaddr:%llx paddr:%llx exec:%d write:%d present:%d accessible:%d\n",
+		buf_name, (unsigned long long)buf, paddr, !!pte_exec(pte),
+		!!pte_write(pte), present, acces