Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-06-03 Thread Mickaël Salaün
On Wed, May 15, 2024 at 01:32:24PM -0700, Sean Christopherson wrote:
> On Tue, May 14, 2024, Mickaël Salaün wrote:
> > On Fri, May 10, 2024 at 10:07:00AM +, Nicolas Saenz Julienne wrote:
> > > Development happens
> > > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > > branch, but I'd advise against looking into it until we add some order
> > > to the rework. Regardless, feel free to get in touch.
> > 
> > Thanks for the update.
> > 
> > Could we schedule a PUCK meeting to synchronize and help each other?
> > What about June 12?
> 
> June 12th works on my end.

Can you please send an invite?

 Mickaël



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-14 Thread Mickaël Salaün
On Fri, May 10, 2024 at 10:07:00AM +, Nicolas Saenz Julienne wrote:
> On Tue May 7, 2024 at 4:16 PM UTC, Sean Christopherson wrote:
> > > If yes, that would indeed require a *lot* of work for something we're not
> > > sure will be accepted later on.
> >
> > Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
> > is trending toward the paired VM+vCPU model.  IMO, it's entirely feasible to
> > design KVM support such that much of the development load can be shared between
> > the projects.  And having 2+ use cases for a feature (set) makes it _much_ more
> > likely that the feature(s) will be accepted.
> 
> Since Sean mentioned our VSM efforts, a small update. We were able to
> validate the concept of one KVM VM per VTL as discussed in LPC. Right
> now only for single CPU guests, but are in the late stages of bringing
> up MP support. The resulting KVM code is small, and most will be
> uncontroversial (I hope). If other obligations allow it, we plan on
> having something suitable for review in the coming months.

Looks good!

> 
> Our implementation aims to implement all the VSM spec necessary to run
> with Microsoft Credential Guard. But note that some aspects necessary
> for HVCI are not covered, especially the ones that depend on MBEC
> support, or some categories of secure intercepts.

We already implemented support for MBEC, so that should not be an issue.
We just need to find the best interface to configure it.

> 
> Development happens
> https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> branch, but I'd advise against looking into it until we add some order
> to the rework. Regardless, feel free to get in touch.

Thanks for the update.

Could we schedule a PUCK meeting to synchronize and help each other?
What about June 12?



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-14 Thread Mickaël Salaün
On Tue, May 07, 2024 at 09:16:06AM -0700, Sean Christopherson wrote:
> On Tue, May 07, 2024, Mickaël Salaün wrote:
> > > Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> > > Linux already knows what assets it wants to (un)protect and when.  What's missing
> > > is a way for the guest kernel to effectively deprivilege and re-authenticate
> > > itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> > > support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, with a
> > > bit of pKVM mixed in?
> > > 
> > > Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
> > > the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
> > > userspace creates VTL1.  Userspace also provides a way for VTL0 to restrict access to
> > > its memory, e.g. to effectively make the page tables for the kernel's direct map
> > > writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then
> > > also completely remove its access to code that changes CR0/CR4.
> > > 
> > > It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
> > > text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
> > > be doable with annotations, e.g. tag relevant functions with __magic or whatever,
> > > throw them in a dedicated section, and then free/protect the section(s) at the
> > > appropriate time.
> > > 
> > > KVM would likely need to provide the ability to switch VTLs (or whatever they get
> > > called), and host userspace would need to provide a decent amount of the backend
> > > mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn off?)
> > > VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
> > > E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code into
> > > VTL1.  That way VTL1 can verify the kexec() target and tear itself down before
> > > jumping into the new kernel.
> > > 
> > > This is very off the cuff and hand-wavy, e.g. I don't have much of an idea what
> > > it would take to harden kernel text patching, but keeping the policy in the guest
> > > seems like it'd make everything more tractable than trying to define an ABI
> > > between Linux and a VMM that is rich and flexible enough to support all the
> > > fancy things Linux does (and will do in the future).
> > 
> > Yes, we agree that the guest needs to manage its own policy.  That's why
> > we implemented Heki for KVM this way, but without VTLs because KVM
> > doesn't support them.
> > 
> > To sum up, is the VTL approach the only one that would be acceptable for
> > KVM?  
> 
> Heh, that's not a question you want to be asking.  You're effectively asking me
> to make an authoritative, "final" decision on a topic which I am only passingly
> familiar with.
> 
> But since you asked it... :-)  Probably?
> 
> I see a lot of advantages to a VTL/VSM-like approach:
> 
>  1. Provides Linux-as-a-guest the flexibility it needs to meaningfully advance
>     its security, with the least amount of policy built into the guest/host ABI.
> 
>  2. Largely decouples guest policy from the host, i.e. should allow the guest to
>     evolve/update its policy without needing to coordinate changes with the host.
> 
>  3. The KVM implementation can be generic enough to be reusable for other features.
> 
>  4. Other groups are already working on VTL-like support in KVM, e.g. for VSM
>     itself, and potentially for VMPL/SVSM support.
> 
> IMO, #2 is a *huge* selling point.  Not having to coordinate changes across
> multiple code bases and/or organizations and/or maintainers is a big win for
> velocity, long term maintenance, and probably the very viability of HEKI.

Agree, this is our goal.

> 
> Providing the guest with the tools to define and implement its own policy means
> end users don't have to wait for some third party, e.g. CSPs, to deploy the
> accompanying host-side changes, because there are no host-side changes.
> 
> And encapsulating everything in the guest drastically re

Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-07 Thread Mickaël Salaün
On Mon, May 06, 2024 at 06:34:53PM GMT, Sean Christopherson wrote:
> On Mon, May 06, 2024, Mickaël Salaün wrote:
> > On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > > ---
> > > > 
> > > > Changes since v1:
> > > > * New patch. Making user space aware of Heki properties was requested by
> > > >   Sean Christopherson.
> > > 
> > > No, I suggested having userspace _control_ the pinning[*], not merely be notified
> > > of pinning.
> > > 
> > >  : IMO, manipulation of protections, both for memory (this patch) and CPU state
> > >  : (control registers in the next patch) should come from userspace.  I have no
> > >  : objection to KVM providing plumbing if necessary, but I think userspace needs
> > >  : to have full control over the actual state.
> > >  : 
> > >  : One of the things that caused Intel's control register pinning series to stall
> > >  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
> > >  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> > >  : and avoids questions like "should userspace be able to overwrite pinned control
> > >  : registers".
> > >  : 
> > >  : And like the confidential VM use case, keeping userspace in the loop is a big
> > >  : benefit, e.g. the guest can't circumvent protections by coercing userspace into
> > >  : writing to protected memory.
> > > 
> > > I stand by that suggestion, because I don't see a sane way to handle things like
> > > kexec() and reboot without having a _much_ more sophisticated policy than would
> > > ever be acceptable in KVM.
> > > 
> > > I think that can be done without KVM having any awareness of CR pinning whatsoever.
> > > E.g. userspace just needs the ability to intercept CR writes and inject #GPs.  Off
> > > the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> > > userspace could enforce MSR pinning without any new KVM uAPI at all.
> > > 
> > > [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com
> > 
> > > OK, I had concerns about the control not directly coming from the guest,
> > > especially in the case of pKVM and confidential computing, but I get your
> 
> Hardware-based CoCo is completely out of scope, because KVM has zero visibility
> into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
> shouldn't intercept CR0/CR4 for SNP guests).
> 
> And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM might
> help enforce policies that are defined by hardware/firmware, but KVM doesn't
> define any of its own.
> 
> If pKVM on x86 comes along, then KVM will likely get in the business of defining
> policy, but until that happens, KVM needs to stay firmly out of the picture.
> 
> > point.  It should indeed be quite similar to the MSR filtering on the
> > userspace side, except that we need another interface for the guest to
> > request such change (i.e. self-protection).
> > 
> > Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> > forward the request to userspace with a VM exit instead?  That would
> > also enable userspace to get the request and directly configure the CR
> > pinning with the same VM exit.
> 
> No?  Maybe?  I strongly suspect that full support will need a richer set of APIs
> than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated SMM,
> and so on and so forth.  And that's just for CR pinning.
> 
> And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
> delegation or restriction, i.e. there's no way for the guest to communicate to
> the hypervisor that a less privileged component is allowed to perform some action,
> nor is there a way for the guest to say some chunk of CPL0 code *isn't* allowed
> to request transition.  Delegation and restriction all has to be done out-of-band.
> 
> It'd probably be more annoying to set up initially, but I think a synthetic device
> with an MMIO-based interface would be more powerful and flexible in the long run.
> Then userspace can evolve without needing to wait for KVM to catch up.
> 
> Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> Linux alre

Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-06 Thread Mickaël Salaün
On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> On Fri, May 03, 2024, Mickaël Salaün wrote:
> > Add an interface for user space to be notified about guests' Heki policy
> > and related violations.
> > 
> > Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
> > KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
> > contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
> > returned value is the bitmask of known Heki exit reasons, for now:
> > KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.
> > 
> > If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
> > KVM_HC_LOCK_CR_UPDATE hypercall according to the requested control
> > register. This makes the VMM aware of the guest's self-imposed
> > restrictions.
> > 
> > If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
> > pinned CR violation. This enables the VMM to react to a policy
> > violation.
> > 
> > Cc: Borislav Petkov 
> > Cc: Dave Hansen 
> > Cc: H. Peter Anvin 
> > Cc: Ingo Molnar 
> > Cc: Kees Cook 
> > Cc: Madhavan T. Venkataraman 
> > Cc: Paolo Bonzini 
> > Cc: Sean Christopherson 
> > Cc: Thomas Gleixner 
> > Cc: Vitaly Kuznetsov 
> > Cc: Wanpeng Li 
> > Signed-off-by: Mickaël Salaün 
> > Link: https://lore.kernel.org/r/20240503131910.307630-4-...@digikod.net
> > ---
> > 
> > Changes since v1:
> > * New patch. Making user space aware of Heki properties was requested by
> >   Sean Christopherson.
> 
> No, I suggested having userspace _control_ the pinning[*], not merely be notified
> of pinning.
> 
>  : IMO, manipulation of protections, both for memory (this patch) and CPU state
>  : (control registers in the next patch) should come from userspace.  I have no
>  : objection to KVM providing plumbing if necessary, but I think userspace needs
>  : to have full control over the actual state.
>  : 
>  : One of the things that caused Intel's control register pinning series to stall
>  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
>  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
>  : and avoids questions like "should userspace be able to overwrite pinned control
>  : registers".
>  : 
>  : And like the confidential VM use case, keeping userspace in the loop is a big
>  : benefit, e.g. the guest can't circumvent protections by coercing userspace into
>  : writing to protected memory.
> 
> I stand by that suggestion, because I don't see a sane way to handle things like
> kexec() and reboot without having a _much_ more sophisticated policy than would
> ever be acceptable in KVM.
> 
> I think that can be done without KVM having any awareness of CR pinning whatsoever.
> E.g. userspace just needs the ability to intercept CR writes and inject #GPs.  Off
> the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> userspace could enforce MSR pinning without any new KVM uAPI at all.
> 
> [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com

OK, I had concerns about the control not directly coming from the guest,
especially in the case of pKVM and confidential computing, but I get your
point.  It should indeed be quite similar to the MSR filtering on the
userspace side, except that we need another interface for the guest to
request such a change (i.e. self-protection).

Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
forward the request to userspace with a VM exit instead?  That would
also enable userspace to get the request and directly configure the CR
pinning with the same VM exit.
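
To make the proposal above more concrete, here is a minimal standalone
sketch of what the VMM side could do once such a request is forwarded to
it.  Everything in it is an assumption made for illustration: struct
lock_cr_request, struct vmm_cr_policy and vmm_handle_lock_cr() are
hypothetical names, the set of flags the VMM accepts is arbitrary, and
plain errno values stand in for whatever error code would really be
returned to the guest.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Architectural bit positions of the flags used in this sketch. */
#define X86_CR0_WP   (1UL << 16)
#define X86_CR4_SMEP (1UL << 20)
#define X86_CR4_SMAP (1UL << 21)

/* Hypothetical payload of a forwarded lock request. */
struct lock_cr_request {
	unsigned int cr;	/* control register number (0 or 4) */
	unsigned long mask;	/* flags the guest asks to pin */
};

/* Hypothetical VMM-side record of what is currently pinned. */
struct vmm_cr_policy {
	unsigned long pinned_cr0;
	unsigned long pinned_cr4;
};

/*
 * Returns the value handed back to the guest: 0 on success, -EINVAL if
 * the register or one of the requested flags is not accepted.  Pinning
 * is modelled as cumulative (assumption): bits are only ever added.
 */
static long vmm_handle_lock_cr(struct vmm_cr_policy *p,
			       const struct lock_cr_request *req)
{
	assert(p != NULL && req != NULL);

	switch (req->cr) {
	case 0:
		if (req->mask & ~X86_CR0_WP)
			return -EINVAL;
		p->pinned_cr0 |= req->mask;
		return 0;
	case 4:
		if (req->mask & ~(X86_CR4_SMEP | X86_CR4_SMAP))
			return -EINVAL;
		p->pinned_cr4 |= req->mask;
		return 0;
	default:
		return -EINVAL;
	}
}
```

With the state kept in the VMM like this, policy decisions such as what
to do on kexec() or reboot would be taken in userspace rather than in
KVM.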



[RFC PATCH v3 5/5] virt: Add Heki KUnit tests

2024-05-03 Thread Mickaël Salaün
The new CONFIG_HEKI_KUNIT_TEST option makes it possible to run the tests
in a kernel module.  The minimal required configuration is listed in the
virt/heki/.kunitconfig file.

test_cr_disable_smep checks control-register pinning by trying to
disable SMEP.  This test should therefore fail on a non-protected kernel,
and only succeed on a kernel protected by Heki.

This test doesn't rely on native_write_cr4() because of the
cr4_pinned_mask hardening, which means that this *test* module loads
valid kernel code to arbitrarily change CR4.  This simulates an attack
scenario where an attacker would use ROP to directly jump to the related
cr4 instruction.

As for any KUnit test, the kernel is tainted with TAINT_TEST when the
test is executed.

It is worthwhile to create new KUnit tests instead of extending KVM's
Kselftests because Heki is designed to be hypervisor-agnostic: it relies
on a set of hypercalls (for KVM or others), and we also want to test
the kernel's configuration (actual pinned CR).  However, new KVM
Kselftests would be useful to test KVM's interface with the host.

When using Qemu, we need to pass the following arguments: -cpu host
-enable-kvm

For now, it is not possible to run these tests as built-in but we are
working on that [1].  If tests are built-in anyway, they will just be
skipped because Heki would not be enabled.

Run Heki tests with:
  insmod heki-test.ko

  KTAP version 1
  1..1
  KTAP version 1
  # Subtest: heki_x86
  # module: heki_test
  1..1
  ok 1 test_cr_disable_smep
  ok 1 heki_x86

Link: https://lore.kernel.org/r/20240229170409.365386-2-...@digikod.net [1]
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20240503131910.307630-6-...@digikod.net
---

Changes since v2:
* Make tests standalone (e.g. don't depend on CONFIG_HEKI).
* Make it possible to build the tests as a kernel module.
* Don't rely on private kernel symbols.
* Handle GP fault for CR-pinning test case.
* Rename option to CONFIG_HEKI_KUNIT_TEST.
* Add the list of required kernel options.
* Move tests to virt/heki-test/ [FIXME]
* Only keep CR pinning test.
* Restore previous state (with SMEP enabled).
* Add a Kconfig menu for Heki and update the description.
* Skip tests if Heki is not protecting the running kernel.

Changes since v1:
* Move all tests to virt/heki/tests.c
---
 include/linux/heki.h   |   1 +
 virt/heki/.kunitconfig |   9 
 virt/heki/Kconfig  |  12 +
 virt/heki/Makefile |   1 +
 virt/heki/heki-test.c  | 114 +
 virt/heki/main.c   |  10 
 6 files changed, 147 insertions(+)
 create mode 100644 virt/heki/.kunitconfig
 create mode 100644 virt/heki/heki-test.c

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 96ccb17657e5..3294c4d583e5 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -35,6 +35,7 @@ struct heki {
 
 extern struct heki heki;
 extern bool heki_enabled;
+extern bool heki_enforcing;
 
 void heki_early_init(void);
 void heki_late_init(void);
diff --git a/virt/heki/.kunitconfig b/virt/heki/.kunitconfig
new file mode 100644
index ..ad4454800579
--- /dev/null
+++ b/virt/heki/.kunitconfig
@@ -0,0 +1,9 @@
+CONFIG_HEKI=y
+CONFIG_HEKI_KUNIT_TEST=m
+CONFIG_HEKI_MENU=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_HYPERVISOR_GUEST=y
+CONFIG_KUNIT=y
+CONFIG_KVM=y
+CONFIG_KVM_GUEST=y
+CONFIG_PARAVIRT=y
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 0c764e342f48..18895a81a9af 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -28,4 +28,16 @@ config HEKI
  This feature is helpful in maintaining guest virtual machine security
  even after the guest kernel has been compromised.
 
+config HEKI_KUNIT_TEST
+   tristate "KUnit tests for Heki" if !KUNIT_ALL_TESTS
+   depends on KUNIT
+   depends on X86
+   default KUNIT_ALL_TESTS
+   help
+ Build KUnit tests for Heki.
+
+ See the KUnit documentation in Documentation/dev-tools/kunit
+
+ If you are unsure how to answer this question, answer N.
+
 endif
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
index 8b10e73a154b..7133545eb5ae 100644
--- a/virt/heki/Makefile
+++ b/virt/heki/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 obj-$(CONFIG_HEKI) += main.o
+obj-$(CONFIG_HEKI_KUNIT_TEST) += heki-test.o
diff --git a/virt/heki/heki-test.c b/virt/heki/heki-test.c
new file mode 100644
index ..b4e11c21ac5d
--- /dev/null
+++ b/virt/heki/heki-test.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Tests
+ *
+ * Copyright © 2023-2024 Microsoft Corporation
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Returns true on error (i.e. GP fault), false otherwise. */
+static __always_inline bool set_cr4(unsigned long value)
+{
+   int err = 0;
+
+   might_sleep();
+   /* clang-format off */
+   asm volatile("1: mov %[value],%%cr4 \n"
+  

[RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-03 Thread Mickaël Salaün
Add an interface for user space to be notified about guests' Heki policy
and related violations.

Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
returned value is the bitmask of known Heki exit reasons, for now:
KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.

If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
KVM_HC_LOCK_CR_UPDATE hypercall according to the requested control
register. This makes the VMM aware of the guest's self-imposed
restrictions.

If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
pinned CR violation. This enables the VMM to react to a policy
violation.
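
As a rough illustration of the argument check described above, the
following standalone model mirrors what both capabilities do with their
first argument: any bit outside the known exit reasons is rejected with
-EINVAL, otherwise the mask is stored.  The numeric values given to
KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4, and the helper
name heki_cap_set_mask(), are assumptions made for this sketch; the real
definitions come from the patched include/uapi/linux/kvm.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values, for illustration only. */
#define KVM_HEKI_EXIT_REASON_CR0 (1ULL << 0)
#define KVM_HEKI_EXIT_REASON_CR4 (1ULL << 1)

#define KVM_HEKI_EXIT_REASON_VALID_MASK \
	(KVM_HEKI_EXIT_REASON_CR0 | KVM_HEKI_EXIT_REASON_CR4)

/*
 * Hypothetical helper modelling the KVM_CAP_HEKI_CONFIGURE and
 * KVM_CAP_HEKI_DENIAL cases: store the requested mask, or return
 * -EINVAL and leave *stored untouched if an unknown bit is set.
 */
static int heki_cap_set_mask(uint64_t *stored, uint64_t requested)
{
	assert(stored != NULL);

	if (requested & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
		return -EINVAL;

	*stored = requested;
	return 0;
}
```

Rejecting unknown bits up front is what lets the set of exit reasons
grow later without silently changing the meaning of existing requests.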

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20240503131910.307630-4-...@digikod.net
---

Changes since v1:
* New patch. Making user space aware of Heki properties was requested by
  Sean Christopherson.
---
 arch/x86/kvm/vmx/vmx.c   |   5 +-
 arch/x86/kvm/x86.c   | 114 +++
 arch/x86/kvm/x86.h   |   7 +--
 include/linux/kvm_host.h |   2 +
 include/uapi/linux/kvm.h |  22 
 5 files changed, 136 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7ba970b525f7..5869a1ed7866 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5445,6 +5445,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
int reg;
int err;
int ret;
+   bool exit = false;
 
exit_qualification = vmx_get_exit_qual(vcpu);
cr = exit_qualification & 15;
@@ -5454,8 +5455,8 @@ static int handle_cr(struct kvm_vcpu *vcpu)
val = kvm_register_read(vcpu, reg);
trace_kvm_cr_write(cr, val);
 
-   ret = heki_check_cr(vcpu, cr, val);
-   if (ret)
+   ret = heki_check_cr(vcpu, cr, val, &exit);
+   if (exit)
return ret;
 
switch (cr) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a5f47be59abc..865e88f2b0fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -119,6 +119,10 @@ static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
 
 #define KVM_CAP_PMU_VALID_MASK KVM_PMU_CAP_DISABLE
 
+#define KVM_HEKI_EXIT_REASON_VALID_MASK ( \
+   KVM_HEKI_EXIT_REASON_CR0 | \
+   KVM_HEKI_EXIT_REASON_CR4)
+
 #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \
 KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
 
@@ -4836,6 +4840,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
r |= BIT(KVM_X86_SW_PROTECTED_VM);
break;
+   case KVM_CAP_HEKI_CONFIGURE:
+   case KVM_CAP_HEKI_DENIAL:
+   r = KVM_HEKI_EXIT_REASON_VALID_MASK;
+   break;
default:
break;
}
@@ -6729,6 +6737,22 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->lock);
break;
+#ifdef CONFIG_HEKI
+   case KVM_CAP_HEKI_CONFIGURE:
+   r = -EINVAL;
+   if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
+   break;
+   kvm->heki_configure_exit_reason = cap->args[0];
+   r = 0;
+   break;
+   case KVM_CAP_HEKI_DENIAL:
+   r = -EINVAL;
+   if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
+   break;
+   kvm->heki_denial_exit_reason = cap->args[0];
+   r = 0;
+   break;
+#endif
default:
r = -EINVAL;
break;
@@ -8283,11 +8307,60 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
 
 #ifdef CONFIG_HEKI
 
+static int complete_heki_configure_exit(struct kvm_vcpu *const vcpu)
+{
+   kvm_rax_write(vcpu, 0);
+   ++vcpu->stat.hypercalls;
+   return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int complete_heki_denial_exit(struct kvm_vcpu *const vcpu)
+{
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+}
+
+/* Returns true if the @exit_reason is handled by @vcpu->kvm. */
+static bool heki_exit_cr(struct kvm_vcpu *const vcpu, const __u32 exit_reason,
+const u64 heki_reason, unsigned long value)
+{
+   switch (exit_reason) {
+   case KVM_EXIT_HEKI_CONFIGURE:
+   if (!(vcpu->kvm->heki_configure_exit_reason & heki_reason))
+   return false;
+
+

[RFC PATCH v3 2/5] KVM: x86: Add new hypercall to lock control registers

2024-05-03 Thread Mickaël Salaün
This enables guests to lock their CR0 and CR4 registers with a subset of
X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes three arguments.  The
first is to identify the control register, the second is a bit mask to
pin (i.e. mark as read-only), and the third is for optional flags.

These register flags should already be pinned by Linux guests, but once
a guest is compromised, that self-protection mechanism could be disabled,
which is not the case with this dedicated hypercall.

Once the CRs are pinned by the guest, any attempt by the guest to change
them results in a general protection fault being injected into the guest.
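
The pinning rule described above can be modelled in a few lines of
standalone C.  This is only a sketch under stated assumptions: struct
cr_pin_state, cr4_pin() and cr4_write_faults() are hypothetical names,
pinning is assumed to be cumulative, and a pinned flag is assumed to
mean "must stay set", which is how Linux's own CR4 pinning treats these
bits.  The bit positions are the architectural ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Architectural CR4 bit positions. */
#define X86_CR4_UMIP (1UL << 11)
#define X86_CR4_SMEP (1UL << 20)
#define X86_CR4_SMAP (1UL << 21)

/* Hypothetical per-guest state: the CR4 flags the guest asked to pin. */
struct cr_pin_state {
	unsigned long pinned_cr4;
};

/* Model of a lock request: bits are added, never removed (assumption). */
static void cr4_pin(struct cr_pin_state *s, unsigned long mask)
{
	assert(s != NULL);
	s->pinned_cr4 |= mask;
}

/* Returns true if writing @val to CR4 must be refused with a #GP. */
static bool cr4_write_faults(const struct cr_pin_state *s, unsigned long val)
{
	assert(s != NULL);
	return (val & s->pinned_cr4) != s->pinned_cr4;
}
```

In this model, clearing SMEP is accepted before the guest pins it and
refused afterwards, while writes that only touch unpinned flags keep
working.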

This hypercall may evolve and support new kinds of registers or pinning.
The optional KVM_LOCK_CR_UPDATE_VERSION flag lets guests discover the
supported abilities by mapping the returned version to the related
features.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20240503131910.307630-3-...@digikod.net
---

Changes since v1:
* Guard KVM_HC_LOCK_CR_UPDATE hypercall with CONFIG_HEKI.
* Move extern cr4_pinned_mask to x86.h (suggested by Kees Cook).
* Move VMX CR checks from vmx_set_cr*() to handle_cr() to make it
  possible to return to user space (see next commit).
* Change the heki_check_cr()'s first argument to vcpu.
* Don't use -KVM_EPERM in heki_check_cr().
* Generate a fault when the guest requests a denied CR update.
* Add a flags argument to get the version of this hypercall. Being able
  to do a proper version check was suggested by Wei Liu.
---
 Documentation/virt/kvm/x86/hypercalls.rst | 17 +
 arch/x86/include/uapi/asm/kvm_para.h  |  2 +
 arch/x86/kernel/cpu/common.c  |  7 +-
 arch/x86/kvm/vmx/vmx.c|  5 ++
 arch/x86/kvm/x86.c| 84 +++
 arch/x86/kvm/x86.h| 22 ++
 include/linux/kvm_host.h  |  5 ++
 include/uapi/linux/kvm_para.h |  1 +
 8 files changed, 141 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 10db7924720f..3178576f4c47 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
 before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID.  In
 addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
 must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
+
+9. KVM_HC_LOCK_CR_UPDATE
+------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Request some control registers to be restricted.
+
+- a0: identify a control register
+- a1: bit mask to make some flags read-only
+- a2: optional KVM_LOCK_CR_UPDATE_VERSION flag that will return the version of
+  this hypercall. Version 1 supports CR0 and CR4 pinning.
+
+The hypercall lets a guest request control register flags to be pinned for
+itself.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..cfc17f3d1877 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -149,4 +149,6 @@ struct kvm_vcpu_pv_apf_data {
 #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
 #define KVM_PV_EOI_DISABLED 0x0
 
+#define KVM_LOCK_CR_UPDATE_VERSION (1 << 0)
+
 #endif /* _UAPI_ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 605c26c009c8..69695d9d6e2a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -398,8 +398,11 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 }
 
 /* These bits should not change their value after CPU init is finished. */
-static const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
-                                             X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED;
+const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP |
+ X86_CR4_UMIP | X86_CR4_FSGSBASE |
+ X86_CR4_CET | X86_CR4_FRED;
+EXPORT_SYMBOL_GPL(cr4_pinned_mask);
+
 static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
 static unsigned long cr4_pinned_bits __ro_after_init;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 22411f4aff53..7ba970b525f7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5453,6 +5453,11 @@ static int handle_cr(struct kvm_vcpu *vcpu)
case 0: /* mov to cr */
val = kvm_register_read(vcp

[RFC PATCH v3 1/5] virt: Introduce Hypervisor Enforced Kernel Integrity (Heki)

2024-05-03 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Implement minimal code to introduce Heki:

- Define the config variables.

- Define a kernel command line parameter "heki" to turn the feature
  on or off. By default, Heki is on.

- Define heki_early_init() and call it in start_kernel(). Currently,
  this function only prints the value of the "heki" command
  line parameter.
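
For readers unfamiliar with boolean kernel parameters, the sketch below
shows the kind of parsing an on/off switch such as "heki" involves.  It
is a simplified userspace approximation written for this example, not
the kernel's kstrtobool(); the function name parse_bool_param() and the
exact set of accepted spellings are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of an on/off parameter parser.
 * Returns 0 and updates *res on success, -1 on unrecognized input
 * (in which case *res is left untouched).
 */
static int parse_bool_param(const char *s, bool *res)
{
	assert(res != NULL);

	if (s == NULL)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* Distinguish "on" from "off" by the second character. */
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		return -1;
	default:
		return -1;
	}
}
```

Leaving the result untouched on bad input matches the "on by default"
behaviour described above: an unparsable value would not turn the
feature off by accident.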

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mickaël Salaün 
Signed-off-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
Link: https://lore.kernel.org/r/20240503131910.307630-2-...@digikod.net
---

Changes since v2:
* Move CONFIG_HEKI under a new CONFIG_HEKI_MENU to group it with the
  test configuration (see following patches).
* Hide CONFIG_ARCH_SUPPORTS_HEKI from users.

Changes since v1:
* Shrank this patch to only contain the minimal common parts.
* Moved heki_early_init() to start_kernel().
* Use kstrtobool().
---
 Kconfig  |  2 ++
 arch/x86/Kconfig |  1 +
 include/linux/heki.h | 31 +++
 init/main.c  |  2 ++
 mm/mm_init.c |  1 +
 virt/Makefile|  1 +
 virt/heki/Kconfig| 25 +
 virt/heki/Makefile   |  3 +++
 virt/heki/common.h   | 16 
 virt/heki/main.c | 33 +
 10 files changed, 115 insertions(+)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/common.h
 create mode 100644 virt/heki/main.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..0c844d9bcb03 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"
 
 source "lib/Kconfig.debug"
 
+source "virt/heki/Kconfig"
+
 source "Documentation/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 928820e61cb5..d2fba63c289b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+   select ARCH_SUPPORTS_HEKI
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/include/linux/heki.h b/include/linux/heki.h
new file mode 100644
index ..4c18d2283392
--- /dev/null
+++ b/include/linux/heki.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Definitions
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#ifndef __HEKI_H__
+#define __HEKI_H__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_HEKI
+
+extern bool heki_enabled;
+
+void heki_early_init(void);
+
+#else /* !CONFIG_HEKI */
+
+static inline void heki_early_init(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
+#endif /* __HEKI_H__ */
diff --git a/init/main.c b/init/main.c
index 5dcf5274c09c..bec2c8d939aa 100644
--- a/init/main.c
+++ b/init/main.c
@@ -102,6 +102,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -1059,6 +1060,7 @@ void start_kernel(void)
uts_ns_init();
key_init();
security_init();
+   heki_early_init();
dbg_late_init();
net_ns_init();
vfs_caches_init();
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..89d9f97bd471 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..856b5ccedb5a 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += lib/
+obj-$(CONFIG_HEKI_MENU) += heki/
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
new file mode 100644
index ..66e73d212856
--- /dev/null
+++ b/virt/heki/Kconfig
@@ -0,0 +1,25 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Hypervisor Enforced Kernel Integrity (Heki)
+
+config ARCH_SUPPORTS_HEKI
+   bool
+   # An architecture should select this when it can successfully build
+   # and run with CONFIG_HEKI. That is, it should provide all of the
+   # architecture support required for the HEKI feature.
+
+menuconfig HEKI_MENU
+   bool "Virtualization hardening"
+
+if HEKI_MENU
+
+config HEKI
+   bool "Hypervisor Enforced Kernel Integrity (Heki)"
+   depends on ARCH_SUPPORTS_HEKI
+   help
+ This feature enhances guest virtual machine security by taking
+ advantage of security features provided by the hypervisor for guests.
+ This feature is helpful in maintaining guest virtual machine security
+ even 

[RFC PATCH v3 0/5] Hypervisor-Enforced Kernel Integrity - CR pinning

2024-05-03 Thread Mickaël Salaün
 to the guest. The guest could then send a signal to the user
space process that triggered this policy violation (not implemented).

Heki can be enabled with the heki=1 boot command argument.
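
For illustration, here is a minimal user-space sketch of how a boolean
boot argument such as heki=1 could be parsed. It is not the code from
this series: parse_bool_arg() is a hypothetical, simplified stand-in for
the kernel's kstrtobool(), heki_enabled_model and heki_parse_param() are
made-up names, and the sketch assumes the feature stays disabled unless
explicitly requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for the kernel's kstrtobool(): only the
 * first character is inspected. 1/y/Y mean true, 0/n/N mean false, anything
 * else is rejected. Returns 0 on success and -1 on invalid input.
 */
static int parse_bool_arg(const char *s, bool *res)
{
	if (s == NULL || res == NULL)
		return -1;

	switch (s[0]) {
	case '1': case 'y': case 'Y':
		*res = true;
		return 0;
	case '0': case 'n': case 'N':
		*res = false;
		return 0;
	default:
		return -1;
	}
}

/* Models the flag a "heki=" handler would set (assumed off by default). */
static bool heki_enabled_model;

/* Returns 0 if the value was understood; on error the flag is unchanged. */
static int heki_parse_param(const char *value)
{
	return parse_bool_arg(value, &heki_enabled_model);
}
```

Rejecting unknown values instead of treating them as "off" keeps a typo
on the command line from silently changing the flag.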

# Similar implementations

Here is a non-exhaustive list of similar implementations that we looked
at and took some ideas from. Linux mainline doesn't support such
security features, let's change that!

Windows's Virtualization-Based Security is a proprietary technology
that provides a superset of this kind of security mechanism, relying on
Hyper-V and Virtual Trust Levels, which make it possible to run a light
and secure VM that enforces restrictions on a full guest VM. This
includes several components such as HVCI for code authenticity, or
HyperGuard for monitoring and protecting kernel code and data.

Samsung's Real-time Kernel Protection (RKP) and Huawei Hypervisor
Execution Environment (HHEE) rely on proprietary hypervisors to protect
some Android devices. They monitor critical kernel data (e.g., page
tables, credentials, selinux_enforcing).

The iOS Kernel Patch Protection (KPP/Watchtower) is a proprietary
solution running in EL3 that monitors and protects critical parts of the
kernel. It is now replaced with a hardware-based mechanism: KTRR/RoRgn.

Bitdefender's Hypervisor Memory Introspection (HVMI) is an open-source
(but out of tree) set of components leveraging virtualization. HVMI
implementation is very complex, and this approach implies potential
semantic gap issues (i.e., kernel data structures may change from one
version to another).

Linux Kernel Runtime Guard is an open-source kernel module that can
detect some illegitimate modifications of kernel data. Because it runs
in the same kernel as the compromised one, an attacker could also bypass
or disable these checks.

Intel's Virtualization Based Hardening [4] [5] is an open-source
proof-of-concept of a thin hypervisor dedicated to guest protection. As
such, it cannot be used to manage several VMs.

# Similar Linux patches

Paravirtualized Control Register pinning [3] added a set of KVM IOCTLs
to restrict some flags to be set. Heki doesn't implement such user space
interface, but only a dedicated hypercall to lock such registers. A
superset of these flags is configurable with Heki.
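
To make the pinning idea concrete, here is a small user-space model. It
is only a sketch under stated assumptions, not the KVM implementation:
struct cr_model, cr_lock() and cr_write() are hypothetical names, and the
model assumes that "locking" a flag means the flag must stay set in every
later write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP (UINT64_C(1) << 16) /* x86 CR0.WP: supervisor write protect */

/* Toy model of one control register as tracked by a hypervisor. */
struct cr_model {
	uint64_t value;  /* current register value */
	uint64_t pinned; /* bits the guest asked to keep set */
};

/* Pin @flags: bits can be added to the pinned set but never removed. */
static void cr_lock(struct cr_model *cr, uint64_t flags)
{
	cr->pinned |= flags;
}

/* Apply a guest write unless it would clear a pinned bit. */
static bool cr_write(struct cr_model *cr, uint64_t new_value)
{
	if ((cr->pinned & ~new_value) != 0)
		return false; /* policy violation: the write is refused */

	cr->value = new_value;
	return true;
}
```

Before cr_lock() is called, a write that clears CR0_WP succeeds; after
it, the same write is refused while writes that keep the bit set still
go through.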

The Hypervisor Based Integrity patches [6] [7] only contain a generic
IPC mechanism (KVM_HC_UCALL hypercall) to request protection to the VMM.
The idea was to extend the KVM_SET_USER_MEMORY_REGION IOCTL to support
more permission than read-only.

# Current limitations

This patch series doesn't handle VM reboot, kexec, or hibernation yet.
We'd like to leverage the related feature from the KVM CR-pinning patch
series [3].  Help appreciated!

We noticed that the KUnit tests don't work on AMD because the exception
table seems to not be properly handled (i.e. a double fault is
received).  Any reason why this would differ from an Intel CPU?

What about extending register pinning to MSRs?  This should first be
implemented as a kernel self-protection though.

This patch series is also a call for collaboration. There is a lot to
do on the hypervisor, guest kernel, and VMM sides.

# Resources

You can find related resources, including previous versions, and
conference talks about this work and the related LVBS project here:
https://github.com/heki-linux

[1] https://lore.kernel.org/all/20211006173113.26445-1-ala...@bitdefender.com/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.ander...@intel.com/
[4] https://github.com/intel/vbh
[5] https://sched.co/TmwN
[6] https://sched.co/eE3f
[7] https://lore.kernel.org/all/20200501185147.208192-1-yua...@google.com/

Please reach out to us by replying to this thread, we're looking for
people to join and collaborate on this project!

Previous versions:
v2: https://lore.kernel.org/r/20231113022326.24388-1-...@digikod.net
v1: https://lore.kernel.org/r/20230505152046.6575-1-...@digikod.net

Regards,

Madhavan T. Venkataraman (1):
  virt: Introduce Hypervisor Enforced Kernel Integrity (Heki)

Mickaël Salaün (4):
  KVM: x86: Add new hypercall to lock control registers
  KVM: x86: Add notifications for Heki policy configuration and
violation
  heki: Lock guest control registers at the end of guest kernel init
  virt: Add Heki KUnit tests

 Documentation/virt/kvm/x86/hypercalls.rst |  17 ++
 Kconfig   |   2 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/x86_init.h   |   1 +
 arch/x86/include/uapi/asm/kvm_para.h  |   2 +
 arch/x86/kernel/cpu/common.c  |   7 +-
 arch/x86/kernel/cpu/hypervisor.c  |   1 +
 arch/x86/kernel/kvm.c |  56 +++
 arch/x86/kvm/Kconfig  |   1 +
 arch/x86/kvm/vmx/vmx.c|   6 +
 arch/x86/kvm/x86.c| 180 ++
 arch/x86/kvm/x86.h|  23 +++
 include

[RFC PATCH v3 4/5] heki: Lock guest control registers at the end of guest kernel init

2024-05-03 Thread Mickaël Salaün
The hypervisor needs to provide some functions to support Heki. These
form the Heki-Hypervisor API.

Define a heki_hypervisor structure to house the API functions. A
hypervisor that supports Heki must instantiate a heki_hypervisor
structure and pass it to the Heki common code. This allows the common
code to access these functions in a hypervisor-agnostic way.

The first function that is implemented is lock_crs() (lock control
registers). That is, certain flags in the control registers are pinned
so that they can never be changed for the lifetime of the guest.

Implement Heki support in the guest:

- Each supported hypervisor in x86 implements a set of functions for the
  guest kernel. Add an init_heki() function to that set.  This function
  initializes Heki-related stuff. Call init_heki() for the detected
  hypervisor in init_hypervisor_platform().

- Implement init_heki() for the guest.

- Implement kvm_lock_crs() in the guest to lock down control registers.
  This function calls a KVM hypercall to do the job.

- Instantiate a heki_hypervisor structure that contains a pointer to
  kvm_lock_crs().

- Pass the heki_hypervisor structure to Heki common code in init_heki().

Implement a heki_late_init() function and call it at the end of kernel
init. This function calls lock_crs(). In other words, control registers
of a guest are locked down at the end of guest kernel init.
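
The registration flow described above can be sketched in user space as
follows. This is an illustration only: the *_model and demo_* names are
hypothetical, the demo backend merely counts calls where a real backend
would issue a hypercall, and the -1 error value is an assumption of the
sketch.

```c
#include <assert.h>
#include <stddef.h>

/* API table a hypervisor backend provides (models heki_hypervisor). */
struct heki_hypervisor_model {
	int (*lock_crs)(void);
};

/* Common-code state: NULL until a backend registers itself. */
struct heki_model {
	const struct heki_hypervisor_model *hypervisor;
};

static struct heki_model heki_state;
static int demo_lock_calls; /* counts calls into the demo backend */

/* Hypothetical backend; a real one would ask the hypervisor here. */
static int demo_lock_crs(void)
{
	demo_lock_calls++;
	return 0;
}

static const struct heki_hypervisor_model demo_hypervisor = {
	.lock_crs = demo_lock_crs,
};

/* Backend init hook: hand the API table to the common code. */
static void demo_init_heki(void)
{
	heki_state.hypervisor = &demo_hypervisor;
}

/* End-of-init hook: lock registers only if a backend registered. */
static int heki_late_init_model(void)
{
	if (heki_state.hypervisor == NULL ||
	    heki_state.hypervisor->lock_crs == NULL)
		return -1;

	return heki_state.hypervisor->lock_crs();
}
```

Keeping the function pointers in a table owned by the backend lets the
common code stay unaware of which hypervisor is underneath.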

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Madhavan T. Venkataraman 
Signed-off-by: Madhavan T. Venkataraman 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20240503131910.307630-5-...@digikod.net
---

Changes since v2:
* Hide CONFIG_HYPERVISOR_SUPPORTS_HEKI from users.

Changes since v1:
* Shrank the patch to only manage the CR pinning.
---
 arch/x86/include/asm/x86_init.h  |  1 +
 arch/x86/kernel/cpu/hypervisor.c |  1 +
 arch/x86/kernel/kvm.c            | 56
 arch/x86/kvm/Kconfig             |  1 +
 include/linux/heki.h             | 22 +
 init/main.c                      |  1 +
 virt/heki/Kconfig                |  8 -
 virt/heki/main.c                 | 25 ++
 8 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 6149eabe200f..113998799473 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -128,6 +128,7 @@ struct x86_hyper_init {
bool (*msi_ext_dest_id)(void);
void (*init_mem_mapping)(void);
void (*init_after_bootmem)(void);
+   void (*init_heki)(void);
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 553bfbfc3a1b..6085c8129e0c 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void)
 
x86_hyper_type = h->type;
x86_init.hyper.init_platform();
+   x86_init.hyper.init_heki();
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7f0732bc0ccd..a54f2c0d7cd0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -999,6 +1000,60 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_HEKI
+
+extern unsigned long cr4_pinned_mask;
+
+/*
+ * TODO: Check SMP policy consistency, e.g. with
+ * this_cpu_read(cpu_tlbstate.cr4)
+ */
+static int kvm_lock_crs(void)
+{
+   unsigned long cr4;
+   int err;
+
+   err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP, 0);
+   if (err)
+   return err;
+
+   cr4 = __read_cr4();
+   err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask,
+0);
+   return err;
+}
+
+static struct heki_hypervisor kvm_heki_hypervisor = {
+   .lock_crs = kvm_lock_crs,
+};
+
+static void kvm_init_heki(void)
+{
+   long err;
+
+   if (!kvm_para_available()) {
+   /* Cannot make KVM hypercalls. */
+   return;
+   }
+
+   err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, 0,
+KVM_LOCK_CR_UPDATE_VERSION);
+   if (err < 1) {
+   /* Ignores host not supporting at least the first version. */
+   return;
+   }
+
+   heki.hypervisor = &kvm_heki_hypervisor;
+}
+
+#else /* CONFIG_HEKI */
+
+static void kvm_init_heki(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name   = "KVM",
.detect = kvm_detect,
@@ -1007,6 +1062,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.i

[RFC PATCH v2 00/19] Hypervisor-Enforced Kernel Integrity

2023-11-12 Thread Mickaël Salaün
 this as well.

We currently use static address ranges to configure protections at boot
(see heki_arch_early_init). This is not compatible with KASLR yet, but
this will be handled in a next patch series.

Because the guest's virtual address translation is not protected by the
hypervisor, a compromised kernel could map existing physical pages into
arbitrary virtual addresses. Intel's new Hypervisor-Managed Linear
Address Translation [10] (HLAT) could be used to extend the current
protection and cover this case.

ROP is not covered by this patch series. Guest kernels can still jump to
arbitrary executable pages according to their control-flow integrity
protection.

# Future work

New dynamic restrictions could extend the set of protected data to
include security-sensitive data such as LSM states, seccomp filters,
keyrings... This requires support outside of the hypervisor.

An execute-only mode could also be useful (cf. XOM for KVM [11] [12]).

Register pinning could be extended (e.g., to MSRs).

For now, MBEC is only supported on a bare metal machine as KVM host;
nested virtualization is not supported yet.  Being able to protect
nested guests might be possible but we need to figure out the potential
security implications.

Protecting the host would be useful, but that doesn't really fit with
the KVM model. The Protected KVM project is a first step to help in this
direction [13].

We only tested this with an Intel CPU, but this approach should work the
same with an AMD CPU starting with the Zen 2 generation and their Guest
Mode Execute Trap (GMET) capability.

We also kept some TODOs to highlight missing checks and code sharing
issues, and some pr_warn() calls to help understand how it works. Tests
need to be improved (e.g., invalid hypercall arguments).

We'll present this work at the Linux Plumbers Conference next week.

[1] https://lore.kernel.org/all/20211006173113.26445-1-ala...@bitdefender.com/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.ander...@intel.com/
[4] https://github.com/kvm-x86/linux
[5] https://lore.kernel.org/all/20231027182217.3615211-1-sea...@google.com/
[6] https://github.com/intel/vbh
[7] https://sched.co/TmwN
[8] https://sched.co/eE3f
[9] https://lore.kernel.org/all/20200501185147.208192-1-yua...@google.com/
[10] https://sched.co/eE4F
[11] https://lore.kernel.org/kvm/20191003212400.31130-1-rick.p.edgeco...@intel.com/
[12] https://lpc.events/event/4/contributions/283/
[13] https://sched.co/eE24

Please reach out to us by replying to this thread, we're looking for
people to join and collaborate on this project!

Previous version:
v1: https://lore.kernel.org/r/20230505152046.6575-1-...@digikod.net

Regards,

Madhavan T. Venkataraman (9):
  virt: Introduce Hypervisor Enforced Kernel Integrity (Heki)
  KVM: x86: Add new hypercall to set EPT permissions
  x86: Implement the Memory Table feature to store arbitrary per-page
data
  heki: Implement a kernel page table walker
  heki: x86: Initialize permissions counters for pages mapped into KVA
  heki: x86: Initialize permissions counters for pages in
vmap()/vunmap()
  heki: x86: Update permissions counters when guest page permissions
change
  heki: x86: Update permissions counters during text patching
  heki: x86: Protect guest kernel memory using the KVM hypervisor

Mickaël Salaün (10):
  KVM: x86: Add new hypercall to lock control registers
  KVM: x86: Add notifications for Heki policy configuration and
violation
  heki: Lock guest control registers at the end of guest kernel init
  KVM: VMX: Add MBEC support
  KVM: x86: Add kvm_x86_ops.fault_gva()
  KVM: x86: Make memory attribute helpers more generic
  KVM: x86: Extend kvm_vm_set_mem_attributes() with a mask
  KVM: x86: Extend kvm_range_has_memory_attributes() with match_all
  KVM: x86: Implement per-guest-page permissions
  virt: Add Heki KUnit tests

 Documentation/virt/kvm/x86/hypercalls.rst |  31 +++
 Kconfig   |   2 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/kvm-x86-ops.h|   1 +
 arch/x86/include/asm/kvm_host.h   |   2 +
 arch/x86/include/asm/vmx.h|  11 +-
 arch/x86/include/asm/x86_init.h   |   1 +
 arch/x86/include/uapi/asm/kvm_para.h  |   2 +
 arch/x86/kernel/alternative.c |   5 +
 arch/x86/kernel/cpu/common.c  |   4 +-
 arch/x86/kernel/cpu/hypervisor.c  |   1 +
 arch/x86/kernel/kvm.c |  67 +
 arch/x86/kernel/setup.c   |   2 +
 arch/x86/kvm/Kconfig  |   2 +
 arch/x86/kvm/Makefile |   4 +-
 arch/x86/kvm/mmu.h|   3 +-
 arch/x86/kvm/mmu/mmu.c| 114 ++--
 arch/x86/kvm/mmu/mmutrace.h   |  11 +-
 arch/x86/kvm/mmu/paging_tmpl.h|  19 +-
 arch/x86/kvm/mmu/spte.c   |  19 +-
 arch/x86/kvm/mmu

[RFC PATCH v2 13/19] heki: Implement a kernel page table walker

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

The Heki feature needs to do the following:

- Find kernel mappings.

- Determine the permissions associated with each mapping.

- Determine the collective permissions for a guest physical page across
  all of its mappings.

This way, a guest physical page can reflect only the required
permissions in the EPT thanks to the KVM_HC_PROTECT_MEMORY hypercall.

Implement a kernel page table walker that walks all of the kernel
mappings and calls a callback function for each mapping.
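
As a rough user-space illustration of the walker-plus-callback pattern,
the sketch below walks a toy two-level table. None of it is taken from
virt/heki/walk.c: the toy_* types, the table sizes, and the extra ctx
argument passed to the callback are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE  4096UL
#define TOY_ENTRIES    4UL /* entries per toy table */
#define TOY_LARGE_SIZE (TOY_PAGE_SIZE * TOY_ENTRIES)

/* Toy last-level entry: maps one page. */
struct toy_pte {
	bool present;
	uint64_t pa;
	unsigned long flags;
};

/* Toy upper-level entry: either a large mapping or a table of PTEs. */
struct toy_pmd {
	bool present;
	bool large;
	uint64_t pa;          /* valid when large */
	unsigned long flags;  /* valid when large */
	struct toy_pte *ptes; /* valid when !large: TOY_ENTRIES entries */
};

/* Information handed to the callback, modelled on struct heki_args. */
struct walk_args {
	unsigned long va;
	uint64_t pa;
	size_t size;
	unsigned long flags;
};

/* The ctx pointer is an addition of this sketch. */
typedef void (*walk_func_t)(struct walk_args *args, void *ctx);

/* Visit every present mapping and call func once per mapping. */
static void toy_walk(const struct toy_pmd *pmds, size_t nr_pmds,
		     walk_func_t func, void *ctx)
{
	struct walk_args args;
	size_t i, j;

	for (i = 0; i < nr_pmds; i++) {
		unsigned long va = i * TOY_LARGE_SIZE;

		if (!pmds[i].present)
			continue;

		if (pmds[i].large) {
			/* One callback for the whole large mapping. */
			args.va = va;
			args.pa = pmds[i].pa;
			args.size = TOY_LARGE_SIZE;
			args.flags = pmds[i].flags;
			func(&args, ctx);
			continue;
		}

		for (j = 0; j < TOY_ENTRIES; j++, va += TOY_PAGE_SIZE) {
			const struct toy_pte *pte = &pmds[i].ptes[j];

			if (!pte->present)
				continue;

			args.va = va;
			args.pa = pte->pa;
			args.size = TOY_PAGE_SIZE;
			args.flags = pte->flags;
			func(&args, ctx);
		}
	}
}

/* Example callback: count mappings and the bytes they cover. */
struct walk_stats {
	size_t mappings;
	size_t bytes;
};

static void count_mapping(struct walk_args *args, void *ctx)
{
	struct walk_stats *stats = ctx;

	stats->mappings++;
	stats->bytes += args->size;
}
```

A large entry produces a single callback covering its whole range, which
mirrors how the real walker reports PMD- and PUD-sized mappings without
descending further.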

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mickaël Salaün 
Signed-off-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
---

Change since v1:
* New patch and new file: virt/heki/walk.c
---
 include/linux/heki.h |  16 +
 virt/heki/Makefile   |   1 +
 virt/heki/walk.c     | 140 +++
 3 files changed, 157 insertions(+)
 create mode 100644 virt/heki/walk.c

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 9b0c966c50d1..a7ae0b387dfe 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -61,6 +61,22 @@ struct heki {
struct heki_hypervisor *hypervisor;
 };
 
+/*
+ * The kernel page table is walked to locate kernel mappings. For each
+ * mapping, a callback function is called. The table walker passes information
+ * about the mapping to the callback using this structure.
+ */
+struct heki_args {
+   /* Information passed by the table walker to the callback. */
+   unsigned long va;
+   phys_addr_t pa;
+   size_t size;
+   unsigned long flags;
+};
+
+/* Callback function called by the table walker. */
+typedef void (*heki_func_t)(struct heki_args *args);
+
 extern struct heki heki;
 extern bool heki_enabled;
 
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
index 354e567df71c..a5daa4ff7a4f 100644
--- a/virt/heki/Makefile
+++ b/virt/heki/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 obj-y += main.o
+obj-y += walk.o
diff --git a/virt/heki/walk.c b/virt/heki/walk.c
new file mode 100644
index ..e10b54226fcc
--- /dev/null
+++ b/virt/heki/walk.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Kernel page table walker.
+ *
+ * Copyright © 2023 Microsoft Corporation
+ *
+ * Cf. arch/x86/mm/init_64.c
+ */
+
+#include 
+#include 
+
+static void heki_walk_pte(pmd_t *pmd, unsigned long va, unsigned long va_end,
+ heki_func_t func, struct heki_args *args)
+{
+   pte_t *pte;
+   unsigned long next_va;
+
+   for (pte = pte_offset_kernel(pmd, va); va < va_end;
+va = next_va, pte++) {
+   next_va = (va + PAGE_SIZE) & PAGE_MASK;
+
+   if (next_va > va_end)
+   next_va = va_end;
+
+   if (!pte_present(*pte))
+   continue;
+
+   args->va = va;
+   args->pa = pte_pfn(*pte) << PAGE_SHIFT;
+   args->size = PAGE_SIZE;
+   args->flags = pte_flags(*pte);
+
+   func(args);
+   }
+}
+
+static void heki_walk_pmd(pud_t *pud, unsigned long va, unsigned long va_end,
+ heki_func_t func, struct heki_args *args)
+{
+   pmd_t *pmd;
+   unsigned long next_va;
+
+   for (pmd = pmd_offset(pud, va); va < va_end; va = next_va, pmd++) {
+   next_va = pmd_addr_end(va, va_end);
+
+   if (!pmd_present(*pmd))
+   continue;
+
+   if (pmd_large(*pmd)) {
+   args->va = va;
+   args->pa = pmd_pfn(*pmd) << PAGE_SHIFT;
+   args->pa += va & (PMD_SIZE - 1);
+   args->size = next_va - va;
+   args->flags = pmd_flags(*pmd);
+
+   func(args);
+   } else {
+   heki_walk_pte(pmd, va, next_va, func, args);
+   }
+   }
+}
+
+static void heki_walk_pud(p4d_t *p4d, unsigned long va, unsigned long va_end,
+ heki_func_t func, struct heki_args *args)
+{
+   pud_t *pud;
+   unsigned long next_va;
+
+   for (pud = pud_offset(p4d, va); va < va_end; va = next_va, pud++) {
+   next_va = pud_addr_end(va, va_end);
+
+   if (!pud_present(*pud))
+   continue;
+
+   if (pud_large(*pud)) {
+   args->va = va;
+   args->pa = pud_pfn(*pud) << PAGE_SHIFT;
+   args->pa += va & (PUD_SIZE - 1);
+   args->size = next_va - va;
+   args->flags = pud_flags(*pud);
+
+ 

[RFC PATCH v2 12/19] x86: Implement the Memory Table feature to store arbitrary per-page data

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

This feature can be used by a consumer to associate any arbitrary
pointer with a physical page. The feature implements a page table format
that mirrors the hardware page table. A leaf entry in the table points
to consumer data for that page.

The page table format has these advantages:

- The format allows for a sparse representation. This is useful since
  the physical address space can be large and is typically sparsely
  populated in a system.

- A consumer of this feature can choose to populate data just for the
  pages it is interested in.

- Information can be stored for large pages, if a consumer wishes.
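
The sparse layout can be illustrated with a toy two-level table in user
space. This is a sketch, not kernel/mem_table.c: the toy_* names, the
16-entry levels and the 256-page limit are assumptions. Like the real
feature, it reserves bit 0 of an entry to mark a pointer to a lower-level
table, so consumer pointers must be at least two-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_SHIFT    4u                    /* 16 entries per level */
#define TOY_NENTRIES (1u << TOY_SHIFT)
#define TOY_MAX_PFN  (TOY_NENTRIES * TOY_NENTRIES)
#define TOY_TABLE    ((uintptr_t)1)        /* tag: entry points to a table */

struct toy_table {
	void *entries[TOY_NENTRIES];
};

static int toy_is_table(const void *entry)
{
	return ((uintptr_t)entry & TOY_TABLE) != 0;
}

static struct toy_table *toy_untag(void *entry)
{
	return (struct toy_table *)((uintptr_t)entry & ~TOY_TABLE);
}

/* Return the leaf slot for @pfn, allocating the second level on demand. */
static void **toy_create(struct toy_table *root, unsigned int pfn)
{
	unsigned int hi, lo;
	struct toy_table *leaf;

	assert(pfn < TOY_MAX_PFN);
	hi = pfn >> TOY_SHIFT;
	lo = pfn & (TOY_NENTRIES - 1);

	if (root->entries[hi] == NULL) {
		leaf = calloc(1, sizeof(*leaf));
		if (leaf == NULL)
			return NULL;
		root->entries[hi] = (void *)((uintptr_t)leaf | TOY_TABLE);
	}

	leaf = toy_untag(root->entries[hi]);
	return &leaf->entries[lo];
}

/* Return the leaf slot for @pfn, or NULL if that region was never populated. */
static void **toy_find(struct toy_table *root, unsigned int pfn)
{
	void *entry;

	assert(pfn < TOY_MAX_PFN);
	entry = root->entries[pfn >> TOY_SHIFT];
	if (!toy_is_table(entry))
		return NULL;

	return &toy_untag(entry)->entries[pfn & (TOY_NENTRIES - 1)];
}

/* Release all second-level tables; consumer data is not owned by the table. */
static void toy_free(struct toy_table *root)
{
	unsigned int i;

	for (i = 0; i < TOY_NENTRIES; i++) {
		if (toy_is_table(root->entries[i]))
			free(toy_untag(root->entries[i]));
		root->entries[i] = NULL;
	}
}
```

Only the second-level tables that are actually touched get allocated,
which is what keeps the representation small for a sparsely populated
address space.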

For instance, for Heki, the guest kernel uses this to create permissions
counters for each guest physical page. The permissions counters reflect
the collective permissions for a guest physical page across all mappings
to that page. This allows the guest to request the hypervisor to set
only the necessary permissions for a guest physical page in the EPT
(instead of RWX).
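
A minimal sketch of such per-page counters follows. The names (struct
perm_counters, perm_map(), perm_unmap(), perm_collective()) and the
PERM_* bit values are hypothetical, and the sketch assumes the collective
permission of a page is the union of the permissions of its current
mappings.

```c
#include <assert.h>

#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/* Number of current mappings of one page that need each permission. */
struct perm_counters {
	unsigned int read;
	unsigned int write;
	unsigned int exec;
};

/* Account for a new mapping of the page with permissions @perms. */
static void perm_map(struct perm_counters *c, unsigned int perms)
{
	if (perms & PERM_R)
		c->read++;
	if (perms & PERM_W)
		c->write++;
	if (perms & PERM_X)
		c->exec++;
}

/* Account for the removal of a mapping previously added with @perms. */
static void perm_unmap(struct perm_counters *c, unsigned int perms)
{
	if (perms & PERM_R) {
		assert(c->read > 0);
		c->read--;
	}
	if (perms & PERM_W) {
		assert(c->write > 0);
		c->write--;
	}
	if (perms & PERM_X) {
		assert(c->exec > 0);
		c->exec--;
	}
}

/* Permissions the page still needs: the union over all live mappings. */
static unsigned int perm_collective(const struct perm_counters *c)
{
	unsigned int perms = 0;

	if (c->read > 0)
		perms |= PERM_R;
	if (c->write > 0)
		perms |= PERM_W;
	if (c->exec > 0)
		perms |= PERM_X;
	return perms;
}
```

Counting rather than storing a single bitmask means that removing one
writable mapping drops the write permission only when no other writable
mapping of the same page remains.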

This feature could also be used to improve KVM's memory attributes
and write page tracking.

We will support large page entries in mem_table in a future version
thanks to extra mem_table_ops's merge() and split() operations.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* New patch and new file: kernel/mem_table.c
---
 arch/x86/kernel/setup.c   |   2 +
 include/linux/heki.h      |   1 +
 include/linux/mem_table.h |  55 ++
 kernel/Makefile           |   2 +
 kernel/mem_table.c        | 219 ++
 5 files changed, 279 insertions(+)
 create mode 100644 include/linux/mem_table.h
 create mode 100644 kernel/mem_table.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b098b1fa2470..e7ae46953ae4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1315,6 +1316,7 @@ void __init setup_arch(char **cmdline_p)
 #endif
 
unwind_init();
+   mem_table_init(PG_LEVEL_4K);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/include/linux/heki.h b/include/linux/heki.h
index 89cc9273a968..9b0c966c50d1 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_HEKI
 
diff --git a/include/linux/mem_table.h b/include/linux/mem_table.h
new file mode 100644
index ..738bf12309f3
--- /dev/null
+++ b/include/linux/mem_table.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Memory table feature - Definitions.
+ *
+ * Copyright © 2023 Microsoft Corporation.
+ */
+
+#ifndef __MEM_TABLE_H__
+#define __MEM_TABLE_H__
+
+/* clang-format off */
+
+/*
+ * The MEM_TABLE bit is set on entries that point to an intermediate table.
+ * So, this bit is reserved. This means that pointers to consumer data must
+ * be at least two-byte aligned (so the MEM_TABLE bit is 0).
+ */
+#define MEM_TABLE  BIT(0)
+#define IS_LEAF(entry) (!((uintptr_t)(entry) & MEM_TABLE))
+
+/* clang-format on */
+
+/*
+ * A memory table is arranged exactly like a page table. The memory table
+ * configuration reflects the hardware page table configuration.
+ */
+
+/* Parameters at each level of the memory table hierarchy. */
+struct mem_table_level {
+   unsigned int number;
+   unsigned int nentries;
+   unsigned int shift;
+   unsigned int mask;
+};
+
+struct mem_table {
+   struct mem_table_level *level;
+   struct mem_table_ops *ops;
+   bool changed;
+   void *entries[];
+};
+
+/* Operations that need to be supplied by a consumer of memory tables. */
+struct mem_table_ops {
+   void (*free)(void *buf);
+};
+
+void mem_table_init(unsigned int base_level);
+struct mem_table *mem_table_alloc(struct mem_table_ops *ops);
+void mem_table_free(struct mem_table *table);
+void **mem_table_create(struct mem_table *table, phys_addr_t pa);
+void **mem_table_find(struct mem_table *table, phys_addr_t pa,
+ unsigned int *level_num);
+
+#endif /* __MEM_TABLE_H__ */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..dcef03ec5c54 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -131,6 +131,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue.o
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_SPARSEMEM) += mem_table.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
diff --git a/kernel/mem_table.c b/kernel/mem_table.c
new file mode 100644
index ..280a1b5ddde0
--- /dev/null
+++ b/kernel/mem_tab

[RFC PATCH v2 08/19] KVM: x86: Extend kvm_vm_set_mem_attributes() with a mask

2023-11-12 Thread Mickaël Salaün
Allow updating only a subset of attributes.

This is needed to be able to use the XArray for different use cases and
make sure they don't interfere (see a following commit).
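
The masked update boils down to one expression, shown here as a
stand-alone sketch. attrs_update() is a hypothetical helper name; the
expression follows the value computation in the kvm_main.c hunk of this
patch.

```c
#include <assert.h>

/*
 * Replace only the bits selected by @mask: bits of @old outside the mask
 * are preserved, bits inside the mask are taken from @attributes.
 */
static unsigned long attrs_update(unsigned long old, unsigned long attributes,
				  unsigned long mask)
{
	return (old & ~mask) | (attributes & mask);
}
```

For example, with old = 0x5 and mask = 0x3, writing attributes = 0x2
yields 0x6: the bit outside the mask (0x4) survives while the two masked
bits are replaced.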

Cc: Chao Peng 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Sean Christopherson 
Cc: Yu Zhang 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* New patch
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 27 +++
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4d378d308762..d7010e09440d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7283,7 +7283,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 
for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
if (hugepage_test_mixed(slot, gfn, level - 1) ||
-   attrs != kvm_get_memory_attributes(kvm, gfn))
+   !(attrs & kvm_get_memory_attributes(kvm, gfn)))
return false;
}
return true;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 85b8648fd892..de68390ab0f2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2397,7 +2397,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 struct kvm_gfn_range *range);
 int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
- unsigned long attributes);
+ unsigned long attributes, unsigned long mask);
 
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0096ccfbb609..e2c178db17d5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2436,7 +2436,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 /*
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
- * matching @attrs.
+ * matching the @attrs bitmask.
  */
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 unsigned long attrs)
@@ -2459,7 +2459,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
entry = xas_next(&xas);
} while (xas_retry(&xas, entry));
 
-   if (xas.xa_index != index || xa_to_value(entry) != attrs) {
+   if (xas.xa_index != index ||
+   (xa_to_value(entry) & attrs) != attrs) {
has_attrs = false;
break;
}
@@ -2553,7 +2554,7 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
 
 /* Set @attributes for the gfn range [@start, @end). */
 int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-unsigned long attributes)
+ unsigned long attributes, unsigned long mask)
 {
struct kvm_mmu_notifier_range pre_set_range = {
.start = start,
@@ -2572,11 +2573,8 @@ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
.may_block = true,
};
unsigned long i;
-   void *entry;
int r = 0;
 
-   entry = attributes ? xa_mk_value(attributes) : NULL;
-
lockdep_assert_held(&kvm->slots_arch_lock);
 
/* Nothing to do if the entire range as the desired attributes. */
@@ -2596,6 +2594,16 @@ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
kvm_handle_gfn_range(kvm, &pre_set_range);
 
for (i = start; i < end; i++) {
+   unsigned long value = 0;
+   void *entry;
+
+   entry = xa_load(&kvm->mem_attr_array, i);
+   if (xa_is_value(entry))
+   value = xa_to_value(entry) & ~mask;
+
+   value |= attributes & mask;
+   entry = value ? xa_mk_value(value) : NULL;
+
r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
GFP_KERNEL_ACCOUNT));
KVM_BUG_ON(r, kvm);
@@ -2609,12 +2617,14 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
   struct kvm_memory_attributes *attrs)
 {
int r;
+   unsigned long attrs_mask;
gfn_t start, end;
 
/* flags is currently not used. */
if (attrs->flags)
return -EINVAL;
-   if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+   attrs_mask = kvm_supported_mem_attributes(kvm);
+   if (attrs->attributes & ~attrs_mask)
return -EINVAL;
if (attrs->size == 0 || attrs->address + attrs->size < attrs->addres

[RFC PATCH v2 09/19] KVM: x86: Extend kvm_range_has_memory_attributes() with match_all

2023-11-12 Thread Mickaël Salaün
This makes it possible to check if an attribute is tied to any memory
page in a range. This will be useful in a following commit to check for
KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE.
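
The difference between the two modes can be sketched over a plain array
instead of the XArray. range_has_attrs() is a hypothetical name, and the
handling of an empty range (vacuously true for "all", false for "any")
is an assumption of this sketch rather than a statement about the KVM
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Check the pages in [start, end) of @page_attrs against the @attrs bitmask.
 * With @match_all, every page must have all bits of @attrs set; otherwise a
 * single page with all bits of @attrs set is enough.
 */
static bool range_has_attrs(const unsigned long *page_attrs, size_t start,
			    size_t end, unsigned long attrs, bool match_all)
{
	size_t i;

	for (i = start; i < end; i++) {
		bool match = (page_attrs[i] & attrs) == attrs;

		if (match_all && !match)
			return false; /* one mismatch settles "all" */
		if (!match_all && match)
			return true;  /* one match settles "any" */
	}

	/* No early exit: "all" held for every page, "any" found nothing. */
	return match_all;
}
```

The "any" mode is the one suited to detecting whether some page in a
range already carries a given attribute before the range is modified.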

Cc: Chao Peng 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Sean Christopherson 
Cc: Yu Zhang 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* New patch
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 27 ++-
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d7010e09440d..2024ff21d036 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7279,7 +7279,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
if (level == PG_LEVEL_2M)
-   return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+   return kvm_range_has_memory_attributes(kvm, start, end, attrs, true);
 
for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
if (hugepage_test_mixed(slot, gfn, level - 1) ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index de68390ab0f2..9ecb016a336f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2391,7 +2391,7 @@ static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn
 }
 
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-unsigned long attrs);
+unsigned long attrs, bool match_all);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e2c178db17d5..67dbaaf40c1c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2435,11 +2435,11 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 /*
- * Returns true if _all_ gfns in the range [@start, @end) have attributes
- * matching the @attrs bitmask.
+ * According to @match_all, returns true if _all_ (respectively _any_) gfns in
+ * the range [@start, @end) have attributes matching the @attrs bitmask.
  */
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-unsigned long attrs)
+unsigned long attrs, bool match_all)
 {
XA_STATE(xas, &kvm->mem_attr_array, start);
unsigned long index;
@@ -2453,16 +2453,25 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
goto out;
}
 
-   has_attrs = true;
+   has_attrs = match_all;
for (index = start; index < end; index++) {
do {
entry = xas_next(&xas);
} while (xas_retry(&xas, entry));
 
-   if (xas.xa_index != index ||
-   (xa_to_value(entry) & attrs) != attrs) {
-   has_attrs = false;
-   break;
+   if (match_all) {
+   if (xas.xa_index != index ||
+   (xa_to_value(entry) & attrs) != attrs) {
+   has_attrs = false;
+   break;
+   }
+   } else {
+   index = xas.xa_index;
+   if (index < end &&
+   (xa_to_value(entry) & attrs) == attrs) {
+   has_attrs = true;
+   break;
+   }
}
}
 
@@ -2578,7 +2587,7 @@ int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
lockdep_assert_held(&kvm->slots_arch_lock);
 
/* Nothing to do if the entire range as the desired attributes. */
-   if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
+   if (kvm_range_has_memory_attributes(kvm, start, end, attributes, true))
return r;
 
/*
-- 
2.42.1
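
Note on the match_all parameter introduced by the patch above: the two
modes can be illustrated outside the kernel. What follows is only a
userspace sketch, assuming a flat array in place of the memory attribute
XArray. The name sketch_range_has_attrs() and its table argument are
hypothetical and are not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-gfn attribute XArray: table[i] holds
 * the attribute bits of gfn i.
 */
static bool sketch_range_has_attrs(const unsigned long *table, size_t start,
				   size_t end, unsigned long attrs,
				   bool match_all)
{
	size_t i;

	for (i = start; i < end; i++) {
		bool hit = (table[i] & attrs) == attrs;

		/* "all" mode: the first miss settles the answer. */
		if (match_all && !hit)
			return false;
		/* "any" mode: the first hit settles the answer. */
		if (!match_all && hit)
			return true;
	}

	/* Nothing settled it: "all" holds, "any" does not. */
	return match_all;
}
```

As in the patch, the initial answer is match_all itself, so an empty
range is reported as matching in "all" mode and as not matching in "any"
mode.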




[RFC PATCH v2 18/19] heki: x86: Protect guest kernel memory using the KVM hypervisor

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Implement a hypervisor function, kvm_protect_memory(), that calls the
KVM_HC_PROTECT_MEMORY hypercall to request the KVM hypervisor to
set specified permissions on a list of guest pages.

Using the protect_memory() function, set proper EPT permissions for all
guest pages.

Use the MEM_ATTR_IMMUTABLE property to protect the kernel static
sections and the boot-time read-only sections. This ensures that a
compromised guest cannot change the permissions of its main physical
memory pages. However, this also disables any feature that may change
the kernel's text section (e.g., ftrace, Kprobes), although such
features can still be used on kernel modules.
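
The effect of MEM_ATTR_IMMUTABLE can be pictured with a small userspace
sketch. The MEM_ATTR_* values match those defined elsewhere in this
series; the helper sketch_attr_update() and its -EPERM return value are
assumptions made for illustration, not the hypervisor code.

```c
#include <errno.h>

#define MEM_ATTR_READ		(1UL << 0)
#define MEM_ATTR_WRITE		(1UL << 1)
#define MEM_ATTR_EXEC		(1UL << 2)
#define MEM_ATTR_IMMUTABLE	(1UL << 3)

/*
 * Hypothetical helper: replace the attributes of one page unless a
 * previous request froze them with MEM_ATTR_IMMUTABLE.
 */
static int sketch_attr_update(unsigned long *cur, unsigned long wanted)
{
	if (*cur & MEM_ATTR_IMMUTABLE)
		return -EPERM;	/* frozen: every later change is refused */

	*cur = wanted;
	return 0;
}
```

Under this model, a kernel text page that was set to read, execute and
immutable can no longer be made writable, which is why text patching
features stop working on the core kernel.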

Module loading/unloading and eBPF JIT are allowed without restrictions
for now, but we'll need a way to authenticate these code changes to
really improve the guests' security. We plan to use module signatures,
but there is no solution yet to authenticate eBPF programs.

Using ftrace and Kprobes in a secure way is a challenge that is not
solved yet. We're looking for ideas to make this work.

Likewise, the JUMP_LABEL feature cannot work because the kernel's text
section is read-only.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mickaël Salaün 
Signed-off-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* New patch
---
 arch/x86/kernel/kvm.c  | 11 ++
 arch/x86/kvm/mmu/mmu.c |  2 +-
 arch/x86/mm/heki.c | 21 ++
 include/linux/heki.h   | 26 
 virt/heki/Kconfig  |  1 +
 virt/heki/counters.c   | 90 --
 virt/heki/main.c   | 83 +-
 7 files changed, 229 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8349f4ad3bbd..343615b0e3bf 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1021,8 +1021,19 @@ static int kvm_lock_crs(void)
return err;
 }
 
+static int kvm_protect_memory(gpa_t pa)
+{
+   long err;
+
+   WARN_ON_ONCE(in_interrupt());
+
+   err = kvm_hypercall1(KVM_HC_PROTECT_MEMORY, pa);
+   return err;
+}
+
 static struct heki_hypervisor kvm_heki_hypervisor = {
.lock_crs = kvm_lock_crs,
+   .protect_memory = kvm_protect_memory,
 };
 
 static void kvm_init_heki(void)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d09bcc35462..13be05e9ccf1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7374,7 +7374,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
int level;
 
lockdep_assert_held_write(&kvm->mmu_lock);
-   lockdep_assert_held(&kvm->slots_lock);
+   lockdep_assert_held(&kvm->slots_arch_lock);
 
/*
 * The sequence matters here: upper levels consume the result of lower
diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c
index e4c60d8b4f2d..6c3fa9defada 100644
--- a/arch/x86/mm/heki.c
+++ b/arch/x86/mm/heki.c
@@ -45,6 +45,19 @@ __init void heki_arch_early_init(void)
heki_map(direct_map_end, kernel_end);
 }
 
+void heki_arch_late_init(void)
+{
+   /*
+* The permission counters for all existing kernel mappings have
+* already been updated. Now, walk all the pages, compute their
+* permissions from the counters and apply the permissions in the
+* host page table. To accomplish this, we walk the direct map
+* range.
+*/
+   heki_protect(direct_map_va, direct_map_end);
+   pr_warn("Guest memory protected\n");
+}
+
 unsigned long heki_flags_to_permissions(unsigned long flags)
 {
unsigned long permissions;
@@ -67,6 +80,11 @@ void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set,
*clear |= MEM_ATTR_EXEC;
 }
 
+unsigned long heki_default_permissions(void)
+{
+   return MEM_ATTR_READ | MEM_ATTR_WRITE;
+}
+
 static unsigned long heki_pgprot_to_flags(pgprot_t prot)
 {
unsigned long flags = 0;
@@ -100,6 +118,9 @@ static void heki_text_poke_common(struct page **pages, int npages,
heki_callback(&args);
}
 
+   if (args.head)
+   heki_apply_permissions(&args);
+
mutex_unlock(&heki_lock);
 }
 
diff --git a/include/linux/heki.h b/include/linux/heki.h
index 6f2cfddc6dac..306bcec7ae92 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -15,6 +15,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 #ifdef CONFIG_HEKI
@@ -61,6 +63,7 @@ struct heki_page_list {
  */
 struct heki_hypervisor {
int (*lock_crs)(void); /* Lock control registers. */
+   int (*protect_memory)(gpa_t pa); /* Protect guest memory */
 };
 
 /*
@@ -74,16 +77,28 @@ struct heki_hypervisor {
  * - a page is mapped into the kernel address space
  * - a pa

[RFC PATCH v2 01/19] virt: Introduce Hypervisor Enforced Kernel Integrity (Heki)

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Implement minimal code to introduce Heki:

- Define the config variables.

- Define a kernel command line parameter "heki" to turn the feature
  on or off. By default, Heki is on.

- Define heki_early_init() and call it in start_kernel(). Currently,
  this function only prints the value of the "heki" command
  line parameter.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mickaël Salaün 
Signed-off-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* Shrank this patch to only contain the minimal common parts.
* Moved heki_early_init() to start_kernel().
---
 Kconfig  |  2 ++
 arch/x86/Kconfig |  1 +
 include/linux/heki.h | 31 +++
 init/main.c  |  2 ++
 mm/mm_init.c |  1 +
 virt/Makefile|  1 +
 virt/heki/Kconfig| 19 +++
 virt/heki/Makefile   |  3 +++
 virt/heki/common.h   | 16 
 virt/heki/main.c | 32 
 10 files changed, 108 insertions(+)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/common.h
 create mode 100644 virt/heki/main.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..0c844d9bcb03 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"
 
 source "lib/Kconfig.debug"
 
+source "virt/heki/Kconfig"
+
 source "Documentation/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..424f949442bd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+   select ARCH_SUPPORTS_HEKI
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/include/linux/heki.h b/include/linux/heki.h
new file mode 100644
index 000000000000..4c18d2283392
--- /dev/null
+++ b/include/linux/heki.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Definitions
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#ifndef __HEKI_H__
+#define __HEKI_H__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_HEKI
+
+extern bool heki_enabled;
+
+void heki_early_init(void);
+
+#else /* !CONFIG_HEKI */
+
+static inline void heki_early_init(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
+#endif /* __HEKI_H__ */
diff --git a/init/main.c b/init/main.c
index 436d73261810..0d28301c5402 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -1047,6 +1048,7 @@ void start_kernel(void)
uts_ns_init();
key_init();
security_init();
+   heki_early_init();
dbg_late_init();
net_ns_init();
vfs_caches_init();
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..896977383cc3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
diff --git a/virt/Makefile b/virt/Makefile
index 1cfea9436af9..4550dc624466 100644
--- a/virt/Makefile
+++ b/virt/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  += lib/
+obj-$(CONFIG_HEKI) += heki/
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
new file mode 100644
index ..49695fff6d21
--- /dev/null
+++ b/virt/heki/Kconfig
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Hypervisor Enforced Kernel Integrity (Heki)
+
+config HEKI
+   bool "Hypervisor Enforced Kernel Integrity (Heki)"
+   depends on ARCH_SUPPORTS_HEKI
+   help
+ This feature enhances guest virtual machine security by taking
+ advantage of security features provided by the hypervisor for guests.
+ This feature is helpful in maintaining guest virtual machine security
+ even after the guest kernel has been compromised.
+
+config ARCH_SUPPORTS_HEKI
+   bool "Architecture support for Heki"
+   help
+ An architecture should select this when it can successfully build
+ and run with CONFIG_HEKI. That is, it should provide all of the
+ architecture support required for the HEKI feature.
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
new file mode 100644
index 000000000000..354e567df71c
--- /dev/null
+++ b/virt/heki/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-y += main.o
diff --git a/virt/heki/common.h b/virt/heki/commo

[RFC PATCH v2 17/19] heki: x86: Update permissions counters during text patching

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

x86 uses a function called __text_poke() to modify executable code. This
patching function is used by many features such as Kprobes and ftrace.

Update the permissions counters for the text page so that write
permissions can be temporarily established in the EPT to modify the
instructions in that page.
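
The counter idea can be sketched in userspace as follows. This is a
simplified model under stated assumptions: one counter per permission and
per page, incremented when a mapping is created and decremented when it
goes away. The struct and function names are hypothetical and the real
heki_update_counters() may behave differently.

```c
#define MEM_ATTR_READ	(1UL << 0)
#define MEM_ATTR_WRITE	(1UL << 1)
#define MEM_ATTR_EXEC	(1UL << 2)

/* Hypothetical per-page state: live mappings granting each permission. */
struct sketch_counters {
	unsigned int read;
	unsigned int write;
	unsigned int exec;
};

/* Account for a new mapping of the page with the given permissions. */
static void sketch_map(struct sketch_counters *c, unsigned long perms)
{
	if (perms & MEM_ATTR_READ)
		c->read++;
	if (perms & MEM_ATTR_WRITE)
		c->write++;
	if (perms & MEM_ATTR_EXEC)
		c->exec++;
}

/* Drop a mapping; a counter that is already zero stays at zero. */
static void sketch_unmap(struct sketch_counters *c, unsigned long perms)
{
	if ((perms & MEM_ATTR_READ) && c->read)
		c->read--;
	if ((perms & MEM_ATTR_WRITE) && c->write)
		c->write--;
	if ((perms & MEM_ATTR_EXEC) && c->exec)
		c->exec--;
}

/* A permission is needed while at least one mapping still uses it. */
static unsigned long sketch_effective(const struct sketch_counters *c)
{
	unsigned long perms = 0;

	if (c->read)
		perms |= MEM_ATTR_READ;
	if (c->write)
		perms |= MEM_ATTR_WRITE;
	if (c->exec)
		perms |= MEM_ATTR_EXEC;
	return perms;
}
```

In this model, a text page starts with a read-execute mapping. The
temporary writable alias created for poking adds a read-write reference,
so the page is effectively RWX while the poke runs. When the alias is
removed, the page returns to read-execute.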

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* New patch
---
 arch/x86/kernel/alternative.c |  5 
 arch/x86/mm/heki.c| 49 +++
 include/linux/heki.h  | 14 ++
 3 files changed, 68 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 517ee01503be..64fd8757ba5c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1801,6 +1802,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 */
pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
 
+   heki_text_poke_start(pages, cross_page_boundary ? 2 : 1, pgprot);
/*
 * The lock is not really needed, but this allows to avoid open-coding.
 */
@@ -1865,7 +1867,10 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
}
 
local_irq_restore(flags);
+
pte_unmap_unlock(ptep, ptl);
+   heki_text_poke_end(pages, cross_page_boundary ? 2 : 1, pgprot);
+
return addr;
 }
 
diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c
index c0eace9e343f..e4c60d8b4f2d 100644
--- a/arch/x86/mm/heki.c
+++ b/arch/x86/mm/heki.c
@@ -5,8 +5,11 @@
  * Copyright © 2023 Microsoft Corporation
  */
 
+#include 
+#include 
 #include 
 #include 
+#include 
 
 #ifdef pr_fmt
 #undef pr_fmt
@@ -63,3 +66,49 @@ void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set,
if (pgprot_val(prot) & _PAGE_NX)
*clear |= MEM_ATTR_EXEC;
 }
+
+static unsigned long heki_pgprot_to_flags(pgprot_t prot)
+{
+   unsigned long flags = 0;
+
+   if (pgprot_val(prot) & _PAGE_RW)
+   flags |= _PAGE_RW;
+   if (pgprot_val(prot) & _PAGE_NX)
+   flags |= _PAGE_NX;
+   return flags;
+}
+
+static void heki_text_poke_common(struct page **pages, int npages,
+ pgprot_t prot, enum heki_cmd cmd)
+{
+   struct heki_args args = {
+   .cmd = cmd,
+   };
+   unsigned long va = poking_addr;
+   int i;
+
+   if (!heki.counters)
+   return;
+
+   mutex_lock(&heki_lock);
+
+   for (i = 0; i < npages; i++, va += PAGE_SIZE) {
+   args.va = va;
+   args.pa = page_to_pfn(pages[i]) << PAGE_SHIFT;
+   args.size = PAGE_SIZE;
+   args.flags = heki_pgprot_to_flags(prot);
+   heki_callback(&args);
+   }
+
+   mutex_unlock(&heki_lock);
+}
+
+void heki_text_poke_start(struct page **pages, int npages, pgprot_t prot)
+{
+   heki_text_poke_common(pages, npages, prot, HEKI_MAP);
+}
+
+void heki_text_poke_end(struct page **pages, int npages, pgprot_t prot)
+{
+   heki_text_poke_common(pages, npages, prot, HEKI_UNMAP);
+}
diff --git a/include/linux/heki.h b/include/linux/heki.h
index 079b34af07f0..6f2cfddc6dac 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -111,6 +111,7 @@ typedef void (*heki_func_t)(struct heki_args *args);
 
 extern struct heki heki;
 extern bool heki_enabled;
+extern struct mutex heki_lock;
 
 extern bool __read_mostly enable_mbec;
 
@@ -123,12 +124,15 @@ void heki_map(unsigned long va, unsigned long end);
 void heki_update(unsigned long va, unsigned long end, unsigned long set,
 unsigned long clear);
 void heki_unmap(unsigned long va, unsigned long end);
+void heki_callback(struct heki_args *args);
 
 /* Arch-specific functions. */
 void heki_arch_early_init(void);
 unsigned long heki_flags_to_permissions(unsigned long flags);
 void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set,
unsigned long *clear);
+void heki_text_poke_start(struct page **pages, int npages, pgprot_t prot);
+void heki_text_poke_end(struct page **pages, int npages, pgprot_t prot);
 
 #else /* !CONFIG_HEKI */
 
@@ -149,6 +153,16 @@ static inline void heki_unmap(unsigned long va, unsigned long end)
 {
 }
 
+/* Arch-specific functions. */
+static inline void heki_text_poke_start(struct page **pages, int npages,
+   pgprot_t prot)
+{
+}
+static inline void heki_text_poke_end(struct page **pages, int npages,
+ pgprot_t prot)
+{
+}

[RFC PATCH v2 15/19] heki: x86: Initialize permissions counters for pages in vmap()/vunmap()

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

When a page gets mapped, create permissions counters for it and
initialize them based on the specified permissions.

When a page gets unmapped, update the counters appropriately.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* New patch
---
 include/linux/heki.h | 11 ++-
 mm/vmalloc.c |  7 +++
 virt/heki/counters.c | 20 
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 86c787d121e0..d660994d34d0 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -68,7 +68,11 @@ struct heki_hypervisor {
  * pointer into this heki structure.
  *
  * During guest kernel boot, permissions counters for each guest page are
- * initialized based on the page's current permissions.
+ * initialized based on the page's current permissions. Beyond this point,
+ * the counters are updated whenever:
+ *
+ * - a page is mapped into the kernel address space
+ * - a page is unmapped from the kernel address space
  */
 struct heki {
struct heki_hypervisor *hypervisor;
@@ -77,6 +81,7 @@ struct heki {
 
 enum heki_cmd {
HEKI_MAP,
+   HEKI_UNMAP,
 };
 
 /*
@@ -109,6 +114,7 @@ void heki_counters_init(void);
 void heki_walk(unsigned long va, unsigned long va_end, heki_func_t func,
   struct heki_args *args);
 void heki_map(unsigned long va, unsigned long end);
+void heki_unmap(unsigned long va, unsigned long end);
 
 /* Arch-specific functions. */
 void heki_arch_early_init(void);
@@ -125,6 +131,9 @@ static inline void heki_late_init(void)
 static inline void heki_map(unsigned long va, unsigned long end)
 {
 }
+static inline void heki_unmap(unsigned long va, unsigned long end)
+{
+}
 
 #endif /* CONFIG_HEKI */
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a3fedb3ee0db..d9096502e571 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -301,6 +302,8 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
+   heki_map(start, end);
+
return err;
 }
 
@@ -419,6 +422,8 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
pgtbl_mod_mask mask = 0;
 
BUG_ON(addr >= end);
+   heki_unmap(start, end);
+
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
@@ -564,6 +569,8 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
+   heki_map(start, end);
+
return 0;
 }
 
diff --git a/virt/heki/counters.c b/virt/heki/counters.c
index 7067449cabca..adc8d566b8a9 100644
--- a/virt/heki/counters.c
+++ b/virt/heki/counters.c
@@ -88,6 +88,13 @@ void heki_callback(struct heki_args *args)
heki_update_counters(counters, 0, permissions, 0);
break;
 
+   case HEKI_UNMAP:
+   if (WARN_ON_ONCE(!counters))
+   break;
+   heki_update_counters(counters, permissions, 0,
+permissions);
+   break;
+
default:
WARN_ON_ONCE(1);
break;
@@ -124,6 +131,19 @@ void heki_map(unsigned long va, unsigned long end)
heki_func(va, end, &args);
 }
 
+/*
+ * Find the mappings in the given range and revert the permission counters for
+ * them.
+ */
+void heki_unmap(unsigned long va, unsigned long end)
+{
+   struct heki_args args = {
+   .cmd = HEKI_UNMAP,
+   };
+
+   heki_func(va, end, &args);
+}
+
 /*
  * Permissions counters are associated with each guest page using the
  * Memory Table feature. Initialize the permissions counters here.
-- 
2.42.1




[RFC PATCH v2 19/19] virt: Add Heki KUnit tests

2023-11-12 Thread Mickaël Salaün
This adds a new CONFIG_HEKI_TEST option to run tests at boot. Because we
use some symbols not exported to modules (e.g., kernel_set_to_readonly),
these tests cannot be built as modules.

To run these tests, we need to boot the kernel with the heki_test=N boot
argument with N selecting a specific test:
1. heki_test_cr_disable_smep: Check CR pinning and try to disable SMEP.
2. heki_test_write_to_const: Check .rodata (const) protection.
3. heki_test_write_to_ro_after_init: Check __ro_after_init protection.
4. heki_test_exec: Check non-executable kernel memory.

This way of selecting tests should no longer be required once the kernel
properly handles the triggered synthetic page faults.  For now, these
page faults make the kernel loop.
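
The heki_test=N selection can be pictured with the userspace sketch
below. The four test names come from the list above; the table, the
sketch_heki_test_select() helper and the use of strtol() are assumptions
made for illustration and do not reflect how virt/heki/tests.c parses the
parameter.

```c
#include <stddef.h>
#include <stdlib.h>

/* Test names as listed in this message, indexed by N - 1. */
static const char *const sketch_heki_tests[] = {
	"heki_test_cr_disable_smep",
	"heki_test_write_to_const",
	"heki_test_write_to_ro_after_init",
	"heki_test_exec",
};

/*
 * Hypothetical helper: map the value given to heki_test= to a test name.
 * Returns NULL when the value is not a number between 1 and 4.
 */
static const char *sketch_heki_test_select(const char *value)
{
	char *end;
	long n = strtol(value, &end, 10);

	/* Reject empty input and trailing garbage. */
	if (end == value || *end != '\0')
		return NULL;
	if (n < 1 || n > 4)
		return NULL;
	return sketch_heki_tests[n - 1];
}
```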

All these tests temporarily disable the related kernel self-protections
and should then fail if Heki doesn't protect the kernel.  They are
verbose to make it easier to understand what is going on.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* Move all tests to virt/heki/tests.c
---
 include/linux/heki.h |   1 +
 virt/heki/Kconfig|  12 +++
 virt/heki/Makefile   |   1 +
 virt/heki/main.c |   6 +-
 virt/heki/tests.c| 207 +++
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 virt/heki/tests.c

diff --git a/include/linux/heki.h b/include/linux/heki.h
index 306bcec7ae92..9e2cf0051ab0 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -149,6 +149,7 @@ void heki_protect(unsigned long va, unsigned long end);
 void heki_add_pa(struct heki_args *args, phys_addr_t pa,
 unsigned long permissions);
 void heki_apply_permissions(struct heki_args *args);
+void heki_run_test(void);
 
 /* Arch-specific functions. */
 void heki_arch_early_init(void);
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 9bde84cd759e..fa814a921bb0 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -28,3 +28,15 @@ config HYPERVISOR_SUPPORTS_HEKI
  A hypervisor should select this when it can successfully build
  and run with CONFIG_HEKI. That is, it should provide all of the
  hypervisor support required for the Heki feature.
+
+config HEKI_TEST
+   bool "Tests for Heki" if !KUNIT_ALL_TESTS
+   depends on HEKI && KUNIT=y
+   default KUNIT_ALL_TESTS
+   help
+ Run Heki tests at runtime according to the heki_test=N boot
+ parameter, with N identifying the test to run (between 1 and 4).
+
+ Before launching the init process, the system might not respond
+ because of unhandled kernel page fault.  This will be fixed in a
+ next patch series.
diff --git a/virt/heki/Makefile b/virt/heki/Makefile
index 564f92faa9d8..a66cd0ba140b 100644
--- a/virt/heki/Makefile
+++ b/virt/heki/Makefile
@@ -3,3 +3,4 @@
 obj-y += main.o
 obj-y += walk.o
 obj-y += counters.o
+obj-y += tests.o
diff --git a/virt/heki/main.c b/virt/heki/main.c
index 5629334112e7..ce9984231996 100644
--- a/virt/heki/main.c
+++ b/virt/heki/main.c
@@ -51,8 +51,10 @@ void heki_late_init(void)
 {
struct heki_hypervisor *hypervisor = heki.hypervisor;
 
-   if (!heki.counters)
+   if (!heki.counters) {
+   heki_run_test();
return;
+   }
 
/* Locks control registers so a compromised guest cannot change them. */
if (WARN_ON(hypervisor->lock_crs()))
@@ -61,6 +63,8 @@ void heki_late_init(void)
pr_warn("Control registers locked\n");
 
heki_arch_late_init();
+
+   heki_run_test();
 }
 
 /*
diff --git a/virt/heki/tests.c b/virt/heki/tests.c
new file mode 100644
index 000000000000..6e6542b257f1
--- /dev/null
+++ b/virt/heki/tests.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Common code
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "common.h"
+
+#ifdef CONFIG_HEKI_TEST
+
+/* Heki test data */
+
+/* Takes two pages to not change permission of other read-only pages. */
+const char heki_test_const_buf[PAGE_SIZE * 2] = {};
+char heki_test_ro_after_init_buf[PAGE_SIZE * 2] __ro_after_init = {};
+
+long heki_test_exec_data(long);
+void _test_exec_data_end(void);
+
+/* Used to test ROP execution against the .rodata section. */
+/* clang-format off */
+asm(
+".pushsection .rodata;" // NOT .text section
+".global heki_test_exec_data;"
+".type heki_test_exec_data, @function;"
+"heki_test_exec_data:"
+ASM_ENDBR
+"movq %rdi, %rax;"
+"inc %rax;"
+ASM_RET
+".size heki_test_exec_da

[RFC PATCH v2 10/19] KVM: x86: Implement per-guest-page permissions

2023-11-12 Thread Mickaël Salaün
Define memory attributes that can be associated with guest physical
pages in KVM. To begin with, define permissions as memory attributes
(READ, WRITE and EXECUTE), and the IMMUTABLE property. In the future,
other attributes could be defined.

Use the memory attribute feature to implement the following functions in
KVM:

- kvm_permissions_set(): Set the permissions for a guest page in the
  memory attribute XArray.

- kvm_permissions_get(): Retrieve the permissions associated with a
  guest page in the same XArray.

These functions will be called in a following commit to associate proper
permissions with guest pages instead of RWX for all the pages.

Add 4 new memory attributes, private to the KVM implementation:
- KVM_MEMORY_ATTRIBUTE_HEKI_READ
- KVM_MEMORY_ATTRIBUTE_HEKI_WRITE
- KVM_MEMORY_ATTRIBUTE_HEKI_EXEC
- KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE
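
The relation between the generic MEM_ATTR_* permissions and the
KVM_MEMORY_ATTRIBUTE_HEKI_* bits can be sketched as a bit-by-bit
translation. The constant values are the ones defined in this patch. The
two helper functions are hypothetical, and whether
virt/lib/kvm_permissions.c translates the bits this way is an assumption.

```c
#include <stdint.h>

/* Values as defined in include/linux/kvm_mem_attr.h by this patch. */
#define MEM_ATTR_READ		(1UL << 0)
#define MEM_ATTR_WRITE		(1UL << 1)
#define MEM_ATTR_EXEC		(1UL << 2)
#define MEM_ATTR_IMMUTABLE	(1UL << 3)

/* Values as defined in include/uapi/linux/kvm.h by this patch. */
#define KVM_MEMORY_ATTRIBUTE_HEKI_READ		(1ULL << 4)
#define KVM_MEMORY_ATTRIBUTE_HEKI_WRITE		(1ULL << 5)
#define KVM_MEMORY_ATTRIBUTE_HEKI_EXEC		(1ULL << 6)
#define KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE	(1ULL << 7)

/* Hypothetical helper: generic permissions to KVM attribute bits. */
static uint64_t sketch_mem_attr_to_kvm(unsigned long attr)
{
	uint64_t kvm_attr = 0;

	if (attr & MEM_ATTR_READ)
		kvm_attr |= KVM_MEMORY_ATTRIBUTE_HEKI_READ;
	if (attr & MEM_ATTR_WRITE)
		kvm_attr |= KVM_MEMORY_ATTRIBUTE_HEKI_WRITE;
	if (attr & MEM_ATTR_EXEC)
		kvm_attr |= KVM_MEMORY_ATTRIBUTE_HEKI_EXEC;
	if (attr & MEM_ATTR_IMMUTABLE)
		kvm_attr |= KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE;
	return kvm_attr;
}

/* Hypothetical helper: the reverse direction, ignoring unrelated bits. */
static unsigned long sketch_kvm_to_mem_attr(uint64_t kvm_attr)
{
	unsigned long attr = 0;

	if (kvm_attr & KVM_MEMORY_ATTRIBUTE_HEKI_READ)
		attr |= MEM_ATTR_READ;
	if (kvm_attr & KVM_MEMORY_ATTRIBUTE_HEKI_WRITE)
		attr |= MEM_ATTR_WRITE;
	if (kvm_attr & KVM_MEMORY_ATTRIBUTE_HEKI_EXEC)
		attr |= MEM_ATTR_EXEC;
	if (kvm_attr & KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE)
		attr |= MEM_ATTR_IMMUTABLE;
	return attr;
}
```

Keeping the Heki bits at positions 4 to 7 leaves bit 3 free for
KVM_MEMORY_ATTRIBUTE_PRIVATE, so both kinds of attributes can share one
XArray entry.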

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Madhavan T. Venkataraman 
Signed-off-by: Madhavan T. Venkataraman 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* New patch replacing the deprecated page tracking mechanism.
* Add new files: virt/lib/kvm_permissions.c and
  include/linux/kvm_mem_attr.h
* Add new kvm_permissions_get() and kvm_permissions_set() leveraging
  the to-be-upstream memory attributes for KVM.
* Introduce the KVM_MEMORY_ATTRIBUTE_HEKI_* values.
---
 arch/x86/kvm/Kconfig |   1 +
 arch/x86/kvm/Makefile|   4 +-
 include/linux/kvm_mem_attr.h |  32 +++
 include/uapi/linux/kvm.h |   5 ++
 virt/heki/Kconfig|   1 +
 virt/lib/kvm_permissions.c   | 104 +++
 6 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kvm_mem_attr.h
 create mode 100644 virt/lib/kvm_permissions.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7a3b52b7e456..ea6d73241632 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,7 @@ config KVM
select HAVE_KVM_PM_NOTIFIER if PM
select KVM_GENERIC_HARDWARE_ENABLING
select HYPERVISOR_SUPPORTS_HEKI
+   select SPARSEMEM
help
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 80e3fe184d17..aac51a5d2cae 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,10 +9,12 @@ endif
 
 include $(srctree)/virt/kvm/Makefile.kvm
 
+VIRT_LIB = ../../../virt/lib
+
 kvm-y  += x86.o emulate.o i8259.o irq.o lapic.o \
   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
   hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
-  mmu/spte.o
+  mmu/spte.o $(VIRT_LIB)/kvm_permissions.o
 
 ifdef CONFIG_HYPERV
 kvm-y  += kvm_onhyperv.o
diff --git a/include/linux/kvm_mem_attr.h b/include/linux/kvm_mem_attr.h
new file mode 100644
index 000000000000..0a755025e553
--- /dev/null
+++ b/include/linux/kvm_mem_attr.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM guest page permissions - Definitions.
+ *
+ * Copyright © 2023 Microsoft Corporation.
+ */
+#ifndef __KVM_MEM_ATTR_H__
+#define __KVM_MEM_ATTR_H__
+
+#include 
+#include 
+
+/* clang-format off */
+
+#define MEM_ATTR_READ  BIT(0)
+#define MEM_ATTR_WRITE BIT(1)
+#define MEM_ATTR_EXEC  BIT(2)
+#define MEM_ATTR_IMMUTABLE BIT(3)
+
+#define MEM_ATTR_PROT ( \
+   MEM_ATTR_READ | \
+   MEM_ATTR_WRITE | \
+   MEM_ATTR_EXEC | \
+   MEM_ATTR_IMMUTABLE)
+
+/* clang-format on */
+
+int kvm_permissions_set(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end,
+   unsigned long heki_attr);
+unsigned long kvm_permissions_get(struct kvm *kvm, gfn_t gfn);
+
+#endif /* __KVM_MEM_ATTR_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2477b4a16126..2b5b90216565 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2319,6 +2319,11 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
 
+#define KVM_MEMORY_ATTRIBUTE_HEKI_READ (1ULL << 4)
+#define KVM_MEMORY_ATTRIBUTE_HEKI_WRITE(1ULL << 5)
+#define KVM_MEMORY_ATTRIBUTE_HEKI_EXEC (1ULL << 6)
+#define KVM_MEMORY_ATTRIBUTE_HEKI_IMMUTABLE(1ULL << 7)
+
 #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
 struct kvm_create_guest_memfd {
diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 5ea75b595667..75a784653e31 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -5,6 +5,7 @@
 config HEKI
bool "H

[RFC PATCH v2 11/19] KVM: x86: Add new hypercall to set EPT permissions

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Add a new KVM_HC_PROTECT_MEMORY hypercall that enables a guest to set
EPT permissions for guest pages.

Until now, all of the guest pages (except Page Tracked pages) are given
RWX permissions in the EPT. In Heki, we want to restrict the permissions
to what is strictly needed. For instance, a text page only needs R_X. A
read-only data page only needs R__. A normal data page only needs RW_.
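
The R_X, R__ and RW_ notation can be tied to the MEM_ATTR_* bits with a
short sketch. The MEM_ATTR_* values are those used in this series; the
SKETCH_PERM_* names and the sketch_perm_string() helper are hypothetical
and only serve the illustration.

```c
#define MEM_ATTR_READ	(1UL << 0)
#define MEM_ATTR_WRITE	(1UL << 1)
#define MEM_ATTR_EXEC	(1UL << 2)

/* Hypothetical names for the three page kinds mentioned above. */
#define SKETCH_PERM_TEXT	(MEM_ATTR_READ | MEM_ATTR_EXEC)
#define SKETCH_PERM_RODATA	(MEM_ATTR_READ)
#define SKETCH_PERM_DATA	(MEM_ATTR_READ | MEM_ATTR_WRITE)

/* Render permissions in the R/W/X notation; out must hold 4 bytes. */
static void sketch_perm_string(unsigned long perms, char out[4])
{
	out[0] = (perms & MEM_ATTR_READ) ? 'R' : '_';
	out[1] = (perms & MEM_ATTR_WRITE) ? 'W' : '_';
	out[2] = (perms & MEM_ATTR_EXEC) ? 'X' : '_';
	out[3] = '\0';
}
```

Each permission is a separate bit, so a caller has to spell out every
permission it needs (for text, both read and execute).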

The guest will pass a page list to the hypercall. The page list is a
list of one or more physical pages, each of which contains an array of
guest ranges and attributes. Currently, the attributes only contain
permissions. In the future, other attributes may be added.  The
hypervisor will apply the specified permissions in the EPT.
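
One way to picture such a page list is sketched below. The layout is
purely hypothetical: the field names, the 4096-byte page size and the
sketch_* identifiers are assumptions, and the actual struct
heki_page_list defined by this series may be laid out differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical entry: one guest-physical range and its attributes. */
struct sketch_range {
	uint64_t gpa_start;
	uint64_t gpa_end;	/* exclusive */
	uint64_t attributes;
};

/* Hypothetical header placed at the start of each list page. */
struct sketch_page_list {
	uint64_t next_pa;	/* physical address of the next page, 0 if last */
	uint64_t nranges;	/* number of valid entries in ranges[] */
	struct sketch_range ranges[];
};

/* Number of entries that fit in one page after the header. */
#define SKETCH_MAX_RANGES \
	((SKETCH_PAGE_SIZE - sizeof(struct sketch_page_list)) / \
	 sizeof(struct sketch_range))

/* Append a range; returns false when the page is full. */
static bool sketch_add_range(struct sketch_page_list *list, uint64_t start,
			     uint64_t end, uint64_t attributes)
{
	if (list->nranges >= SKETCH_MAX_RANGES)
		return false;

	list->ranges[list->nranges].gpa_start = start;
	list->ranges[list->nranges].gpa_end = end;
	list->ranges[list->nranges].attributes = attributes;
	list->nranges++;
	return true;
}
```

When a page fills up, the caller would allocate another page and chain it
through next_pa, which is how discontiguous ranges (for instance, module
pages) can be described in a single hypercall.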

When a guest tries to access its memory in a way that is not allowed, KVM
creates a synthetic kernel page fault. This fault should be handled by
the guest, which is not currently the case, so the guest retries the
access again and again.  Handling this fault will be part of a follow-up
patch series.

When enabled, KASAN reveals a bug in the memory attributes patches. We
have not found the source of this issue yet.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mickaël Salaün 
Signed-off-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:

The original hypercall contained support for statically defined sections
(text, rodata, etc). It has been redesigned like this:

- The previous version accepted an array of physically contiguous
  ranges. This is appropriate for statically defined sections which are
  loaded in contiguous memory.  But, for other cases like module
  loading, the pages would be discontiguous. The current version of the
  hypercall accepts a page list to fix this.

- The previous version passed permission combinations. E.g.,
  HEKI_MEM_ATTR_EXEC would imply R_X. The current version passes
  permissions as memory attributes and each of the permissions must be
  separately specified. E.g., for text, (MEM_ATTR_READ | MEM_ATTR_EXEC)
  must be passed.

- The previous version locked down the permissions for guest pages so
  that once the permissions are set, they cannot be changed. In this
  version, permissions can be changed dynamically, except when the
  MEM_ATTR_IMMUTABLE is set.  So, the hypercall has been renamed from
  KVM_HC_LOCK_MEM_PAGE_RANGES to KVM_HC_PROTECT_MEMORY. The dynamic
  setting of permissions is needed by the following features (probably
  not a complete list):
  - Kprobes and Optprobes
  - Static call optimization
  - Jump Label optimization
  - Ftrace and Livepatch
  - Module loading and unloading
  - eBPF JIT
  - Kexec
  - Kgdb

Examples:
- A text page can be made writable very briefly to install a probe or a
  trace.
- eBPF JIT can populate a writable page with code and make it
  read-execute.
- Module load can load read-only data into a writable page and make the
  page read-only.
- When pages are unmapped, their permissions in the EPT must revert to
  read-write.
---
 Documentation/virt/kvm/x86/hypercalls.rst |  14 +++
 arch/x86/kvm/mmu/mmu.c|  77 +
 arch/x86/kvm/mmu/paging_tmpl.h|   3 +
 arch/x86/kvm/mmu/spte.c   |  15 ++-
 arch/x86/kvm/x86.c| 130 ++
 include/linux/heki.h  |  29 +
 include/uapi/linux/kvm_para.h |   1 +
 7 files changed, 267 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 3178576f4c47..28865d111773 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -207,3 +207,17 @@ The hypercall lets a guest request control register flags to be pinned for
 itself.
 
 Returns 0 on success or a KVM error code otherwise.
+
+10. KVM_HC_PROTECT_MEMORY
+-------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Request permissions to be set in EPT
+
+- a0: physical address of a struct heki_page_list
+
+The hypercall lets a guest request memory permissions to be set for a list
+of physical pages.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2024ff21d036..2d09bcc35462 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -47,9 +47,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -4446,6 +4448,75 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
   mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
+static bool mem_attr_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+   unsigned long perm;
+   bool noexec, nowrite;
+
+   if (unlikely(fa

[RFC PATCH v2 04/19] heki: Lock guest control registers at the end of guest kernel init

2023-11-12 Thread Mickaël Salaün
The hypervisor needs to provide some functions to support Heki. These
form the Heki-Hypervisor API.

Define a heki_hypervisor structure to house the API functions. A
hypervisor that supports Heki must instantiate a heki_hypervisor
structure and pass it to the Heki common code. This allows the common
code to access these functions in a hypervisor-agnostic way.

The first function that is implemented is lock_crs() (lock control
registers). That is, certain flags in the control registers are pinned
so that they can never be changed for the lifetime of the guest.

Implement Heki support in the guest:

- Each supported hypervisor in x86 implements a set of functions for the
  guest kernel. Add an init_heki() function to that set.  This function
  initializes Heki-related stuff. Call init_heki() for the detected
  hypervisor in init_hypervisor_platform().

- Implement init_heki() for the guest.

- Implement kvm_lock_crs() in the guest to lock down control registers.
  This function calls a KVM hypercall to do the job.

- Instantiate a heki_hypervisor structure that contains a pointer to
  kvm_lock_crs().

- Pass the heki_hypervisor structure to Heki common code in init_heki().

Implement a heki_late_init() function and call it at the end of kernel
init. This function calls lock_crs(). In other words, control registers
of a guest are locked down at the end of guest kernel init.
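
The registration flow described above can be sketched in plain C as
follows. This is a simplified user-space model, not the kernel code: the
demo_* names and heki_late_init_sketch() are hypothetical, and the real
structures carry more members.

```c
#include <stddef.h>

/* Simplified stand-in for the Heki-Hypervisor API table. */
struct heki_hypervisor {
	int (*lock_crs)(void);
};

/* Simplified stand-in for the common Heki state. */
struct heki {
	struct heki_hypervisor *hypervisor;
};

static struct heki heki;

/* Hypothetical hypervisor backend: counts how often it was asked to lock. */
static int demo_lock_calls;

static int demo_lock_crs(void)
{
	demo_lock_calls++;
	return 0;
}

static struct heki_hypervisor demo_hypervisor = {
	.lock_crs = demo_lock_crs,
};

/* Plays the role of init_heki(): plug the backend into the common code. */
static void demo_init_heki(void)
{
	heki.hypervisor = &demo_hypervisor;
}

/*
 * Plays the role of heki_late_init(): lock the control registers if a
 * backend was registered, otherwise report that nothing could be done.
 */
static int heki_late_init_sketch(void)
{
	if (!heki.hypervisor || !heki.hypervisor->lock_crs)
		return -1;

	return heki.hypervisor->lock_crs();
}
```

The common code only ever dereferences the function table, which is what
keeps it hypervisor-agnostic.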

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Madhavan T. Venkataraman 
Signed-off-by: Madhavan T. Venkataraman 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* Shrank the patch to only manage the CR pinning.
---
 arch/x86/include/asm/x86_init.h  |  1 +
 arch/x86/kernel/cpu/hypervisor.c |  1 +
 arch/x86/kernel/kvm.c| 56 
 arch/x86/kvm/Kconfig |  1 +
 include/linux/heki.h | 22 +
 init/main.c  |  1 +
 virt/heki/Kconfig|  9 -
 virt/heki/main.c | 25 ++
 8 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5240d88db52a..ff4dfd2f615e 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -127,6 +127,7 @@ struct x86_hyper_init {
bool (*msi_ext_dest_id)(void);
void (*init_mem_mapping)(void);
void (*init_after_bootmem)(void);
+   void (*init_heki)(void);
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 553bfbfc3a1b..6085c8129e0c 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void)
 
x86_hyper_type = h->type;
x86_init.hyper.init_platform();
+   x86_init.hyper.init_heki();
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ab9ee5896c..8349f4ad3bbd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -997,6 +998,60 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_HEKI
+
+extern unsigned long cr4_pinned_mask;
+
+/*
+ * TODO: Check SMP policy consistency, e.g. with
+ * this_cpu_read(cpu_tlbstate.cr4)
+ */
+static int kvm_lock_crs(void)
+{
+   unsigned long cr4;
+   int err;
+
+   err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP, 0);
+   if (err)
+   return err;
+
+   cr4 = __read_cr4();
+   err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask,
+0);
+   return err;
+}
+
+static struct heki_hypervisor kvm_heki_hypervisor = {
+   .lock_crs = kvm_lock_crs,
+};
+
+static void kvm_init_heki(void)
+{
+   long err;
+
+   if (!kvm_para_available()) {
+   /* Cannot make KVM hypercalls. */
+   return;
+   }
+
+   err = kvm_hypercall3(KVM_HC_LOCK_CR_UPDATE, 0, 0,
+KVM_LOCK_CR_UPDATE_VERSION);
+   if (err < 1) {
+   /* Ignores host not supporting at least the first version. */
+   return;
+   }
+
+   heki.hypervisor = &kvm_heki_hypervisor;
+}
+
+#else /* CONFIG_HEKI */
+
+static void kvm_init_heki(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name   = "KVM",
.detect = kvm_detect,
@@ -1005,6 +1060,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.init.x2apic_available  = kvm_para_available,
.init.msi_ext_dest_id   = kvm_msi_ext_dest_id,
.init.init_platform

[RFC PATCH v2 03/19] KVM: x86: Add notifications for Heki policy configuration and violation

2023-11-12 Thread Mickaël Salaün
Add an interface for user space to be notified about guests' Heki policy
and related violations.

Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
returned value is the bitmask of known Heki exit reasons, for now:
KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.

If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
KVM_HC_LOCK_CR_UPDATE hypercall according to the requested control
register. This makes the VMM aware of the restrictions the guest imposes
on itself.

If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
pinned CR violation. This enables the VMM to react to a policy
violation.
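
A minimal sketch of the bitmask handling described above, assuming
hypothetical bit values for the two exit reasons and a hypothetical
struct vm_sketch in place of struct kvm. It mirrors the idea that unknown
bits are rejected and that an exit to user space only happens for reasons
the VMM asked for:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones are defined in the uapi header. */
#define KVM_HEKI_EXIT_REASON_CR0 (1ULL << 0)
#define KVM_HEKI_EXIT_REASON_CR4 (1ULL << 1)

#define KVM_HEKI_EXIT_REASON_VALID_MASK \
	(KVM_HEKI_EXIT_REASON_CR0 | KVM_HEKI_EXIT_REASON_CR4)

/* Hypothetical stand-in for the per-VM state touched by the capability. */
struct vm_sketch {
	uint64_t heki_configure_exit_reason;
};

/* Accept the requested reasons only if every bit is a known reason. */
static int enable_heki_configure(struct vm_sketch *vm, uint64_t requested)
{
	if (requested & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
		return -EINVAL;

	vm->heki_configure_exit_reason = requested;
	return 0;
}

/* True if a configuration event for @reason must be forwarded to the VMM. */
static bool wants_configure_exit(const struct vm_sketch *vm, uint64_t reason)
{
	return (vm->heki_configure_exit_reason & reason) != 0;
}
```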

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* New patch. Making user space aware of Heki properties was requested by
  Sean Christopherson.
---
 arch/x86/kvm/vmx/vmx.c   |   5 +-
 arch/x86/kvm/x86.c   | 114 +++
 arch/x86/kvm/x86.h   |   7 +--
 include/linux/kvm_host.h |   2 +
 include/uapi/linux/kvm.h |  22 
 5 files changed, 136 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f487bf16dd96..b631b1d7ba30 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5444,6 +5444,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
int reg;
int err;
int ret;
+   bool exit = false;
 
exit_qualification = vmx_get_exit_qual(vcpu);
cr = exit_qualification & 15;
@@ -5453,8 +5454,8 @@ static int handle_cr(struct kvm_vcpu *vcpu)
val = kvm_register_read(vcpu, reg);
trace_kvm_cr_write(cr, val);
 
-   ret = heki_check_cr(vcpu, cr, val);
-   if (ret)
+   ret = heki_check_cr(vcpu, cr, val, &exit);
+   if (exit)
return ret;
 
switch (cr) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4e6c4c21f12c..43c28a6953bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -119,6 +119,10 @@ static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
 
 #define KVM_CAP_PMU_VALID_MASK KVM_PMU_CAP_DISABLE
 
+#define KVM_HEKI_EXIT_REASON_VALID_MASK ( \
+   KVM_HEKI_EXIT_REASON_CR0 | \
+   KVM_HEKI_EXIT_REASON_CR4)
+
 #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \
 KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
 
@@ -4644,6 +4648,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
r |= BIT(KVM_X86_SW_PROTECTED_VM);
break;
+   case KVM_CAP_HEKI_CONFIGURE:
+   case KVM_CAP_HEKI_DENIAL:
+   r = KVM_HEKI_EXIT_REASON_VALID_MASK;
+   break;
default:
break;
}
@@ -6518,6 +6526,22 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->lock);
break;
+#ifdef CONFIG_HEKI
+   case KVM_CAP_HEKI_CONFIGURE:
+   r = -EINVAL;
+   if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
+   break;
+   kvm->heki_configure_exit_reason = cap->args[0];
+   r = 0;
+   break;
+   case KVM_CAP_HEKI_DENIAL:
+   r = -EINVAL;
+   if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
+   break;
+   kvm->heki_denial_exit_reason = cap->args[0];
+   r = 0;
+   break;
+#endif
default:
r = -EINVAL;
break;
@@ -8056,11 +8080,60 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
 
 #ifdef CONFIG_HEKI
 
+static int complete_heki_configure_exit(struct kvm_vcpu *const vcpu)
+{
+   kvm_rax_write(vcpu, 0);
+   ++vcpu->stat.hypercalls;
+   return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int complete_heki_denial_exit(struct kvm_vcpu *const vcpu)
+{
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+}
+
+/* Returns true if the @exit_reason is handled by @vcpu->kvm. */
+static bool heki_exit_cr(struct kvm_vcpu *const vcpu, const __u32 exit_reason,
+const u64 heki_reason, unsigned long value)
+{
+   switch (exit_reason) {
+   case KVM_EXIT_HEKI_CONFIGURE:
+   if (!(vcpu->kvm->heki_configure_exit_reason & heki_reason))
+   return false;
+
+   vcpu->run->heki_configure.reason = heki_reason;
+

[RFC PATCH v2 05/19] KVM: VMX: Add MBEC support

2023-11-12 Thread Mickaël Salaün
This change adds support for VMX_FEATURE_MODE_BASED_EPT_EXEC (named
ept_mode_based_exec in /proc/cpuinfo and MBEC elsewhere), which makes it
possible to separate EPT execution bits for supervisor vs. user.  It
changes the semantics of VMX_EPT_EXECUTABLE_MASK from global execution
to kernel execution, and uses the VMX_EPT_USER_EXECUTABLE_MASK bit to
identify user execution.

The main use case is to be able to restrict kernel execution while
ignoring user space execution from the hypervisor point of view.
Indeed, user space execution can already be restricted by the guest
kernel.

This change enables MBEC but doesn't change the default configuration,
which is to allow execution for all guest memory.  However, the next
commit leverages MBEC to restrict kernel memory pages.

MBEC can be configured with the new "enable_mbec" module parameter, set
to true by default.  However, MBEC is disabled for L1 and L2 for now.

The MMU tracepoints are updated to reflect the difference between kernel
and user space executions, see is_executable_pte().

Replace EPT_VIOLATION_RWX_MASK (3 bits) with 4 dedicated
EPT_VIOLATION_READ, EPT_VIOLATION_WRITE, EPT_VIOLATION_KERNEL_INSTR, and
EPT_VIOLATION_USER_INSTR bits.

From the Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3C (System Programming Guide), Part 3:

SECONDARY_EXEC_MODE_BASED_EPT_EXEC (bit 22):
If either the "unrestricted guest" VM-execution control or the
"mode-based execute control for EPT" VM-execution control is 1, the
"enable EPT" VM-execution control must also be 1.

EPT_VIOLATION_KERNEL_INSTR_BIT (bit 5):
The logical-AND of bit 2 in the EPT paging-structure entries used to
translate the guest-physical address of the access causing the EPT
violation.  If the "mode-based execute control for EPT" VM-execution
control is 0, this indicates whether the guest-physical address was
executable. If that control is 1, this indicates whether the
guest-physical address was executable for supervisor-mode linear
addresses.

EPT_VIOLATION_USER_INSTR_BIT (bit 6):
If the "mode-based execute control" VM-execution control is 0, the value
of this bit is undefined. If that control is 1, this bit is the
logical-AND of bit 10 in the EPT paging-structures entries used to
translate the guest-physical address of the access causing the EPT
violation. In this case, it indicates whether the guest-physical address
was executable for user-mode linear addresses.

PT_USER_EXEC_MASK (bit 10):
Execute access for user-mode linear addresses. If the "mode-based
execute control for EPT" VM-execution control is 1, indicates whether
instruction fetches are allowed from user-mode linear addresses in the
512-GByte region controlled by this entry. If that control is 0, this
bit is ignored.
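
To illustrate the SDM excerpts above, here is a small sketch that reads
the execute-permission bits out of an EPT-violation exit qualification.
The bit positions (5 and 6) come from this patch; the macro spellings and
the helper name gpa_executable_for() are made up for the example and are
not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as introduced by this patch (exit qualification). */
#define EPT_VIOLATION_KERNEL_INSTR_BIT 5
#define EPT_VIOLATION_USER_INSTR_BIT   6

#define EPT_VIOLATION_KERNEL_INSTR (1ULL << EPT_VIOLATION_KERNEL_INSTR_BIT)
#define EPT_VIOLATION_USER_INSTR   (1ULL << EPT_VIOLATION_USER_INSTR_BIT)

/*
 * Report whether the faulting guest-physical address was executable for
 * the given mode.  Without MBEC, bit 5 covers all execution and bit 6 is
 * undefined, so it is ignored.  With MBEC, bit 5 covers supervisor-mode
 * linear addresses and bit 6 covers user-mode linear addresses.
 */
static bool gpa_executable_for(uint64_t exit_qual, bool mbec_enabled,
			       bool user_mode)
{
	if (!mbec_enabled)
		return exit_qual & EPT_VIOLATION_KERNEL_INSTR;

	if (user_mode)
		return exit_qual & EPT_VIOLATION_USER_INSTR;

	return exit_qual & EPT_VIOLATION_KERNEL_INSTR;
}
```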

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* Import the MMU tracepoint changes from the v1's "Enable guests to lock
  themselves thanks to MBEC" patch.
---
 arch/x86/include/asm/vmx.h  | 11 +--
 arch/x86/kvm/mmu.h  |  3 ++-
 arch/x86/kvm/mmu/mmu.c  |  8 ++--
 arch/x86/kvm/mmu/mmutrace.h | 11 +++
 arch/x86/kvm/mmu/paging_tmpl.h  | 16 ++--
 arch/x86/kvm/mmu/spte.c |  4 +++-
 arch/x86/kvm/mmu/spte.h | 15 +--
 arch/x86/kvm/vmx/capabilities.h |  7 +++
 arch/x86/kvm/vmx/nested.c   |  7 +++
 arch/x86/kvm/vmx/vmx.c  | 29 ++---
 arch/x86/kvm/vmx/vmx.h  |  1 +
 11 files changed, 95 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..7fd390484b36 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -513,6 +513,7 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT   (1ull << 6)
 #define VMX_EPT_ACCESS_BIT (1ull << 8)
 #define VMX_EPT_DIRTY_BIT  (1ull << 9)
+#define VMX_EPT_USER_EXECUTABLE_MASK   (1ull << 10)
 #define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
 VMX_EPT_WRITABLE_MASK |   \
 VMX_EPT_EXECUTABLE_MASK)
@@ -558,13 +559,19 @@ enum vm_entry_failure_code {
 #define EPT_VIOLATION_ACC_READ_BIT 0
 #define EPT_VIOLATION_ACC_WRITE_BIT1
 #define EPT_VIOLATION_ACC_INSTR_BIT2
-#define EPT_VIOLATION_RWX_SHIFT3
+#define EPT_VIOLATION_READ_BIT 3
+#define EPT_VIOLATION_WRITE_BIT4
+#define EPT_VIOLATION_KERNEL_INSTR_BIT 5
+#define EPT_VIOLATION_USER_INSTR_BIT   6
 #define EPT_VIOLATION_GVA_IS_VALID_BIT 7
 #define 

[RFC PATCH v2 16/19] heki: x86: Update permissions counters when guest page permissions change

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

When permissions are changed on an existing mapping, update the
permissions counters.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* New patch
---
 arch/x86/mm/heki.c   |  9 +++
 arch/x86/mm/pat/set_memory.c | 51 
 include/linux/heki.h | 14 ++
 virt/heki/counters.c | 23 
 4 files changed, 97 insertions(+)

diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c
index c495df0d8772..c0eace9e343f 100644
--- a/arch/x86/mm/heki.c
+++ b/arch/x86/mm/heki.c
@@ -54,3 +54,12 @@ unsigned long heki_flags_to_permissions(unsigned long flags)
 
return permissions;
 }
+
+void heki_pgprot_to_permissions(pgprot_t prot, unsigned long *set,
+   unsigned long *clear)
+{
+   if (pgprot_val(prot) & _PAGE_RW)
+   *set |= MEM_ATTR_WRITE;
+   if (pgprot_val(prot) & _PAGE_NX)
+   *clear |= MEM_ATTR_EXEC;
+}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..6aaa1ce5692c 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2056,11 +2057,56 @@ int clear_mce_nospec(unsigned long pfn)
 EXPORT_SYMBOL_GPL(clear_mce_nospec);
 #endif /* CONFIG_X86_64 */
 
+#ifdef CONFIG_HEKI
+
+static void heki_change_page_attr_set(unsigned long va, int numpages,
+ pgprot_t set)
+{
+   unsigned long va_end;
+   unsigned long set_permissions = 0, clear_permissions = 0;
+
+   heki_pgprot_to_permissions(set, &set_permissions, &clear_permissions);
+   if (!(set_permissions | clear_permissions))
+   return;
+
+   va_end = va + (numpages << PAGE_SHIFT);
+   heki_update(va, va_end, set_permissions, clear_permissions);
+}
+
+static void heki_change_page_attr_clear(unsigned long va, int numpages,
+   pgprot_t clear)
+{
+   unsigned long va_end;
+   unsigned long set_permissions = 0, clear_permissions = 0;
+
+   heki_pgprot_to_permissions(clear, &clear_permissions, &set_permissions);
+   if (!(set_permissions | clear_permissions))
+   return;
+
+   va_end = va + (numpages << PAGE_SHIFT);
+   heki_update(va, va_end, set_permissions, clear_permissions);
+}
+
+#else /* !CONFIG_HEKI */
+
+static void heki_change_page_attr_set(unsigned long va, int numpages,
+ pgprot_t set)
+{
+}
+
+static void heki_change_page_attr_clear(unsigned long va, int numpages,
+   pgprot_t clear)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
 int set_memory_x(unsigned long addr, int numpages)
 {
if (!(__supported_pte_mask & _PAGE_NX))
return 0;
 
+   heki_change_page_attr_clear(addr, numpages, __pgprot(_PAGE_NX));
return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_NX), 0);
 }
 
@@ -2069,11 +2115,14 @@ int set_memory_nx(unsigned long addr, int numpages)
if (!(__supported_pte_mask & _PAGE_NX))
return 0;
 
+   heki_change_page_attr_set(addr, numpages, __pgprot(_PAGE_NX));
return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_NX), 0);
 }
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
+   // TODO: What about _PAGE_DIRTY?
+   heki_change_page_attr_clear(addr, numpages, __pgprot(_PAGE_RW));
return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
@@ -2084,11 +2133,13 @@ int set_memory_rox(unsigned long addr, int numpages)
if (__supported_pte_mask & _PAGE_NX)
clr.pgprot |= _PAGE_NX;
 
+   heki_change_page_attr_clear(addr, numpages, clr);
return change_page_attr_clear(&addr, numpages, clr, 0);
 }
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
+   heki_change_page_attr_set(addr, numpages, __pgprot(_PAGE_RW));
return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_RW), 0);
 }
 
diff --git a/include/linux/heki.h b/include/linux/heki.h
index d660994d34d0..079b34af07f0 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -73,6 +73,7 @@ struct heki_hypervisor {
  *
  * - a page is mapped into the kernel address space
  * - a page is unmapped from the kernel address space
+ * - permissions are changed for a mapped page
  */
 struct heki {
struct heki_hypervisor *hypervisor;
@@ -81,6 +82,7 @@ struct heki {
 
 enum heki_cmd {
HEKI_MAP,
+   HEKI_UPDATE,
HEKI_UNMAP,
 };
 
@@ -98,6 +100,10 @@ struct heki_args {
 
/* Command passed by caller. */
enum heki_c

[RFC PATCH v2 14/19] heki: x86: Initialize permissions counters for pages mapped into KVA

2023-11-12 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Define a permissions counters structure that contains a counter for
read, write and execute. Each mapped guest page will be allocated a
permissions counters structure.

During kernel boot, walk the kernel address space, locate all the
mappings, create permissions counters for each mapped guest page and
update the counters to reflect the collective permissions for each page
across all of its mappings.

The collective permissions will be applied in the EPT in a following
commit.

We might want to move these counters to a safer place (e.g., KVM) to
protect them from tampering by the guest kernel itself.

We should note that walking through all mappings might be slow if KASAN
is enabled.
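
The counting scheme can be sketched as below. The heki_counters layout
matches this patch; the MEM_ATTR_* bit values and the helper names
counters_account() and counters_collective() are hypothetical. The sketch
also assumes that "collective permissions" means the union of the
permissions of all mappings of a page, which is how the counters read but
is not spelled out here.

```c
#include <stdint.h>

/* Hypothetical bit values for the memory attributes. */
#define MEM_ATTR_READ  (1u << 0)
#define MEM_ATTR_WRITE (1u << 1)
#define MEM_ATTR_EXEC  (1u << 2)

/* One counter per permission, counting the mappings that grant it. */
struct heki_counters {
	int read;
	int write;
	int execute;
};

/* Account for a mapping being added (delta = 1) or removed (delta = -1). */
static void counters_account(struct heki_counters *c, uint32_t perms, int delta)
{
	if (perms & MEM_ATTR_READ)
		c->read += delta;
	if (perms & MEM_ATTR_WRITE)
		c->write += delta;
	if (perms & MEM_ATTR_EXEC)
		c->execute += delta;
}

/* A permission is part of the collective set while any mapping grants it. */
static uint32_t counters_collective(const struct heki_counters *c)
{
	uint32_t perms = 0;

	if (c->read > 0)
		perms |= MEM_ATTR_READ;
	if (c->write > 0)
		perms |= MEM_ATTR_WRITE;
	if (c->execute > 0)
		perms |= MEM_ATTR_EXEC;

	return perms;
}
```

With one read-execute mapping and one read-write mapping of the same
page, the collective permissions are RWX; dropping the read-write mapping
brings them back to R_X.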

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Suggested-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
---

Changes since v1:
* New patch and new files: arch/x86/mm/heki.c and virt/heki/counters.c
---
 arch/x86/mm/Makefile |   2 +
 arch/x86/mm/heki.c   |  56 +
 include/linux/heki.h |  32 ++
 virt/heki/Kconfig|   2 +
 virt/heki/Makefile   |   1 +
 virt/heki/counters.c | 147 +++
 virt/heki/main.c |  13 
 7 files changed, 253 insertions(+)
 create mode 100644 arch/x86/mm/heki.c
 create mode 100644 virt/heki/counters.c

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index c80febc44cd2..2998eaac0dbb 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -67,3 +67,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
+
+obj-$(CONFIG_HEKI) += heki.o
diff --git a/arch/x86/mm/heki.c b/arch/x86/mm/heki.c
new file mode 100644
index ..c495df0d8772
--- /dev/null
+++ b/arch/x86/mm/heki.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hypervisor Enforced Kernel Integrity (Heki) - Arch specific.
+ *
+ * Copyright © 2023 Microsoft Corporation
+ */
+
+#include 
+#include 
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) "heki-guest: " fmt
+
+static unsigned long kernel_va;
+static unsigned long kernel_end;
+static unsigned long direct_map_va;
+static unsigned long direct_map_end;
+
+__init void heki_arch_early_init(void)
+{
+   /* Kernel virtual address space range, not yet compatible with KASLR. */
+   if (pgtable_l5_enabled()) {
+   kernel_va = 0xff00UL;
+   kernel_end = 0xffe0UL;
+   direct_map_va = 0xff11UL;
+   direct_map_end = 0xff91UL;
+   } else {
+   kernel_va = 0x8000UL;
+   kernel_end = 0xffe0UL;
+   direct_map_va = 0x8880UL;
+   direct_map_end = 0xc880UL;
+   }
+
+   /*
+* Initialize the counters for all existing kernel mappings except
+* for direct map.
+*/
+   heki_map(kernel_va, direct_map_va);
+   heki_map(direct_map_end, kernel_end);
+}
+
+unsigned long heki_flags_to_permissions(unsigned long flags)
+{
+   unsigned long permissions;
+
+   permissions = MEM_ATTR_READ | MEM_ATTR_EXEC;
+   if (flags & _PAGE_RW)
+   permissions |= MEM_ATTR_WRITE;
+   if (flags & _PAGE_NX)
+   permissions &= ~MEM_ATTR_EXEC;
+
+   return permissions;
+}
diff --git a/include/linux/heki.h b/include/linux/heki.h
index a7ae0b387dfe..86c787d121e0 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -19,6 +19,16 @@
 
 #ifdef CONFIG_HEKI
 
+/*
+ * This structure keeps track of the collective permissions for a guest page
+ * across all of its mappings.
+ */
+struct heki_counters {
+   int read;
+   int write;
+   int execute;
+};
+
 /*
  * This structure contains a guest physical range and its permissions (RWX).
  */
@@ -56,9 +66,17 @@ struct heki_hypervisor {
 /*
  * If the active hypervisor supports Heki, it will plug its heki_hypervisor
  * pointer into this heki structure.
+ *
+ * During guest kernel boot, permissions counters for each guest page are
+ * initialized based on the page's current permissions.
  */
 struct heki {
struct heki_hypervisor *hypervisor;
+   struct mem_table *counters;
+};
+
+enum heki_cmd {
+   HEKI_MAP,
 };
 
 /*
@@ -72,6 +90,9 @@ struct heki_args {
phys_addr_t pa;
size_t size;
unsigned long flags;
+
+   /* Command passed by caller. */
+   enum heki_cmd cmd;
 };
 
 /* Callback function called by the table walker. */
@@ -84,6 +105,14 @@ extern bool __read_mostly enable_mbec;
 
 void heki_early_init(void);
 void heki_late_init(void);
+void heki_counters_init(

[RFC PATCH v2 06/19] KVM: x86: Add kvm_x86_ops.fault_gva()

2023-11-12 Thread Mickaël Salaün
This function is needed for kvm_mmu_page_fault() to create synthetic
page faults.

Code originally written by Mihai Donțu and Nicușor Cîțu:
https://lore.kernel.org/r/20211006173113.26445-18-ala...@bitdefender.com
Renamed fault_gla() to fault_gva() and use the new
EPT_VIOLATION_GVA_IS_VALID.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mihai Donțu 
Signed-off-by: Mihai Donțu 
Co-developed-by: Nicușor Cîțu 
Signed-off-by: Nicușor Cîțu 
Signed-off-by: Mickaël Salaün 
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h|  2 ++
 arch/x86/kvm/svm/svm.c |  9 +
 arch/x86/kvm/vmx/vmx.c | 10 ++
 4 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e3054e3e46d5..ba3db679db2b 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP(fault_gva)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dff10051e9b6..0415dacd4b28 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1750,6 +1750,8 @@ struct kvm_x86_ops {
 * Returns vCPU specific APICv inhibit reasons
 */
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+   u64 (*fault_gva)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index beea99c8e8e0..d32517a2cf9c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4906,6 +4906,13 @@ static int svm_vm_init(struct kvm *kvm)
return 0;
 }
 
+static u64 svm_fault_gva(struct kvm_vcpu *vcpu)
+{
+   const struct vcpu_svm *svm = to_svm(vcpu);
+
+   return svm->vcpu.arch.cr2 ? svm->vcpu.arch.cr2 : ~0ull;
+}
+
 static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = KBUILD_MODNAME,
 
@@ -5037,6 +5044,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+
+   .fault_gva = svm_fault_gva,
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1b1581f578b0..a8158bc1dda9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8233,6 +8233,14 @@ static void vmx_vm_destroy(struct kvm *kvm)
free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
 }
 
+static u64 vmx_fault_gva(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
+   return vmcs_readl(GUEST_LINEAR_ADDRESS);
+
+   return ~0ull;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
.name = KBUILD_MODNAME,
 
@@ -8373,6 +8381,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,
 
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+   .fault_gva = vmx_fault_gva,
 };
 
 static unsigned int vmx_handle_intel_pt_intr(void)
-- 
2.42.1




[RFC PATCH v2 07/19] KVM: x86: Make memory attribute helpers more generic

2023-11-12 Thread Mickaël Salaün
To make it useful for other use cases such as Heki, remove the private
memory optimizations.

I guess we could try to infer the applied attributes to get back these
optimizations when it makes sense, but let's keep this simple for now.

Main changes:

- Replace slots_lock with slots_arch_lock to make it callable from a KVM
  hypercall.

- Move this mutex lock into kvm_vm_ioctl_set_mem_attributes() to make it
  easier to use with other locks.

- Export kvm_vm_set_mem_attributes().

- Remove the kvm_arch_pre_set_memory_attributes() and
  kvm_arch_post_set_memory_attributes() KVM_MEMORY_ATTRIBUTE_PRIVATE
  optimizations.

Cc: Chao Peng 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Sean Christopherson 
Cc: Yu Zhang 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* New patch
---
 arch/x86/kvm/mmu/mmu.c   | 23 ---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 19 ++-
 3 files changed, 12 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7e053973125c..4d378d308762 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7251,20 +7251,6 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range)
 {
-   /*
-* Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
-* supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
-* can simply ignore such slots.  But if userspace is making memory
-* PRIVATE, then KVM must prevent the guest from accessing the memory
-* as shared.  And if userspace is making memory SHARED and this point
-* is reached, then at least one page within the range was previously
-* PRIVATE, i.e. the slot's possible hugepage ranges are changing.
-* Zapping SPTEs in this case ensures KVM will reassess whether or not
-* a hugepage can be used for affected ranges.
-*/
-   if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-   return false;
-
return kvm_unmap_gfn_range(kvm, range);
 }
 
@@ -7313,15 +7299,6 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
lockdep_assert_held_write(&kvm->mmu_lock);
lockdep_assert_held(&kvm->slots_lock);
 
-   /*
-* Calculate which ranges can be mapped with hugepages even if the slot
-* can't map memory PRIVATE.  KVM mustn't create a SHARED hugepage over
-* a range that has PRIVATE GFNs, and conversely converting a range to
-* SHARED may now allow hugepages.
-*/
-   if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-   return false;
-
/*
 * The sequence matters here: upper levels consume the result of lower
 * level's scanning.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec32af17add8..85b8648fd892 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2396,6 +2396,8 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 struct kvm_gfn_range *range);
+int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+ unsigned long attributes);
 
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 23633984142f..0096ccfbb609 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2552,7 +2552,7 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
 }
 
 /* Set @attributes for the gfn range [@start, @end). */
-static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 unsigned long attributes)
 {
struct kvm_mmu_notifier_range pre_set_range = {
@@ -2577,11 +2577,11 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
entry = attributes ? xa_mk_value(attributes) : NULL;
 
-   mutex_lock(&kvm->slots_lock);
+   lockdep_assert_held(&kvm->slots_arch_lock);
 
/* Nothing to do if the entire range as the desired attributes. */
if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
-   goto out_unlock;
+   return r;
 
/*
 * Reserve memory ahead of time to avoid having to deal with failures
@@ -2590,7 +2590,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
for (i = start; i < end; i++) {
r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
if (r)
-   goto out_unlock;
+   return r;
}
 
kvm_handle_gfn_ran

[RFC PATCH v2 02/19] KVM: x86: Add new hypercall to lock control registers

2023-11-12 Thread Mickaël Salaün
This enables guests to lock their CR0 and CR4 registers with a subset of
X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes three arguments.  The
first is to identify the control register, the second is a bit mask to
pin (i.e. mark as read-only), and the third is for optional flags.

These register flags should already be pinned by Linux guests, but once
compromised, this self-protection mechanism could be disabled, which is
not the case with this dedicated hypercall.

Once the CRs are pinned by the guest, if it attempts to change them,
then a general protection fault is sent to the guest.

This hypercall may evolve and support new kinds of registers or pinning.
The optional KVM_LOCK_CR_UPDATE_VERSION flag enables guests to know the
supported abilities by mapping the returned version with the related
features.
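
To make the pin-then-check semantics described above concrete, here is a
minimal userspace sketch in C.  It is not the code from this patch:
`struct cr_pin_state`, `model_lock_cr0()` and `model_cr0_write_denied()` are
hypothetical names, and only CR0.WP is modeled.  The only architectural fact
assumed is that WP is bit 16 of CR0.

```c
#include <stdbool.h>

/* Architectural bit position: CR0.WP is bit 16. */
#define X86_CR0_WP (1UL << 16)

/* Hypothetical per-VM state: the CR0 bits the guest asked to pin. */
struct cr_pin_state {
	unsigned long pinned_cr0;
};

/*
 * Model of the lock step.  The request is rejected if it is empty or if it
 * names bits that are not currently set; otherwise the bits are added to the
 * pinned set.  Pinned bits only accumulate, they are never removed.
 */
int model_lock_cr0(struct cr_pin_state *s, unsigned long cur_cr0,
		   unsigned long pin)
{
	if (!pin || (cur_cr0 & pin) != pin)
		return -1;
	s->pinned_cr0 |= pin;
	return 0;
}

/*
 * Model of the check step.  Returns true when the new value would clear a
 * pinned bit, i.e. the case where the guest would receive a fault.
 */
bool model_cr0_write_denied(const struct cr_pin_state *s, unsigned long val)
{
	return (val & s->pinned_cr0) != s->pinned_cr0;
}
```

In this model, a write that keeps WP set is allowed and a write that clears
WP after it was pinned is denied, which matches the behavior described in the
commit message.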

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
---

Changes since v1:
* Guard KVM_HC_LOCK_CR_UPDATE hypercall with CONFIG_HEKI.
* Move extern cr4_pinned_mask to x86.h (suggested by Kees Cook).
* Move VMX CR checks from vmx_set_cr*() to handle_cr() to make it
  possible to return to user space (see next commit).
* Change the heki_check_cr()'s first argument to vcpu.
* Don't use -KVM_EPERM in heki_check_cr().
* Generate a fault when the guest requests a denied CR update.
* Add a flags argument to get the version of this hypercall. Being able
  to do a proper version check was suggested by Wei Liu.
---
 Documentation/virt/kvm/x86/hypercalls.rst | 17 +
 arch/x86/include/uapi/asm/kvm_para.h  |  2 +
 arch/x86/kernel/cpu/common.c  |  4 +-
 arch/x86/kvm/vmx/vmx.c|  5 ++
 arch/x86/kvm/x86.c| 84 +++
 arch/x86/kvm/x86.h| 22 ++
 include/linux/kvm_host.h  |  5 ++
 include/uapi/linux/kvm_para.h |  1 +
 8 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst 
b/Documentation/virt/kvm/x86/hypercalls.rst
index 10db7924720f..3178576f4c47 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must 
enable that capability
 before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID.  In
 addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
 must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
+
+9. KVM_HC_LOCK_CR_UPDATE
+
+
+:Architecture: x86
+:Status: active
+:Purpose: Request some control registers to be restricted.
+
+- a0: identify a control register
+- a1: bit mask to make some flags read-only
+- a2: optional KVM_LOCK_CR_UPDATE_VERSION flag that will return the version of
+  this hypercall. Version 1 supports CR0 and CR4 pinning.
+
+The hypercall lets a guest request control register flags to be pinned for
+itself.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..efc5ccc0060f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -150,4 +150,6 @@ struct kvm_vcpu_pv_apf_data {
 #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
 #define KVM_PV_EOI_DISABLED 0x0
 
+#define KVM_LOCK_CR_UPDATE_VERSION (1 << 0)
+
 #endif /* _UAPI_ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 4e5ffc8b0e46..f18ee7ce0496 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -400,9 +400,11 @@ static __always_inline void setup_umip(struct cpuinfo_x86 
*c)
 }
 
 /* These bits should not change their value after CPU init is finished. */
-static const unsigned long cr4_pinned_mask =
+const unsigned long cr4_pinned_mask =
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
X86_CR4_FSGSBASE | X86_CR4_CET;
+EXPORT_SYMBOL_GPL(cr4_pinned_mask);
+
 static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
 static unsigned long cr4_pinned_bits __ro_after_init;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6e502ba93141..f487bf16dd96 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5452,6 +5452,11 @@ static int handle_cr(struct kvm_vcpu *vcpu)
case 0: /* mov to cr */
val = kvm_register_read(vcpu, reg);
trace_kvm_cr_write(cr, val);
+
+   ret = heki_check_cr(vcpu, cr, val);
+   if (ret)
+   return ret;
+
switch (cr) {
case 0:
err = handle_set_cr0(vcpu, val);
diff --git 

Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-06-02 Thread Mickaël Salaün



On 31/05/2023 22:24, Sean Christopherson wrote:

On Tue, May 30, 2023, Rick P Edgecombe wrote:

On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:

Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.


Good point, I'll check this confused deputy attack. Extended KVM
protections should indeed handle all ways to map guests' memory.  I'm
wondering if current VMMs would gracefully handle such new restrictions
though.


I guess the host could map arbitrary data to the guest, so that needs to be
handled, but how could the VMM (not the host kernel) bypass/update EPT
initially used for the guest (and potentially later mapped to the host)?


Well traditionally both QEMU and KVM accessed guest memory via host
mappings instead of the EPT. So I'm wondering what is stopping the
guest from passing a protected gfn when setting up the DMA, and QEMU
being enticed to write to it? The emulator as well would use these host
userspace mappings and not consult the EPT IIRC.

I think Sean was suggesting host userspace should be more involved in
this process, so perhaps it could protect its own alias of the
protected memory, for example mprotect() it as read-only.
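
The mprotect() idea can be sketched in a few lines of userspace C.  This is
only an illustration of the suggestion above, not something any VMM is known
to do: `protect_alias_readonly()` is a hypothetical helper, and it only
covers the VMM's own mapping of guest memory, not the places where KVM itself
maps a guest-provided GPA.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Make the host userspace alias of [addr, addr + len) read-only.
 * mprotect() works on whole pages, so the range is widened to page bounds.
 * Returns 0 on success and -1 on error.
 */
int protect_alias_readonly(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	uintptr_t mask, start, end;

	if (page <= 0 || len == 0)
		return -1;

	mask = (uintptr_t)page - 1;
	start = (uintptr_t)addr & ~mask;
	end = ((uintptr_t)addr + len + mask) & ~mask;

	return mprotect((void *)start, end - start, PROT_READ);
}
```

After such a call, a device emulation path that writes through this mapping
would fault in the VMM instead of silently modifying the protected page,
which is also why existing VMMs may not handle it gracefully.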


Ya, though "suggesting" is really "demanding, unless someone provides super
strong justification for handling this directly in KVM".  It's basically the
same argument that led to Linux Security Modules: I'm all for KVM providing the
framework and plumbing, but I don't want KVM to get involved in defining policy,
threat models, etc.


I agree that KVM should not provide its own policy but only the building 
blocks to enforce one. There are two complementary points:

- policy definition by the guest, provided to KVM and the host;
- policy enforcement by KVM and the host.

A potential extension of this framework could be to enable the host to 
define its own policy for guests, but this would be a different threat 
model.


To avoid too much latency because of the host being involved in policy 
enforcement, I'd like to explore an asynchronous approach that would 
especially fit well for dynamic restrictions.




Re: [ANNOUNCE] KVM Microconference at LPC 2023

2023-06-01 Thread Mickaël Salaün

Hi,

What is the status of this microconference proposal? We'd be happy to 
talk about Heki [1] and potentially other hypervisor supports.


Regards,
 Mickaël


[1] https://lore.kernel.org/all/20230505152046.6575-1-...@digikod.net/


On 26/05/2023 18:09, Mickaël Salaün wrote:

See James Morris's proposal here:
https://lore.kernel.org/all/17f62cb1-a5de-2020-2041-359b8e96b...@linux.microsoft.com/

On 26/05/2023 04:36, James Morris wrote:
  > [Side topic]
  >
  > Would folks be interested in a Linux Plumbers Conference MC on this
  > topic generally, across different hypervisors, VMMs, and architectures?
  >
  > If so, please let me know who the key folk would be and we can try
  > writing up an MC proposal.

The fine-grain memory management proposal from James Gowans looks
interesting, especially the "side-car" virtual machines:
https://lore.kernel.org/all/88db2d9cb42e471692ff1feb0b9ca855906a9d95.ca...@amazon.com/


On 09/05/2023 11:55, Paolo Bonzini wrote:

Hi all!

We are planning on submitting a CFP to host a KVM Microconference at
Linux Plumbers Conference 2023. To help justify the proposal, we would
like to gather a list of folks that would likely attend, and crowdsource
a list of topics to include in the proposal.

For both this year and future years, the intent is that a KVM
Microconference will complement KVM Forum, *NOT* supplant it. As you
probably noticed, KVM Forum is going through a somewhat radical change in
how it's organized; the conference is now free and (with some help from
Red Hat) organized directly by the KVM and QEMU communities. Despite the
unexpected changes and some teething pains, community response to KVM
Forum continues to be overwhelmingly positive! KVM Forum will remain
the venue of choice for KVM/userspace collaboration, for educational
content covering both KVM and userspace, and to discuss new features in
QEMU and other userspace projects.

At least on the x86 side, however, the success of KVM Forum led us
virtualization folks to operate in relative isolation. KVM depends on
and impacts multiple subsystems (MM, scheduler, perf) in profound ways,
and recently we’ve seen more and more ideas/features that require
non-trivial changes outside KVM and buy-in from stakeholders that
(typically) do not attend KVM Forum. Linux Plumbers Conference is a
natural place to establish such collaboration within the kernel.

Therefore, the aim of the KVM Microconference will be:
* to provide a setting in which to discuss KVM and kernel internals
* to increase collaboration and reduce friction with other subsystems
* to discuss system virtualization issues that require coordination with
other subsystems (such as VFIO, or guest support in arch/)

Below is a rough draft of the planned CFP submission.

Thanks!

Paolo Bonzini (KVM Maintainer)
Sean Christopherson (KVM x86 Co-Maintainer)
Marc Zyngier (KVM ARM Co-Maintainer)


===
KVM Microconference
===

KVM (Kernel-based Virtual Machine) enables the use of hardware features
to improve the efficiency, performance, and security of virtual machines
created and managed by userspace.  KVM was originally developed to host
and accelerate "full" virtual machines running a traditional kernel and
operating system, but has long since expanded to cover a wide array of use
cases, e.g. hosting real time workloads, sandboxing untrusted workloads,
deprivileging third party code, reducing the trusted computing base of
security sensitive workloads, etc.  As KVM's use cases have grown, so too
have the requirements placed on KVM and the interactions between it and
other kernel subsystems.

The KVM Microconference will focus on how to evolve KVM and adjacent
subsystems in order to satisfy new and upcoming requirements: serving
guest memory that cannot be accessed by host userspace[1], providing
accurate, feature-rich PMU/perf virtualization in cloud VMs[2], etc.


Potential Topics:
 - Serving inaccessible/unmappable memory for KVM guests (protected VMs)
 - Optimizing mmu_notifiers, e.g. reducing TLB flushes and spurious zapping
 - Supporting multiple KVM modules (for non-disruptive upgrades)
 - Improving and hardening KVM+perf interactions
 - Implementing arch-agnostic abstractions in KVM (e.g. MMU)
 - Defining KVM requirements for hardware vendors
 - Utilizing "fault" injection to increase test coverage of edge cases
 - KVM vs VFIO (e.g. memory types, a rather hot topic on the ARM side)


Key Attendees:
 - Paolo Bonzini  (KVM Maintainer)
 - Sean Christopherson   (KVM x86 Co-Maintainer)
 - Your name could be here!

[1] 
https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com
[2] 
https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5q093k3rmueasf3vel39mgv5br9w...@mail.gmail.com






Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-30 Thread Mickaël Salaün



On 25/05/2023 20:34, Trilok Soni wrote:

On 5/25/2023 6:25 AM, Mickaël Salaün wrote:


On 24/05/2023 23:04, Trilok Soni wrote:

On 5/5/2023 8:20 AM, Mickaël Salaün wrote:

Hi,

This patch series is a proof-of-concept that implements new KVM features
(extended page tracking, MBEC support, CR pinning) and defines a new API to
protect guest VMs. No VMM (e.g., Qemu) modification is required.

The main idea being that kernel self-protection mechanisms should be delegated
to a more privileged part of the system, hence the hypervisor. It is still the
role of the guest kernel to request such restrictions according to its


Only for the guest kernel images here? Why not for the host OS kernel?


As explained in the Future work section, protecting the host would be
useful, but that doesn't really fit with the KVM model. The Protected
KVM project is a first step to help in this direction [11].

In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel
is also part of the hypervisor.



Embedded devices w/ Android you have mentioned below supports the host
OS as well it seems, right?


What do you mean?


I think you have answered this above w/ pKVM and I was referring the
host protection as well w/ Heki. The link/references below refers to the
Android OS it seems and not guest VM.






Do we suggest that all the functionalities should be implemented in the
Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).


KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means
that we may not control the related code.

This patch series is dedicated to hypervisor-enforced kernel integrity,
then KVM.



I am hoping that whatever we suggest the interface here from the Guest
to the Hypervisor becomes the ABI right?


Yes, hypercalls are part of the KVM ABI.


Sure. I just hope that they are extensible enough to support for other
Hypervisors too. I am not sure if they are on this list like ACRN / Xen
and see if it fits their need too.


KVM, Hyper-V and Xen mailing lists are CCed. The KVM hypercalls are 
specific to KVM, but this patch series also includes a common guest API 
intended to be used with all hypervisors.





Is there any other Hypervisor you plan to test this feature as well?


We're also working on Hyper-V.










# Current limitations

The main limitation of this patch series is the statically enforced
permissions. This is not an issue for kernels without modules but this needs to
be addressed.  Mechanisms that dynamically impact kernel executable memory are
not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
code will need to be authenticated.  Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want to
implement a code authentication mechanism in the hypervisor itself but delegate
this verification to something much less privileged. We are thinking of two
ways to solve this: implement this verification in the VMM or spawn a dedicated
special VM (similar to Windows's VBS). There are pros and cons to each approach:
complexity, verification code ownership (guest's or VMM's), access to guest
memory (i.e., confidential computing).


Do you foresee the performance regressions due to lot of tracking here?


The performance impact of execution prevention should be negligible
because once configured the hypervisor does nothing except catch
illegitimate access attempts.


Yes, if you are using the static kernel only and not considering the
other dynamic patching features like explained. They need to be thought
upon differently to reduce the likely impact.


What do you mean? We plan to support dynamic code, and performance is of 
course part of the requirement.









Production kernels do have lot of tracepoints and we use it as feature
in the GKI kernel for the vendor hooks implementation and in those cases
every vendor driver is a module.


As explained in this section, dynamic kernel modifications such as
tracepoints or modules are not currently supported by this patch series.
Handling tracepoints is possible but requires more work to define and
check legitimate changes. This proposal is still useful for static
kernels though.



Separate VM further fragments this
design and delegates more of it to proprietary solutions?


What do you mean? KVM is not a proprietary solution.


Ah, I was referring the VBS Windows VM mentioned in the above text. Is
it open-source? The reference of VM (or dedicated VM) didn't mention
that VM itself will be open-source running Linux kernel.


This patch series is dedicated to KVM. Windows VBS was only mentioned as 
a comparable (but much more advanced) set of features. Everything 
required to use these new KVM features is and will be open-source. There 
is nothing to worry about licensing, the goal is to make it widely and 
freely available to protect users.







For dynamic checks, this would require code not run by KVM itself, but
either the VMM or a dedicated VM

Re: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers

2023-05-29 Thread Mickaël Salaün



On 08/05/2023 23:11, Wei Liu wrote:

On Fri, May 05, 2023 at 05:20:42PM +0200, Mickaël Salaün wrote:

This enables guests to lock their CR0 and CR4 registers with a subset of
X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments.  The first
is to identify the control register, and the second is a bit mask to
pin (i.e. mark as read-only).

These register flags should already be pinned by Linux guests, but once
compromised, this self-protection mechanism could be disabled, which is
not the case with this dedicated hypercall.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-6-...@digikod.net

[...]

hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
if (is_unrestricted_guest(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffab64d08de3..a529455359ac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct 
x86_emulate_ctxt *ctxt, int cr)
return value;
  }
  
+#ifdef CONFIG_HEKI

+
+extern unsigned long cr4_pinned_mask;
+


Can this be moved to a header file?


Yep, but I'm not sure which one. Any preference Kees?





+static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr,
+   unsigned long pin)
+{
+   if (!pin)
+   return -KVM_EINVAL;
+
+   switch (cr) {
+   case 0:
+   /* Cf. arch/x86/kernel/cpu/common.c */
+   if (!(pin & X86_CR0_WP))
+   return -KVM_EINVAL;
+
+   if ((read_cr0() & pin) != pin)
+   return -KVM_EINVAL;
+
+   atomic_long_or(pin, &kvm->heki_pinned_cr0);
+   return 0;
+   case 4:
+   /* Checks for irrelevant bits. */
+   if ((pin & cr4_pinned_mask) != pin)
+   return -KVM_EINVAL;
+


It is enforcing the host mask on the guest, right? If the guest's set is a
super set of the host's then it will get rejected.



+   /* Ignores bits not present in host. */
+   pin &= __read_cr4();
+   atomic_long_or(pin, &kvm->heki_pinned_cr4);


We assume that the host's mask is a superset of the guest's mask. I 
guess we should check the absolute supported bits instead, even if it 
would be weird for the host to not support these bits.
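
One possible reading of "check the absolute supported bits" is sketched
below as a userspace model.  It is an assumption, not the follow-up patch:
`MODEL_CR4_PINNABLE_MASK` and `model_lock_cr4()` are hypothetical names, the
bit positions are the architectural CR4 ones, and validating against the
guest's own CR4 value (mirroring the CR0 branch) is a design choice made for
this sketch only.

```c
/* CR4 bit positions as defined by the x86 architecture. */
#define X86_CR4_UMIP     (1UL << 11)
#define X86_CR4_FSGSBASE (1UL << 16)
#define X86_CR4_SMEP     (1UL << 20)
#define X86_CR4_SMAP     (1UL << 21)
#define X86_CR4_CET      (1UL << 23)

/* Fixed list of bits that may be pinned, independent of the host's CR4. */
#define MODEL_CR4_PINNABLE_MASK \
	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP | \
	 X86_CR4_FSGSBASE | X86_CR4_CET)

/*
 * Hypothetical variant of the CR4 branch: the request is validated against
 * the fixed pinnable mask and against what the guest currently has set,
 * instead of being silently masked with the host's CR4 value.
 * Returns 0 and accumulates into *pinned on success, -1 on a bad request.
 */
int model_lock_cr4(unsigned long *pinned, unsigned long guest_cr4,
		   unsigned long pin)
{
	if (!pin)
		return -1;
	if (pin & ~MODEL_CR4_PINNABLE_MASK)
		return -1; /* irrelevant bits */
	if ((guest_cr4 & pin) != pin)
		return -1; /* bits the guest has not enabled */
	*pinned |= pin;
	return 0;
}
```

With this shape, a guest asking to pin a bit it has not enabled gets an
explicit error rather than a silently reduced mask.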




+   return 0;
+   }
+   return -KVM_EINVAL;
+}
+
+int heki_check_cr(const struct kvm *const kvm, const unsigned long cr,
+ const unsigned long val)
+{
+   unsigned long pinned;
+
+   switch (cr) {
+   case 0:
+   pinned = atomic_long_read(&kvm->heki_pinned_cr0);
+   if ((val & pinned) != pinned) {
+   pr_warn_ratelimited(
+   "heki-kvm: Blocked CR0 update: 0x%lx\n", val);


I think if the message contains the VM and VCPU identifier it will
become more useful.


Indeed, and this should be the case for all log messages, but I'd left 
that for future work. ;) I'll update the logs for the next series with a 
new kvm_warn_ratelimited() helper using VCPU's PID.




Re: [PATCH v1 3/9] virt: Implement Heki common code

2023-05-29 Thread Mickaël Salaün



On 17/05/2023 14:47, Madhavan T. Venkataraman wrote:

Sorry for the delay. See inline...

On 5/8/23 12:29, Wei Liu wrote:

On Fri, May 05, 2023 at 05:20:40PM +0200, Mickaël Salaün wrote:

From: Madhavan T. Venkataraman 

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Configuration
=

Define the config variables for the feature. This feature depends on
support from the architecture as well as the hypervisor.

Enabling HEKI
=

Define a kernel command line parameter "heki" to turn the feature on or
off. By default, Heki is on.


For such a newfangled feature can we have it off by default? Especially
when there are unsolved issues around dynamically loaded code.



Yes. We can certainly do that.


By default the Kconfig option should definitely be off. We also need to 
change the Kconfig option to only be set if kernel modules, JIT, kprobes 
and other dynamic text change features are disabled at build time (see 
discussion with Sean).


With this new Kconfig option for the static case, I think the boot 
option should be on by default because otherwise it would not really be 
possible to switch back to on later without risking silently 
breaking users' machines. However, we should rename this option to 
something like "heki_static" to be in line with the new Kconfig option.


The goal of Heki is to improve and complement kernel self-protection 
mechanisms (which don't have boot time options), and to make it 
available to everyone, see 
https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings
In practice, it would then be kind of useless to be required to set a 
boot option to enable Heki (rather than to disable it).








[...]

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..5cf5a7a97811 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -297,6 +297,7 @@ config X86
select FUNCTION_ALIGNMENT_4B
imply IMA_SECURE_AND_OR_TRUSTED_BOOTif EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+   select ARCH_SUPPORTS_HEKI   if X86_64


Why is there a restriction on X86_64?



We want to get the PoC working and reviewed on X64 first. We have tested this 
only on X64 so far.


X86_64 includes Intel CPUs, which can support EPT and MBEC, which are a 
requirement for Heki. ARM might have similar features but we're focused 
on x86 for now.


As a side note, I only have access to an Intel machine, which means that 
I cannot work on AMD support. However, I'll be pleased to implement such 
support if I get access to a machine with a recent AMD CPU.





  
  config INSTRUCTION_DECODER

def_bool y
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index a6e8373a5170..42ef1e33b8a5 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h

[...]
  
+#ifdef CONFIG_HEKI

+
+/*
+ * Gather all of the statically defined sections so heki_late_init() can
+ * protect these sections in the host page table.
+ *
+ * The sections are defined under "SECTIONS" in vmlinux.lds.S
+ * Keep this array in sync with SECTIONS.
+ */


This seems a bit fragile, because it requires constant attention from
people who care about this functionality. Can this table be
automatically generated?



We realize that. But I don't know of a way this can be automatically generated.
Also, the permissions for each section are specific to the use of that section.
The developer who introduces a new section is the one who will know what the
permissions should be.

If anyone has any ideas of how we can generate this table automatically or
even just add a build time check of some sort, please let us know.


One clean solution might be to parse the vmlinux.lds.S file, extract 
sections and their permissions, and fill that into an automatically 
generated header file.


Another way to do it would be to extract sections and associated 
permissions with objdump, but that could be an issue because of longer 
build time.


A better solution would be to extract such sections and associated 
permissions at boot time. I guess the kernel already has such helpers 
used in early boot.




Re: [PATCH v1 6/9] KVM: x86: Add Heki hypervisor support

2023-05-26 Thread Mickaël Salaün



On 08/05/2023 23:18, Wei Liu wrote:

On Fri, May 05, 2023 at 05:20:43PM +0200, Mickaël Salaün wrote:

From: Madhavan T. Venkataraman 

Each supported hypervisor in x86 implements a struct x86_hyper_init to
define the init functions for the hypervisor.  Define a new init_heki()
entry point in struct x86_hyper_init.  Hypervisors that support Heki
must define this init_heki() function.  Call init_heki() of the chosen
hypervisor in init_hypervisor_platform().

Create a heki_hypervisor structure that each hypervisor can fill
with its data and functions. This will allow the Heki feature to work
in a hypervisor agnostic way.

Declare and initialize a "heki_hypervisor" structure for KVM so KVM can
support Heki.  Define the init_heki() function for KVM.  In init_heki(),
set the hypervisor field in the generic "heki" structure to the KVM
"heki_hypervisor".  After this point, generic Heki code can access the
KVM Heki data and functions.
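
The registration pattern described in this commit message can be sketched
as a small userspace model.  Only the names `heki_hypervisor` and `heki`
come from the description; the `name` and `lock_crs` members, the stub
callback and the `model_*` functions are assumptions made for illustration
and are not the fields of the actual patch.

```c
#include <stddef.h>

/* Hypothetical callback table that a hypervisor backend fills in. */
struct heki_hypervisor {
	const char *name;
	int (*lock_crs)(void);
};

/* Generic state: points at whichever backend registered itself. */
struct heki {
	const struct heki_hypervisor *hypervisor;
};

struct heki heki;

/* Stand-in for a KVM-specific operation (a hypercall in the real code). */
static int kvm_lock_crs_stub(void)
{
	return 0;
}

static const struct heki_hypervisor kvm_heki_hypervisor = {
	.name = "KVM",
	.lock_crs = kvm_lock_crs_stub,
};

/* Model of init_heki() for KVM: publish the KVM backend. */
void model_kvm_init_heki(void)
{
	heki.hypervisor = &kvm_heki_hypervisor;
}

/* Generic code only goes through the published table. */
int model_heki_lock_crs(void)
{
	if (!heki.hypervisor || !heki.hypervisor->lock_crs)
		return -1;
	return heki.hypervisor->lock_crs();
}
```

The point of the indirection is that the generic code never names KVM: a
second hypervisor would only need to provide its own table and init hook.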


[...]

+static void kvm_init_heki(void)
+{
+   long err;
+
+   if (!kvm_para_available())
+   /* Cannot make KVM hypercalls. */
+   return;
+
+   err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1);


Why not do a proper version check or capability check here? If the ABI
or supported features ever change then we have something to rely on?


The attributes will indeed get extended, but I wanted to have a simple 
proposal for now.


Do you mean to get the version of this hypercall e.g., with a dedicated 
flag, like with the 
landlock_create_ruleset/LANDLOCK_CREATE_RULESET_VERSION syscall?
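
For reference, the Landlock pattern mentioned here is a call with a
dedicated flag that returns the ABI version instead of performing the
operation.  A hedged userspace model of the same idea for a hypercall
follows.  Everything in it is hypothetical (`MODEL_HC_FLAG_VERSION`, the
feature bits, and both functions); the mapping of version 1 to CR0 and CR4
pinning is borrowed from the v2 documentation of KVM_HC_LOCK_CR_UPDATE
posted in this thread, not from the memory-range hypercall discussed above.

```c
/* Hypothetical flag: "report the hypercall version instead of acting". */
#define MODEL_HC_FLAG_VERSION (1UL << 0)

/* Hypothetical feature bits a guest derives from the reported version. */
#define MODEL_FEAT_CR0_PINNING (1u << 0)
#define MODEL_FEAT_CR4_PINNING (1u << 1)

/*
 * Stand-in for the hypervisor side.  hv_version <= 0 models a host that
 * does not implement the hypercall at all and returns an error.
 */
long model_hypercall(unsigned long flags, long hv_version)
{
	if (hv_version <= 0)
		return -1;
	if (flags & MODEL_HC_FLAG_VERSION)
		return hv_version;
	return 0; /* a real call would apply the requested restriction here */
}

/* Guest side: query the version once, then map it to known features. */
unsigned int model_probe_features(long hv_version)
{
	long version = model_hypercall(MODEL_HC_FLAG_VERSION, hv_version);
	unsigned int features = 0;

	if (version >= 1)
		features |= MODEL_FEAT_CR0_PINNING | MODEL_FEAT_CR4_PINNING;
	return features;
}
```

A guest built against version 1 keeps working on a newer host because it
only enables what it knows about, and it gets an empty feature set on a host
that rejects the hypercall.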





Thanks,
Wei.




Re: [ANNOUNCE] KVM Microconference at LPC 2023

2023-05-26 Thread Mickaël Salaün
See James Morris's proposal here: 
https://lore.kernel.org/all/17f62cb1-a5de-2020-2041-359b8e96b...@linux.microsoft.com/


On 26/05/2023 04:36, James Morris wrote:
> [Side topic]
>
> Would folks be interested in a Linux Plumbers Conference MC on this
> topic generally, across different hypervisors, VMMs, and architectures?
>
> If so, please let me know who the key folk would be and we can try
> writing up an MC proposal.

The fine-grain memory management proposal from James Gowans looks 
interesting, especially the "side-car" virtual machines: 
https://lore.kernel.org/all/88db2d9cb42e471692ff1feb0b9ca855906a9d95.ca...@amazon.com/



On 09/05/2023 11:55, Paolo Bonzini wrote:

Hi all!

We are planning on submitting a CFP to host a KVM Microconference at
Linux Plumbers Conference 2023. To help justify the proposal, we would
like to gather a list of folks that would likely attend, and crowdsource
a list of topics to include in the proposal.

For both this year and future years, the intent is that a KVM
Microconference will complement KVM Forum, *NOT* supplant it. As you
probably noticed, KVM Forum is going through a somewhat radical change in
how it's organized; the conference is now free and (with some help from
Red Hat) organized directly by the KVM and QEMU communities. Despite the
unexpected changes and some teething pains, community response to KVM
Forum continues to be overwhelmingly positive! KVM Forum will remain
the venue of choice for KVM/userspace collaboration, for educational
content covering both KVM and userspace, and to discuss new features in
QEMU and other userspace projects.

At least on the x86 side, however, the success of KVM Forum led us
virtualization folks to operate in relative isolation. KVM depends on
and impacts multiple subsystems (MM, scheduler, perf) in profound ways,
and recently we’ve seen more and more ideas/features that require
non-trivial changes outside KVM and buy-in from stakeholders that
(typically) do not attend KVM Forum. Linux Plumbers Conference is a
natural place to establish such collaboration within the kernel.

Therefore, the aim of the KVM Microconference will be:
* to provide a setting in which to discuss KVM and kernel internals
* to increase collaboration and reduce friction with other subsystems
* to discuss system virtualization issues that require coordination with
other subsystems (such as VFIO, or guest support in arch/)

Below is a rough draft of the planned CFP submission.

Thanks!

Paolo Bonzini (KVM Maintainer)
Sean Christopherson (KVM x86 Co-Maintainer)
Marc Zyngier (KVM ARM Co-Maintainer)


===
KVM Microconference
===

KVM (Kernel-based Virtual Machine) enables the use of hardware features
to improve the efficiency, performance, and security of virtual machines
created and managed by userspace.  KVM was originally developed to host
and accelerate "full" virtual machines running a traditional kernel and
operating system, but has long since expanded to cover a wide array of use
cases, e.g. hosting real time workloads, sandboxing untrusted workloads,
deprivileging third party code, reducing the trusted computing base of
security sensitive workloads, etc.  As KVM's use cases have grown, so too
have the requirements placed on KVM and the interactions between it and
other kernel subsystems.

The KVM Microconference will focus on how to evolve KVM and adjacent
subsystems in order to satisfy new and upcoming requirements: serving
guest memory that cannot be accessed by host userspace[1], providing
accurate, feature-rich PMU/perf virtualization in cloud VMs[2], etc.


Potential Topics:
- Serving inaccessible/unmappable memory for KVM guests (protected VMs)
- Optimizing mmu_notifiers, e.g. reducing TLB flushes and spurious zapping
- Supporting multiple KVM modules (for non-disruptive upgrades)
- Improving and hardening KVM+perf interactions
- Implementing arch-agnostic abstractions in KVM (e.g. MMU)
- Defining KVM requirements for hardware vendors
- Utilizing "fault" injection to increase test coverage of edge cases
- KVM vs VFIO (e.g. memory types, a rather hot topic on the ARM side)


Key Attendees:
- Paolo Bonzini  (KVM Maintainer)
- Sean Christopherson   (KVM x86 Co-Maintainer)
- Your name could be here!

[1] 
https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com
[2] 
https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5q093k3rmueasf3vel39mgv5br9w...@mail.gmail.com






Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-26 Thread Mickaël Salaün



On 25/05/2023 17:52, Edgecombe, Rick P wrote:

On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote:
[ snip ]


The kernel often creates writable aliases in order to write to
protected data (kernel text, etc). Some of this is done right as text
is being first written out (alternatives for example), and some happens
way later (jump labels, etc). So for verification, I wonder what stage
you would be verifying? If you want to verify the end state, you would
have to maintain knowledge in the verifier of all the touch-ups the
kernel does. I think it would get very tricky.


For now, in the static kernel case, all rodata and text GPA is
restricted, so aliasing such memory in a writable way before or after
the KVM enforcement would still restrict write access to this memory,
which could be an issue but not a security one. Do you have such
examples in mind?



On x86, look at all the callers of the text_poke() family. In
arch/x86/include/asm/text-patching.h.


OK, thanks!








It also seems it will be a decent ask for the guest kernel to keep
track of GPA permissions as well as normal virtual memory permissions,
if this thing is not widely used.

This would indeed be required to properly handle the dynamic cases.




So I'm wondering if you could go in two directions with this:
1. Make this a feature only for super locked down kernels (no modules,
etc). Forbid any configurations that might modify text. But eBPF is
used for seccomp, so you might be turning off some security protections
to get this.


Good idea. For "super locked down kernels" :) , we should disable all
kernel executable changes with the related kernel build configuration
(e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such
legitimate access. This looks like an acceptable initial feature.


How many users do you think will want this protection but not
protections that would have to be disabled? The main one that came to
mind for me is cBPF seccomp stuff.

But also, the alternative to JITing cBPF is the eBPF interpreter which
AFAIU is considered a juicy enough target for speculative attacks that
they created an option to compile it out. And leaving an interpreter in
the kernel means any data could be "executed" in the normal non-
speculative scenario, kind of working around the hypervisor executable
protections. Dropping e/cBPF entirely would be an option, but then I
wonder how many users you have left. Hopefully that is all correct,
it's hard to keep track with the pace of BPF development.


seccomp-bpf doesn't rely on JIT, so it is not an issue. For eBPF, JIT is 
optional, but other text changes may be required according to the eBPF 
program type (e.g. using kprobes).





I wonder if it might be a good idea to POC the guest side before
settling on the KVM interface. Then you can also look at the whole
thing and judge how much usage it would get for the different options
of restrictions.


The next step is to handle dynamic permissions, but it will be easier to 
first implement that in KVM itself (which already has the required 
authentication code). The current interface may be flexible enough 
though, only new attribute flags should be required (and potentially an 
async mode). Anyway, this will enable to look at the whole thing.









2. Loosen the rules to allow the protections to not be so one-way
enable. Get less security, but used more widely.


This is our goal. I think both static and dynamic cases are legitimate
and have value according to the level of security sought. This should be
a build-time configuration.


Yea, the proper way to do this is probably to move all text handling
stuff into a separate domain of some sort, like you mentioned
elsewhere. It would be quite a job.


Not necessarily to move this code, but to make sure that the changes are 
legitimate (e.g. text signatures, legitimate addresses). This doesn't 
need to be perfect but it should improve the current state by increasing 
the cost of attacks.




Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-26 Thread Mickaël Salaün



On 25/05/2023 15:59, Mickaël Salaün wrote:


On 25/05/2023 00:20, Edgecombe, Rick P wrote:

On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:

# How does it work?

This implementation mainly leverages KVM capabilities to control the Second
Layer Address Translation (or the Two Dimensional Paging e.g., Intel's EPT or
AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) introduced with
the Kaby Lake (7th generation) architecture. This allows setting permissions on
memory pages in a complementary way to the guest kernel's managed memory
permissions. Once these permissions are set, they are locked and there is no
way back.

A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest kernel to lock
a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the
HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a specific
set of pages (allow-list approach), and the second only allows kernel execution
for a set of pages (deny-list approach).

The current implementation sets the whole kernel's .rodata (i.e., any const or
__ro_after_init variables, which includes critical security data such as LSM
parameters) and .text sections as non-writable, and the .text section is the
only one where kernel execution is allowed. This is possible thanks to the new
MBEC support also brought by this series (otherwise the vDSO would have to be
executable). Thanks to this hardware support (VT-x, EPT and MBEC), the
performance impact of such guest protection is negligible.

The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some of its
CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP),
which is another complementary hardening mechanism.

Heki can be enabled with the heki=1 boot command argument.
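
To make the "pinned and no way back" semantics of the control register
hypercall concrete, here is a minimal user-space sketch. It is not the
code from the series: struct cr_pin and both helpers are hypothetical
names used only for illustration. Only the bit positions (CR0.WP is bit
16, CR4.SMEP is bit 20, CR4.SMAP is bit 21) come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions: CR0.WP = 16, CR4.SMEP = 20, CR4.SMAP = 21. */
#define CR0_WP   (UINT64_C(1) << 16)
#define CR4_SMEP (UINT64_C(1) << 20)
#define CR4_SMAP (UINT64_C(1) << 21)

/* Hypothetical per-register pin state (not a structure from the series). */
struct cr_pin {
	uint64_t pinned; /* bits the guest asked to keep set */
};

/* Pinning only ever adds bits: once a bit is locked it stays locked. */
static void cr_pin_lock(struct cr_pin *pin, uint64_t flags)
{
	pin->pinned |= flags;
}

/* A control register write is acceptable only if every pinned bit stays set. */
static bool cr_write_allowed(const struct cr_pin *pin, uint64_t new_value)
{
	return (new_value & pin->pinned) == pin->pinned;
}
```

With SMEP and SMAP pinned, a CR4 write that clears SMAP would be refused
in this model, while a write that only touches unpinned bits goes
through.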




Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.


Good point, I'll check this confused deputy attack. Extended KVM
protections should indeed handle all ways to map guests' memory. I'm
wondering if current VMMs would gracefully handle such new restrictions
though.


I guess the host could map arbitrary data to the guest, so that needs to
be handled, but how could the VMM (not the host kernel) bypass/update
EPT initially used for the guest (and potentially later mapped to the host)?




Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Mickaël Salaün



On 25/05/2023 00:20, Edgecombe, Rick P wrote:

On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:

# How does it work?

This implementation mainly leverages KVM capabilities to control the Second
Layer Address Translation (or the Two Dimensional Paging e.g., Intel's EPT or
AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) introduced with
the Kaby Lake (7th generation) architecture. This allows setting permissions on
memory pages in a complementary way to the guest kernel's managed memory
permissions. Once these permissions are set, they are locked and there is no
way back.

A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest kernel to lock
a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the
HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a specific
set of pages (allow-list approach), and the second only allows kernel execution
for a set of pages (deny-list approach).

The current implementation sets the whole kernel's .rodata (i.e., any const or
__ro_after_init variables, which includes critical security data such as LSM
parameters) and .text sections as non-writable, and the .text section is the
only one where kernel execution is allowed. This is possible thanks to the new
MBEC support also brought by this series (otherwise the vDSO would have to be
executable). Thanks to this hardware support (VT-x, EPT and MBEC), the
performance impact of such guest protection is negligible.

The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some of its
CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP),
which is another complementary hardening mechanism.

Heki can be enabled with the heki=1 boot command argument.




Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.


Good point, I'll check this confused deputy attack. Extended KVM 
protections should indeed handle all ways to map guests' memory. I'm 
wondering if current VMMs would gracefully handle such new restrictions 
though.





[ snip ]



# Current limitations

The main limitation of this patch series is the statically enforced
permissions. This is not an issue for kernels without modules but this needs to
be addressed.  Mechanisms that dynamically impact kernel executable memory are
not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
code will need to be authenticated.  Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want to
implement a code authentication mechanism in the hypervisor itself but delegate
this verification to something much less privileged. We are thinking of two
ways to solve this: implement this verification in the VMM or spawn a dedicated
special VM (similar to Windows's VBS). There are pros and cons to each approach:
complexity, verification code ownership (guest's or VMM's), access to guest
memory (i.e., confidential computing).


The kernel often creates writable aliases in order to write to
protected data (kernel text, etc). Some of this is done right as text
is being first written out (alternatives for example), and some happens
way later (jump labels, etc). So for verification, I wonder what stage
you would be verifying? If you want to verify the end state, you would
have to maintain knowledge in the verifier of all the touch-ups the
kernel does. I think it would get very tricky.


For now, in the static kernel case, all rodata and text GPA is 
restricted, so aliasing such memory in a writable way before or after 
the KVM enforcement would still restrict write access to this memory, 
which could be an issue but not a security one. Do you have such 
examples in mind?





It also seems it will be a decent ask for the guest kernel to keep
track of GPA permissions as well as normal virtual memory permissions,
if this thing is not widely used.


This would indeed be required to properly handle the dynamic cases.




So I'm wondering if you could go in two directions with this:
1. Make this a feature only for super locked down kernels (no modules,
etc). Forbid any configurations that might modify text. But eBPF is
used for seccomp, so you might be turning off some security protections
to get this.


Good idea. For "super locked down kernels" :) , we should disable all 
kernel executable changes with the related kernel build configuration 
(e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such 
legitimate access. This looks like an acceptable initial feature.




2. Loosen the rules to allow the protections to not be so one-

Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Mickaël Salaün



On 24/05/2023 23:04, Trilok Soni wrote:

On 5/5/2023 8:20 AM, Mickaël Salaün wrote:

Hi,

This patch series is a proof-of-concept that implements new KVM features
(extended page tracking, MBEC support, CR pinning) and defines a new API to
protect guest VMs. No VMM (e.g., Qemu) modification is required.

The main idea being that kernel self-protection mechanisms should be delegated
to a more privileged part of the system, hence the hypervisor. It is still the
role of the guest kernel to request such restrictions according to its


Only for the guest kernel images here? Why not for the host OS kernel?


As explained in the Future work section, protecting the host would be 
useful, but that doesn't really fit with the KVM model. The Protected 
KVM project is a first step to help in this direction [11].


In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel 
is also part of the hypervisor.




Embedded devices w/ Android you have mentioned below supports the host
OS as well it seems, right?


What do you mean?




Do we suggest that all the functionalities should be implemented in the
Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM).


KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means 
that we may not control the related code.


This patch series is dedicated to hypervisor-enforced kernel integrity, 
then KVM.




I am hoping that whatever we suggest the interface here from the Guest
to the Hypervisor becomes the ABI right?


Yes, hypercalls are part of the KVM ABI.






# Current limitations

The main limitation of this patch series is the statically enforced
permissions. This is not an issue for kernels without modules but this needs to
be addressed.  Mechanisms that dynamically impact kernel executable memory are
not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
code will need to be authenticated.  Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want to
implement a code authentication mechanism in the hypervisor itself but delegate
this verification to something much less privileged. We are thinking of two
ways to solve this: implement this verification in the VMM or spawn a dedicated
special VM (similar to Windows's VBS). There are pros and cons to each approach:
complexity, verification code ownership (guest's or VMM's), access to guest
memory (i.e., confidential computing).


Do you foresee the performance regressions due to lot of tracking here?


The performance impact of execution prevention should be negligible
because once configured the hypervisor does nothing except catch
illegitimate access attempts.




Production kernels do have lot of tracepoints and we use it as feature
in the GKI kernel for the vendor hooks implementation and in those cases
every vendor driver is a module.


As explained in this section, dynamic kernel modifications such as 
tracepoints or modules are not currently supported by this patch series. 
Handling tracepoints is possible but requires more work to define and 
check legitimate changes. This proposal is still useful for static 
kernels though.




Separate VM further fragments this
design and delegates more of it to proprietary solutions?


What do you mean? KVM is not a proprietary solution.

For dynamic checks, this would require code not run by KVM itself, but 
either the VMM or a dedicated VM. In this case, the dynamic 
authentication code could come from the guest VM or from the VMM itself. 
In the former case, it is more challenging from a security point of view 
but doesn't rely on external (proprietary) solution. In the latter case, 
open-source VMMs should implement the specification to provide the 
required service (e.g. check kernel module signature).


The goal of the common API layer provided by this RFC is to share code 
as much as possible between different hypervisor backends.





Do you have any performance numbers w/ current RFC?


No, but the only hypervisor performance impact is at boot time and 
should be negligible. I'll try to get some numbers for the 
hardware-enforcement impact, but it should be negligible too.




Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

2023-05-05 Thread Mickaël Salaün



On 05/05/2023 18:44, Sean Christopherson wrote:

On Fri, May 05, 2023, Mickaël Salaün wrote:

Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
set EPT permissions on a set of page ranges.


IMO, manipulation of protections, both for memory (this patch) and CPU state
(control registers in the next patch) should come from userspace.  I have no
objection to KVM providing plumbing if necessary, but I think userspace needs
to have full control over the actual state.


By user space, do you mean the host user space or the guest user space?

About the guest user space, I see several issues to delegate this kind 
of control:

- These are restrictions only relevant to the kernel.
- The threat model is to protect against user space as early as possible.
- It would be more complex for no obvious gain.

This patch series is an extension of the kernel self-protections 
mechanisms, and they are not configured by user space.





One of the things that caused Intel's control register pinning series to stall
out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
means the kernel doesn't need to define policy, e.g. when to unprotect memory,
and avoids questions like "should userspace be able to overwrite pinned control
registers".


The idea is to authenticate every change. For kexec, the VMM (or
something else) would have to authenticate the new kernel. Do you have 
something else in mind that could legitimately require such memory or CR 
changes?





And like the confidential VM use case, keeping userspace in the loop is a big
benefit, e.g. the guest can't circumvent protections by coercing userspace into
writing to protected memory.


I don't understand this part. Are you talking about the host user space?
How could the guest circumvent protections?




Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Mickaël Salaün



On 05/05/2023 18:28, Sean Christopherson wrote:

On Fri, May 05, 2023, Mickaël Salaün wrote:

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..a7fb4ff888e6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,6 +3,7 @@
  #define _ASM_X86_KVM_PAGE_TRACK_H
  
  enum kvm_page_track_mode {

+   KVM_PAGE_TRACK_PREWRITE,


Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

   Drop "support" for multiple page-track modes, as there is no evidence
   that array-based and refcounted metadata is the optimal solution for
   other modes, nor is there any evidence that other use cases, e.g. for
   access-tracking, will be a good fit for the page-track machinery in
   general.

   E.g. one potential use case of access-tracking would be to prevent guest
   access to poisoned memory (from the guest's perspective).  In that case,
   the number of poisoned pages is likely to be a very small percentage of
   the guest memory, and there is no need to reference count the number of
   access-tracking users, i.e. expanding gfn_track[] for a new mode would be
   grossly inefficient.  And for poisoned memory, host userspace would also
   likely want to trap accesses, e.g. to inject #MC into the guest, and that
   isn't currently supported by the page-track framework.

   A better alternative for that poisoned page use case is likely a
   variation of the proposed per-gfn attributes overlay (linked), which
   would allow efficiently tracking the sparse set of poisoned pages, and by
   default would exit to userspace on access.

Of particular relevance:

  - Using the page-track machinery is inefficient because the guest is likely
    going to write-protect a minority of its memory.  And this

      select KVM_EXTERNAL_WRITE_TRACKING if KVM

    is particularly nasty because simply enabling HEKI in the Kconfig will cause
    KVM to allocate rmaps and gfn tracking.

  - There's no need to reference count the protection, i.e. 15 of the 16 bits of
    gfn_track are dead weight.

  - As proposed, adding a second "mode" would double the cost of gfn tracking.

  - Tying the protections to the memslots will create an impossible-to-maintain
    ABI because the protections will be lost if the owning memslot is deleted and
    recreated.

  - The page-track framework provides incomplete protection and will lead to an
    ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
    adding calls to kvm_page_track_prewrite(), but misses things like
    kvm_vcpu_map().

  - The scaling and maintenance issues will only get worse if/when someone tries
    to support dropping read and/or execute permissions, e.g. for execute-only.

  - The code is x86-only, and is likely to stay that way for the foreseeable
    future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs.  This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer.

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound.  And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.
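
The cost argument above can be put in numbers with a small sketch. The
16-bit per-page counter matches the "15 of the 16 bits" remark; the
sparse record (struct prot_range) is a hypothetical stand-in for a
range-based store and is not the memory attributes implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12

/*
 * Bytes of per-page counters when every 4 KiB guest page gets one 16-bit
 * counter per tracking mode, as with array-based gfn tracking.
 */
static uint64_t gfn_track_bytes(uint64_t guest_mem_bytes, unsigned int nr_modes)
{
	uint64_t nr_pages = guest_mem_bytes >> PAGE_SHIFT_4K;

	return nr_pages * sizeof(uint16_t) * nr_modes;
}

/* Hypothetical sparse record: one entry per protected range, not per page. */
struct prot_range {
	uint64_t first_gfn;
	uint64_t last_gfn;
	uint64_t attrs;
};

/* Bytes needed to describe nr_ranges protected ranges sparsely. */
static uint64_t sparse_bytes(size_t nr_ranges)
{
	return (uint64_t)nr_ranges * sizeof(struct prot_range);
}
```

For a 4 GiB guest this gives 2 MiB of counters per mode, and 4 MiB once
a second mode is added, regardless of how few pages are actually
protected. A handful of range records costs a few hundred bytes.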

[1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
[2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
[3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com


I agree, I used this mechanism because it was easier at first to rely on 
a previous work, but while I was working on the MBEC support, I realized 
that it's not the optimal way to do it.


I was thinking about using a new special EPT bit similar to 
EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you 
think?




[PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

2023-05-05 Thread Mickaël Salaün
Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
set EPT permissions on a set of page ranges.

This hypercall takes three arguments.  The first contains the GPA
pointing to an array of struct heki_pa_range.  The second argument is
the size of the array in bytes, not the number of elements.  The third
argument is for future-proofing and is designed to contain optional
flags (e.g. to change the array type), but must be zero for now.

The struct heki_pa_range contains a GFN that starts the range and
another that indicates the last (included) page.  A bit field of
attributes is tied to this range.

The HEKI_ATTR_MEM_NOWRITE attribute is interpreted as a removal of the
EPT write permission to deny any write access from the guest through its
lifetime.  We choose "nowrite" because "read-only" excludes
execution, it follows a deny-list approach, and most importantly because
it is an incremental addition to the status quo (i.e., everything is
allowed from the TDP point of view).  This is implemented thanks to the
KVM_PAGE_TRACK_PREWRITE mode previously introduced.

The page ranges recording is currently implemented with a static array
of 16 elements to make it simple, but this mechanism will be dynamic in
a follow-up.

Define a kernel command line parameter "heki" to turn the feature on or
off.  By default, Heki is turned on.
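
As a rough illustration of the argument layout described above, here is
a self-contained sketch. The field names of struct heki_pa_range, the
bit chosen for HEKI_ATTR_MEM_NOWRITE, and the helper functions are
assumptions made for this example; only the three-field shape (first
GFN, last included GFN, attribute bits), the byte-sized second argument,
the must-be-zero flags, and the 16-entry limit come from the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HEKI_ATTR_MEM_NOWRITE (UINT64_C(1) << 0) /* bit position assumed */
#define HEKI_GFN_MAX 16 /* assumed to be the "static array of 16 elements" */

/* Field names are assumptions; the description only fixes the three roles. */
struct heki_pa_range {
	uint64_t gfn_start;  /* first page of the range */
	uint64_t gfn_end;    /* last page of the range, included */
	uint64_t attributes; /* bit field of HEKI_ATTR_* flags */
};

/* The second hypercall argument is a size in bytes, not an element count. */
static size_t heki_ranges_size(size_t nr_ranges)
{
	return nr_ranges * sizeof(struct heki_pa_range);
}

/* Checks a host could apply: zero flags, whole elements, static limit. */
static bool heki_args_valid(size_t size, uint64_t flags)
{
	if (flags != 0)
		return false;
	if (size == 0 || size % sizeof(struct heki_pa_range) != 0)
		return false;
	return size / sizeof(struct heki_pa_range) <= HEKI_GFN_MAX;
}

/* True if gfn falls inside any range carrying the no-write attribute. */
static bool heki_gfn_no_write(const struct heki_pa_range *ranges, size_t n,
			      uint64_t gfn)
{
	for (size_t i = 0; i < n; i++) {
		if ((ranges[i].attributes & HEKI_ATTR_MEM_NOWRITE) &&
		    gfn >= ranges[i].gfn_start && gfn <= ranges[i].gfn_end)
			return true;
	}
	return false;
}
```

Note that both bounds are inclusive, so a single-page range has
gfn_start equal to gfn_end.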

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-5-...@digikod.net
---
 Documentation/virt/kvm/x86/hypercalls.rst |  17 +++
 arch/x86/kvm/x86.c| 169 ++
 include/linux/kvm_host.h  |  13 ++
 include/uapi/linux/kvm_para.h |   1 +
 virt/kvm/kvm_main.c   |   4 +
 5 files changed, 204 insertions(+)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 10db7924720f..0ec79cc77f53 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -190,3 +190,20 @@ the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
 before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID.  In
 addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
 must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
+
+9. KVM_HC_LOCK_MEM_PAGE_RANGES
+--
+
+:Architecture: x86
+:Status: active
+:Purpose: Request memory page ranges to be restricted.
+
+- a0: physical address of a struct heki_pa_range array
+- a1: size of the array
+- a2: optional flags, must be 0 for now
+
+The hypercall lets a guest request memory permissions to be removed for itself,
+identified with a set of physical page ranges (GFNs).  The HEKI_ATTR_MEM_NOWRITE
+memory page range attribute forbids related modification to the guest.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd05f42c9913..ffab64d08de3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -59,6 +59,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -9596,6 +9597,161 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
return;
 }
 
+#ifdef CONFIG_HEKI
+
+static int heki_page_track_add(struct kvm *const kvm, const gfn_t gfn,
+  const enum kvm_page_track_mode mode)
+{
+   struct kvm_memory_slot *slot;
+   int idx;
+
+   BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING));
+
+   idx = srcu_read_lock(&kvm->srcu);
+   slot = gfn_to_memslot(kvm, gfn);
+   if (!slot) {
+   srcu_read_unlock(&kvm->srcu, idx);
+   return -EINVAL;
+   }
+
+   write_lock(&kvm->mmu_lock);
+   kvm_slot_page_track_add_page(kvm, slot, gfn, mode);
+   write_unlock(&kvm->mmu_lock);
+   srcu_read_unlock(&kvm->srcu, idx);
+   return 0;
+}
+
+static bool
+heki_page_track_prewrite(struct kvm_vcpu *const vcpu, const gpa_t gpa,
+struct kvm_page_track_notifier_node *const node)
+{
+   const gfn_t gfn = gpa_to_gfn(gpa);
+   const struct kvm *const kvm = vcpu->kvm;
+   size_t i;
+
+   /* Checks if it is our own tracked pages, or those of someone else. */
+   for (i = 0; i < HEKI_GFN_MAX; i++) {
+   if (gfn >= kvm->heki_gfn_no_write[i].start &&
+   gfn <= kvm->heki_gfn_no_write[i].end)
+   return false;
+   }
+
+   return true;
+}
+
+static int kvm_heki_init_vm(struct kvm *const kvm)
+{
+   struct kvm_page_track_notifier_node *const node =
+   kzalloc(sizeof(*node), GFP_KERNEL);
+
+   if

[PATCH v1 6/9] KVM: x86: Add Heki hypervisor support

2023-05-05 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Each supported hypervisor in x86 implements a struct x86_hyper_init to
define the init functions for the hypervisor.  Define a new init_heki()
entry point in struct x86_hyper_init.  Hypervisors that support Heki
must define this init_heki() function.  Call init_heki() of the chosen
hypervisor in init_hypervisor_platform().

Create a heki_hypervisor structure that each hypervisor can fill
with its data and functions. This will allow the Heki feature to work
in a hypervisor agnostic way.

Declare and initialize a "heki_hypervisor" structure for KVM so KVM can
support Heki.  Define the init_heki() function for KVM.  In init_heki(),
set the hypervisor field in the generic "heki" structure to the KVM
"heki_hypervisor".  After this point, generic Heki code can access the
KVM Heki data and functions.
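
The registration pattern described above can be sketched outside the
kernel as follows. The two callback names and their signatures follow
the kvm.c hunk of this patch; the demo_* backend, the layout of struct
heki, and heki_protect() are hypothetical and only illustrate how
generic code can stay hypervisor agnostic.

```c
#include <stddef.h>

struct heki_pa_range; /* opaque here; only pointers are passed around */

/* Callbacks a hypervisor backend fills in (names as in the kvm.c hunk). */
struct heki_hypervisor {
	int (*protect_ranges)(struct heki_pa_range *ranges, int num_ranges);
	int (*lock_crs)(void);
};

/* Generic state: a NULL hypervisor means no backend registered itself. */
struct heki {
	struct heki_hypervisor *hypervisor;
};

static struct heki heki;

/* Hypothetical backend standing in for KVM in this sketch. */
static int demo_protect_ranges(struct heki_pa_range *ranges, int num_ranges)
{
	(void)ranges;
	return num_ranges >= 0 ? 0 : -1;
}

static int demo_lock_crs(void)
{
	return 0;
}

static struct heki_hypervisor demo_hypervisor = {
	.protect_ranges = demo_protect_ranges,
	.lock_crs = demo_lock_crs,
};

/* What a backend's init_heki() does: publish its callbacks. */
static void demo_init_heki(void)
{
	heki.hypervisor = &demo_hypervisor;
}

/* Generic code only goes through the table; -1 if nothing is registered. */
static int heki_protect(struct heki_pa_range *ranges, int num_ranges)
{
	if (!heki.hypervisor || !heki.hypervisor->protect_ranges)
		return -1;
	return heki.hypervisor->protect_ranges(ranges, num_ranges);
}
```

Keeping the hypervisor pointer NULL until a backend opts in lets the
generic side treat "no supporting hypervisor" as a normal case.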

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mickaël Salaün 
Signed-off-by: Mickaël Salaün 
Signed-off-by: Madhavan T. Venkataraman 
Link: https://lore.kernel.org/r/20230505152046.6575-7-...@digikod.net
---
 arch/x86/include/asm/x86_init.h  |  2 +
 arch/x86/kernel/cpu/hypervisor.c |  1 +
 arch/x86/kernel/kvm.c| 72 
 arch/x86/kernel/x86_init.c   |  1 +
 arch/x86/kvm/Kconfig |  1 +
 virt/heki/Kconfig|  9 +++-
 virt/heki/heki.c |  6 ---
 7 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c1c8c581759d..0fc5041a66c6 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -119,6 +119,7 @@ struct x86_init_pci {
  * @msi_ext_dest_id:   MSI supports 15-bit APIC IDs
  * @init_mem_mapping:  setup early mappings during init_mem_mapping()
  * @init_after_bootmem:guest init after boot allocator is finished
+ * @init_heki: Hypervisor enforced kernel integrity
  */
 struct x86_hyper_init {
void (*init_platform)(void);
@@ -127,6 +128,7 @@ struct x86_hyper_init {
bool (*msi_ext_dest_id)(void);
void (*init_mem_mapping)(void);
void (*init_after_bootmem)(void);
+   void (*init_heki)(void);
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 553bfbfc3a1b..6085c8129e0c 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -106,4 +106,5 @@ void __init init_hypervisor_platform(void)
 
x86_hyper_type = h->type;
x86_init.hyper.init_platform();
+   x86_init.hyper.init_heki();
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984da..e53cebdcf3ac 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -866,6 +867,45 @@ static void __init kvm_guest_init(void)
hardlockup_detector_disable();
 }
 
+#ifdef CONFIG_HEKI
+
+static int kvm_protect_ranges(struct heki_pa_range *ranges, int num_ranges)
+{
+   size_t size;
+   long err;
+
+   WARN_ON(in_interrupt());
+
+   size = sizeof(ranges[0]) * num_ranges;
+   err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, __pa(ranges), size, 0);
+   if (WARN(err, "Failed to enforce memory protection: %ld\n", err))
+   return err;
+
+   return 0;
+}
+
+extern unsigned long cr4_pinned_mask;
+
+/*
+ * TODO: Check SMP policy consistency, e.g. with
+ * this_cpu_read(cpu_tlbstate.cr4)
+ */
+static int kvm_lock_crs(void)
+{
+   unsigned long cr4;
+   int err;
+
+   err = kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP);
+   if (err)
+   return err;
+
+   cr4 = __read_cr4();
+   err = kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 4, cr4 & cr4_pinned_mask);
+   return err;
+}
+
+#endif /* CONFIG_HEKI */
+
 static noinline uint32_t __kvm_cpuid_base(void)
 {
if (boot_cpu_data.cpuid_level < 0)
@@ -999,6 +1039,37 @@ static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_HEKI
+
+static struct heki_hypervisor kvm_heki_hypervisor = {
+   .protect_ranges = kvm_protect_ranges,
+   .lock_crs = kvm_lock_crs,
+};
+
+static void kvm_init_heki(void)
+{
+   long err;
+
+   if (!kvm_para_available())
+   /* Cannot make KVM hypercalls. */
+   return;
+
+   err = kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES, -1, -1, -1);
+   if (err == -KVM_ENOSYS)
+   /* Ignores host. */
+   return;
+
+   heki.hypervisor = &kvm_heki_hypervisor;
+}
+
+#else /* CONFIG_HEKI */
+
+static void kvm_init_heki(void)
+{
+}
+
+#endif /* CONFIG_HEKI */
+
 const __initconst str

[PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Mickaël Salaün
Add a new page tracking mode to deny a page update and throw a page
fault to the guest.  This is useful for KVM to be able to make some
pages non-writable (not read-only because it doesn't imply execution
restrictions), see the next Heki commits.

This kind of synthetic kernel page fault needs to be handled by the
guest, which is not currently the case, making it try again and again.
This will be part of a follow-up patch series.

Update emulator_read_write_onepage() to handle X86EMUL_CONTINUE and
X86EMUL_PROPAGATE_FAULT.

Update page_fault_handle_page_track() to call
kvm_slot_page_track_is_active() whenever this is required for
KVM_PAGE_TRACK_PREWRITE and KVM_PAGE_TRACK_WRITE, even if one tracker
already returned true.
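
One way to picture the prewrite callbacks is a chain in which every
registered tracker is consulted and a single refusal is enough to deny
the write. The sketch below is an assumption about that behaviour, not
the code from this patch: struct track_node, page_track_prewrite() and
the two sample callbacks are hypothetical, and only the callback's
"return false to refuse" convention is taken from the Heki notifier
shown elsewhere in the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* Hypothetical notifier node: a singly linked list of prewrite callbacks. */
struct track_node {
	struct track_node *next;
	bool (*track_prewrite)(gpa_t gpa, struct track_node *node);
};

/*
 * Calls every registered callback, even after one of them has answered,
 * and allows the write only if none of them refused it.
 */
static bool page_track_prewrite(struct track_node *head, gpa_t gpa)
{
	bool allowed = true;

	for (struct track_node *n = head; n; n = n->next) {
		if (n->track_prewrite && !n->track_prewrite(gpa, n))
			allowed = false;
	}
	return allowed;
}

/* Sample tracker that never objects. */
static bool allow_all(gpa_t gpa, struct track_node *node)
{
	(void)gpa;
	(void)node;
	return true;
}

/* Sample tracker that refuses writes to guest page frame 1. */
static bool deny_page_one(gpa_t gpa, struct track_node *node)
{
	(void)node;
	return (gpa >> 12) != 1;
}
```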

Invert the return code semantic for read_emulate() and write_emulate():
- from 1=Ok 0=Error
- to X86EMUL_* return codes (e.g. X86EMUL_CONTINUE == 0)
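
The inversion matters because a caller written for the old convention
silently misreads the new one. A small sketch, assuming local stand-in
constants: only X86EMUL_CONTINUE == 0 is stated above, the other two
values and all four helper functions are illustrative and do not come
from the patch.

```c
#include <stdbool.h>

/* Local stand-ins; only X86EMUL_CONTINUE == 0 is given in the changelog. */
enum {
	X86EMUL_CONTINUE = 0,
	X86EMUL_UNHANDLEABLE = 1,
	X86EMUL_PROPAGATE_FAULT = 2,
};

/* Old convention: 1 means ok, 0 means error. */
static int old_write_emulate(bool ok)
{
	return ok ? 1 : 0;
}

/* New convention: 0 means continue, non-zero names the failure. */
static int new_write_emulate(bool ok, bool denied_by_tracker)
{
	if (denied_by_tracker)
		return X86EMUL_PROPAGATE_FAULT;
	return ok ? X86EMUL_CONTINUE : X86EMUL_UNHANDLEABLE;
}

/* The success test flips along with the return values. */
static bool old_caller_succeeded(int ret)
{
	return ret != 0;
}

static bool new_caller_succeeded(int ret)
{
	return ret == X86EMUL_CONTINUE;
}
```

Feeding a new-style success (0) to the old-style check reports a
failure, which is why every caller has to be converted together with the
helpers.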

Imported the prewrite page tracking support part originally written by
Mihai Donțu, Marian Rotariu, and Ștefan Șicleru:
https://lore.kernel.org/r/20211006173113.26445-27-ala...@bitdefender.com
https://lore.kernel.org/r/20211006173113.26445-28-ala...@bitdefender.com
Removed the GVA changes for page tracking, removed the
X86EMUL_RETRY_INSTR case, and some emulation part for now.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Marian Rotariu 
Cc: Mihai Donțu 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Ștefan Șicleru 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-3-...@digikod.net
---
 arch/x86/include/asm/kvm_page_track.h | 12 +
 arch/x86/kvm/mmu/mmu.c                | 64 ++-
 arch/x86/kvm/mmu/page_track.c         | 33 +-
 arch/x86/kvm/mmu/spte.c               |  6 +++
 arch/x86/kvm/x86.c                    | 27 +++
 5 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..a7fb4ff888e6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_KVM_PAGE_TRACK_H
 
 enum kvm_page_track_mode {
+   KVM_PAGE_TRACK_PREWRITE,
KVM_PAGE_TRACK_WRITE,
KVM_PAGE_TRACK_MAX,
 };
@@ -22,6 +23,16 @@ struct kvm_page_track_notifier_head {
 struct kvm_page_track_notifier_node {
struct hlist_node node;
 
+   /*
+* It is called when guest is writing the write-tracked page
+* and the write emulation has not happened yet.
+*
+* @vcpu: the vcpu where the write access happened
+* @gpa: the physical address written by guest
+* @node: this node
+*/
+   bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa,
+  struct kvm_page_track_notifier_node *node);
/*
 * It is called when guest is writing the write-tracked page
 * and write emulation is finished at that time.
@@ -73,6 +84,7 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 void
 kvm_page_track_unregister_notifier(struct kvm *kvm,
   struct kvm_page_track_notifier_node *n);
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 835426254e76..e5d1e241ff0f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -793,9 +793,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);
 
/* the non-leaf shadow pages are keeping readonly. */
-   if (sp->role.level > PG_LEVEL_4K)
-       return kvm_slot_page_track_add_page(kvm, slot, gfn,
-                                           KVM_PAGE_TRACK_WRITE);
+   if (sp->role.level > PG_LEVEL_4K) {
+       kvm_slot_page_track_add_page(kvm, slot, gfn,
+                                    KVM_PAGE_TRACK_PREWRITE);
+       kvm_slot_page_track_add_page(kvm, slot, gfn,
+                                    KVM_PAGE_TRACK_WRITE);
+       return;
+   }
 
kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
@@ -840,9 +844,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
-   if (sp->role.level > PG_LEVEL_4K)
-       return kvm_slot_page_track_remove_page(kvm, slot, gfn,
-                                              KVM_PAGE_TRACK_WRITE);
+   if (sp->role.level > PG_LEVEL_4K) {
+   kvm_s

[PATCH v1 3/9] virt: Implement Heki common code

2023-05-05 Thread Mickaël Salaün
From: Madhavan T. Venkataraman 

Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use
the hypervisor to enhance guest virtual machine security.

Configuration
=============

Define the config variables for the feature. This feature depends on
support from the architecture as well as the hypervisor.

Enabling HEKI
=============

Define a kernel command line parameter "heki" to turn the feature on or
off. By default, Heki is on.

Feature initialization
======================

The linker script, vmlinux.lds.S, defines a number of sections that are
loaded in kernel memory. Each of these sections has its own permissions.
For instance, .text has HEKI_ATTR_MEM_EXEC | HEKI_ATTR_MEM_NOWRITE, and
.rodata has HEKI_ATTR_MEM_NOWRITE.

Define an architecture specific init function, heki_arch_init(). In this
function, collect the ranges of all of the sections. These sections will
be protected in the host page table with their respective permissions so
that even if the guest kernel is compromised, their permissions cannot
be changed.

Define heki_early_init() to initialize the feature. For now, this
function just checks if the feature is enabled and calls
heki_arch_init().

Define heki_late_init() that protects the sections in the host page
table. This needs hypervisor support which will be introduced in the
future.  This function is called at the end of kernel init.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Mickaël Salaün 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Madhavan T. Venkataraman 
Link: https://lore.kernel.org/r/20230505152046.6575-4-...@digikod.net
---
 Kconfig                         |   2 +
 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/sections.h |   4 +
 arch/x86/kernel/setup.c         |  49 
 include/linux/heki.h            |  90 +
 init/main.c                     |   3 +
 virt/Makefile                   |   1 +
 virt/heki/Kconfig               |  22 ++
 virt/heki/Makefile              |   3 +
 virt/heki/heki.c                | 135 
 10 files changed, 310 insertions(+)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/heki.c

diff --git a/Kconfig b/Kconfig
index 745bc773f567..0c844d9bcb03 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"
 
 source "lib/Kconfig.debug"
 
+source "virt/heki/Kconfig"
+
 source "Documentation/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..5cf5a7a97811 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -297,6 +297,7 @@ config X86
select FUNCTION_ALIGNMENT_4B
    imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+   select ARCH_SUPPORTS_HEKI   if X86_64
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index a6e8373a5170..42ef1e33b8a5 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -18,6 +18,10 @@ extern char __end_of_kernel_reserve[];
 
 extern unsigned long _brk_start, _brk_end;
 
+extern int __start_orc_unwind_ip[], __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[], __stop_orc_unwind[];
+extern unsigned int orc_lookup[], orc_lookup_end[];
+
 static inline bool arch_is_kernel_initmem_freed(unsigned long addr)
 {
/*
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 88188549647c..f0ddaf24ab63 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -850,6 +851,54 @@ static void __init x86_report_nx(void)
}
 }
 
+#ifdef CONFIG_HEKI
+
+/*
+ * Gather all of the statically defined sections so heki_late_init() can
+ * protect these sections in the host page table.
+ *
+ * The sections are defined under "SECTIONS" in vmlinux.lds.S
+ * Keep this array in sync with SECTIONS.
+ */
+struct heki_va_range __initdata heki_va_ranges[] = {
+   {
+       .va_start = _stext,
+       .va_end = _etext,
+       .attributes = HEKI_ATTR_MEM_NOWRITE | HEKI_ATTR_MEM_EXEC,
+   },
+   {
+       .va_start = __start_rodata,
+       .va_end = __end_rodata,
+       .attributes = HEKI_ATTR_MEM_NOWRITE,
+   },
+#ifdef CONFIG_UNWINDER_ORC
+   {
+       .va_start = __start_orc_unwind_ip,
+       .va_end = __stop_orc_unwind_ip,
+       .attributes = HEKI_ATTR_MEM_NOWRITE,
+   },
+   {
+       .va_start = __start_orc_unwind,
+   .va_end = __stop_orc

[PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers

2023-05-05 Thread Mickaël Salaün
This enables guests to lock their CR0 and CR4 registers with a subset of
X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE
and X86_CR4_CET flags.

The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments.  The first
is to identify the control register, and the second is a bit mask to
pin (i.e. mark as read-only).

These register flags should already be pinned by Linux guests, but once
a guest is compromised, that self-protection mechanism could be
disabled.  Pins set through this dedicated hypercall cannot.
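
As an illustration of what pinning means for a later register write, here is a
minimal sketch.  It is not the patch's heki_check_cr(), whose body is not shown
in this excerpt; the helper name is hypothetical and the check is only one
plausible way to enforce the pinned mask.  The bit positions are the
architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions for the flags named above. */
#define X86_CR0_WP   (UINT64_C(1) << 16)
#define X86_CR4_SMEP (UINT64_C(1) << 20)
#define X86_CR4_SMAP (UINT64_C(1) << 21)

/*
 * Hypothetical helper: a write is acceptable only if every pinned bit
 * is still set in the new register value.
 */
static bool cr_write_keeps_pins(uint64_t pinned, uint64_t new_value)
{
    return (new_value & pinned) == pinned;
}
```

With SMEP and SMAP pinned, a write that keeps both bits passes the check, while
a write that clears SMEP is rejected even if SMAP stays set.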

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-6-...@digikod.net
---
 Documentation/virt/kvm/x86/hypercalls.rst | 15 +
 arch/x86/kernel/cpu/common.c              |  2 +-
 arch/x86/kvm/vmx/vmx.c                    | 10 
 arch/x86/kvm/x86.c                        | 72 +++
 arch/x86/kvm/x86.h                        | 16 +
 include/linux/kvm_host.h                  |  3 +
 include/uapi/linux/kvm_para.h             |  1 +
 7 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 0ec79cc77f53..8aa5d28986e3 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -207,3 +207,18 @@ identified with set of physical page ranges (GFNs).  The HEKI_ATTR_MEM_NOWRITE
 memory page range attribute forbids related modification to the guest.
 
 Returns 0 on success or a KVM error code otherwise.
+
+10. KVM_HC_LOCK_CR_UPDATE
+-------------------------
+
+:Architecture: x86
+:Status: active
+:Purpose: Request some control registers to be restricted.
+
+- a0: identify a control register
+- a1: bit mask to make some flags read-only
+
+The hypercall lets a guest request control register flags to be pinned for
+itself.
+
+Returns 0 on success or a KVM error code otherwise.
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f3cc7699e1e1..dd89379fe5ac 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -413,7 +413,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 }
 
 /* These bits should not change their value after CPU init is finished. */
-static const unsigned long cr4_pinned_mask =
+const unsigned long cr4_pinned_mask =
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
X86_CR4_FSGSBASE | X86_CR4_CET;
 static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9870db887a62..931688edc8eb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3162,6 +3162,11 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long hw_cr0, old_cr0_pg;
u32 tmp;
+   int res;
+
+   res = heki_check_cr(vcpu->kvm, 0, cr0);
+   if (res)
+       return;
 
old_cr0_pg = kvm_read_cr0_bits(vcpu, X86_CR0_PG);
 
@@ -3323,6 +3328,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 * this bit, even if host CR4.MCE == 0.
 */
unsigned long hw_cr4;
+   int res;
+
+   res = heki_check_cr(vcpu->kvm, 4, cr4);
+   if (res)
+       return;
 
hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
if (is_unrestricted_guest(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffab64d08de3..a529455359ac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
return value;
 }
 
+#ifdef CONFIG_HEKI
+
+extern unsigned long cr4_pinned_mask;
+
+static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr,
+   unsigned long pin)
+{
+   if (!pin)
+       return -KVM_EINVAL;
+
+   switch (cr) {
+   case 0:
+       /* Cf. arch/x86/kernel/cpu/common.c */
+       if (!(pin & X86_CR0_WP))
+           return -KVM_EINVAL;
+
+       if ((read_cr0() & pin) != pin)
+           return -KVM_EINVAL;
+
+       atomic_long_or(pin, &kvm->heki_pinned_cr0);
+       return 0;
+   case 4:
+       /* Checks for irrelevant bits. */
+       if ((pin & cr4_pinned_mask) != pin)
+           return -KVM_EINVAL;
+
+       /* Ignores bits not present in host. */
+       pin &= __read_cr4();
+       atomic_long_or(pin, &kvm->heki_pinned_cr4);
+       return 0;
+   }
+   return -KVM_EINVAL;
+}
+
+int heki_check_cr(const struct kvm *const kvm, const unsigned long cr,
+

[PATCH v1 1/9] KVM: x86: Add kvm_x86_ops.fault_gva()

2023-05-05 Thread Mickaël Salaün
This function is needed for kvm_mmu_page_fault() to create synthetic
page faults.

Code originally written by Mihai Donțu and Nicușor Cîțu:
https://lore.kernel.org/r/20211006173113.26445-18-ala...@bitdefender.com
Renamed fault_gla() to fault_gva() and switched to the new
EPT_VIOLATION_GVA_IS_VALID.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Co-developed-by: Mihai Donțu 
Signed-off-by: Mihai Donțu 
Co-developed-by: Nicușor Cîțu 
Signed-off-by: Nicușor Cîțu 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-2-...@digikod.net
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/svm/svm.c             |  9 +
 arch/x86/kvm/vmx/vmx.c             | 10 ++
 4 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index abccd51dcfca..b761182a9444 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP(fault_gva)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6aaae18f1854..f319bcdeb8bd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1706,6 +1706,8 @@ struct kvm_x86_ops {
 * Returns vCPU specific APICv inhibit reasons
 */
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+   u64 (*fault_gva)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9a194aa1a75a..8b47b38aaf7f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4700,6 +4700,13 @@ static int svm_vm_init(struct kvm *kvm)
return 0;
 }
 
+static u64 svm_fault_gva(struct kvm_vcpu *vcpu)
+{
+   const struct vcpu_svm *svm = to_svm(vcpu);
+
+   return svm->vcpu.arch.cr2 ? svm->vcpu.arch.cr2 : ~0ull;
+}
+
 static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = "kvm_amd",
 
@@ -4826,6 +4833,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+
+   .fault_gva = svm_fault_gva,
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7eec0226d56a..9870db887a62 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8067,6 +8067,14 @@ static void vmx_vm_destroy(struct kvm *kvm)
    free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
 }
 
+static u64 vmx_fault_gva(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
+       return vmcs_readl(GUEST_LINEAR_ADDRESS);
+
+   return ~0ull;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
.name = "kvm_intel",
 
@@ -8204,6 +8212,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,
 
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+   .fault_gva = vmx_fault_gva,
 };
 
 static unsigned int vmx_handle_intel_pt_intr(void)
-- 
2.40.1




[RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-05 Thread Mickaël Salaün
]).

Extending register pinning (e.g., MSRs).

Being able to protect nested guests might be possible but we need to figure out
the potential security implications.

Protecting the host would be useful, but that doesn't really fit with the KVM
model. The Protected KVM project is a first step to help in this direction
[11].

We only tested this with an Intel CPU, but this approach should work the same
with an AMD CPU starting with the Zen 2 generation and their Guest Mode Execute
Trap (GMET) capability.

We also kept some TODOs to highlight missing checks and code sharing issues,
and some pr_warn() calls to help understand how it works. Tests need to be
improved (e.g., invalid hypercall arguments).

We'll present this work at the Linux Security Summit North America next week.

[1] https://lore.kernel.org/all/20211006173113.26445-1-ala...@bitdefender.com/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.ander...@intel.com/
[4] https://github.com/intel/vbh
[5] https://sched.co/TmwN
[6] https://sched.co/eE3f
[7] https://lore.kernel.org/all/20200501185147.208192-1-yua...@google.com/
[8] https://sched.co/eE4F
[9] https://lore.kernel.org/kvm/20191003212400.31130-1-rick.p.edgeco...@intel.com/
[10] https://lpc.events/event/4/contributions/283/
[11] https://sched.co/eE24

Please reach out to us by replying to this thread, we're looking for
people to join and collaborate on this project!

Regards,

Madhavan T. Venkataraman (2):
  virt: Implement Heki common code
  KVM: x86: Add Heki hypervisor support

Mickaël Salaün (7):
  KVM: x86: Add kvm_x86_ops.fault_gva()
  KVM: x86/mmu: Add support for prewrite page tracking
  KVM: x86: Add new hypercall to set EPT permissions
  KVM: x86: Add new hypercall to lock control registers
  KVM: VMX: Add MBEC support
  KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC
  virt: Add Heki KUnit tests

 Documentation/virt/kvm/x86/hypercalls.rst |  34 +++
 Kconfig                                   |   2 +
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/asm/kvm-x86-ops.h        |   1 +
 arch/x86/include/asm/kvm_host.h           |   2 +
 arch/x86/include/asm/kvm_page_track.h     |  12 +
 arch/x86/include/asm/sections.h           |   4 +
 arch/x86/include/asm/vmx.h                |  11 +-
 arch/x86/include/asm/x86_init.h           |   2 +
 arch/x86/kernel/cpu/common.c              |   2 +-
 arch/x86/kernel/cpu/hypervisor.c          |   1 +
 arch/x86/kernel/kvm.c                     |  72 +
 arch/x86/kernel/setup.c                   |  49 +++
 arch/x86/kernel/x86_init.c                |   1 +
 arch/x86/kvm/Kconfig                      |   1 +
 arch/x86/kvm/mmu.h                        |   3 +-
 arch/x86/kvm/mmu/mmu.c                    | 105 ++-
 arch/x86/kvm/mmu/mmutrace.h               |  11 +-
 arch/x86/kvm/mmu/page_track.c             |  33 +-
 arch/x86/kvm/mmu/paging_tmpl.h            |  16 +-
 arch/x86/kvm/mmu/spte.c                   |  29 +-
 arch/x86/kvm/mmu/spte.h                   |  15 +-
 arch/x86/kvm/mmu/tdp_mmu.c                |  73 +
 arch/x86/kvm/mmu/tdp_mmu.h                |   4 +
 arch/x86/kvm/svm/svm.c                    |   9 +
 arch/x86/kvm/vmx/capabilities.h           |   7 +
 arch/x86/kvm/vmx/nested.c                 |   7 +
 arch/x86/kvm/vmx/vmx.c                    |  48 ++-
 arch/x86/kvm/vmx/vmx.h                    |   1 +
 arch/x86/kvm/x86.c                        | 352 +-
 arch/x86/kvm/x86.h                        |  23 ++
 include/linux/heki.h                      |  90 ++
 include/linux/kvm_host.h                  |  20 ++
 include/uapi/linux/kvm_para.h             |   2 +
 init/main.c                               |   3 +
 virt/Makefile                             |   1 +
 virt/heki/Kconfig                         |  41 +++
 virt/heki/Makefile                        |   3 +
 virt/heki/heki.c                          | 321 
 virt/kvm/kvm_main.c                       |   5 +
 40 files changed, 1377 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/heki.c


base-commit: c9c3395d5e3dcc6daee66c6908354d47bf98cb0c
-- 
2.40.1




[PATCH v1 7/9] KVM: VMX: Add MBEC support

2023-05-05 Thread Mickaël Salaün
This change adds support for VMX_FEATURE_MODE_BASED_EPT_EXEC (named
ept_mode_based_exec in /proc/cpuinfo and MBEC elsewhere), which makes it
possible to separate EPT execution bits for supervisor vs. user.  It
changes the semantics of VMX_EPT_EXECUTABLE_MASK from global execution
to kernel execution, and uses the VMX_EPT_USER_EXECUTABLE_MASK bit to
identify user execution.

The main use case is to be able to restrict kernel execution while
ignoring user space execution from the hypervisor point of view.
Indeed, user space execution can already be restricted by the guest
kernel.

This change enables MBEC but doesn't change the default configuration,
which is to allow execution for all guest memory.  However, the next
commit leverages MBEC to restrict kernel memory pages.

MBEC can be configured with the new "enable_mbec" module parameter, set
to true by default.  However, MBEC is disabled for L1 and L2 for now.

Replace EPT_VIOLATION_RWX_MASK (3 bits) with 4 dedicated
EPT_VIOLATION_READ, EPT_VIOLATION_WRITE, EPT_VIOLATION_KERNEL_INSTR, and
EPT_VIOLATION_USER_INSTR bits.

From the Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3C (System Programming Guide), Part 3:

SECONDARY_EXEC_MODE_BASED_EPT_EXEC (bit 22):
If either the "unrestricted guest" VM-execution control or the
"mode-based execute control for EPT" VM-execution control is 1, the
"enable EPT" VM-execution control must also be 1.

EPT_VIOLATION_KERNEL_INSTR_BIT (bit 5):
The logical-AND of bit 2 in the EPT paging-structure entries used to
translate the guest-physical address of the access causing the EPT
violation.  If the "mode-based execute control for EPT" VM-execution
control is 0, this indicates whether the guest-physical address was
executable. If that control is 1, this indicates whether the
guest-physical address was executable for supervisor-mode linear
addresses.

EPT_VIOLATION_USER_INSTR_BIT (bit 6):
If the "mode-based execute control" VM-execution control is 0, the value
of this bit is undefined. If that control is 1, this bit is the
logical-AND of bit 10 in the EPT paging-structures entries used to
translate the guest-physical address of the access causing the EPT
violation. In this case, it indicates whether the guest-physical address
was executable for user-mode linear addresses.

PT_USER_EXEC_MASK (bit 10):
Execute access for user-mode linear addresses. If the "mode-based
execute control for EPT" VM-execution control is 1, indicates whether
instruction fetches are allowed from user-mode linear addresses in the
512-GByte region controlled by this entry. If that control is 0, this
bit is ignored.
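
The permission bits quoted above can be decoded from an exit qualification
along these lines.  This is a minimal sketch: the struct and function names are
hypothetical, only the four bit numbers come from this patch, and treating user
execution as equal to bit 5 when MBEC is off is an interpretation of the SDM
text, not something the patch states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as introduced by this patch in asm/vmx.h. */
#define EPT_VIOLATION_READ_BIT         3
#define EPT_VIOLATION_WRITE_BIT        4
#define EPT_VIOLATION_KERNEL_INSTR_BIT 5
#define EPT_VIOLATION_USER_INSTR_BIT   6

/* Hypothetical container for the decoded EPT permissions. */
struct ept_perms {
    bool read;
    bool write;
    bool kernel_exec;
    bool user_exec;
};

static bool qual_bit(uint64_t qual, unsigned int bit)
{
    return (qual >> bit) & 1;
}

/* Hypothetical decoder; mbec tells whether mode-based execute control is on. */
static struct ept_perms decode_ept_perms(uint64_t qual, bool mbec)
{
    struct ept_perms p;

    p.read = qual_bit(qual, EPT_VIOLATION_READ_BIT);
    p.write = qual_bit(qual, EPT_VIOLATION_WRITE_BIT);
    p.kernel_exec = qual_bit(qual, EPT_VIOLATION_KERNEL_INSTR_BIT);
    /* Without MBEC, bit 6 is undefined and bit 5 covers all execution. */
    p.user_exec = mbec ? qual_bit(qual, EPT_VIOLATION_USER_INSTR_BIT)
                       : p.kernel_exec;
    return p;
}
```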

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-8-...@digikod.net
---
 arch/x86/include/asm/vmx.h      | 11 +--
 arch/x86/kvm/mmu.h              |  3 ++-
 arch/x86/kvm/mmu/mmu.c          |  6 +-
 arch/x86/kvm/mmu/paging_tmpl.h  | 16 ++--
 arch/x86/kvm/mmu/spte.c         |  4 +++-
 arch/x86/kvm/vmx/capabilities.h |  7 +++
 arch/x86/kvm/vmx/nested.c       |  7 +++
 arch/x86/kvm/vmx/vmx.c          | 28 +---
 arch/x86/kvm/vmx/vmx.h          |  1 +
 9 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..452e7d153832 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -511,6 +511,7 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT   (1ull << 6)
 #define VMX_EPT_ACCESS_BIT (1ull << 8)
 #define VMX_EPT_DIRTY_BIT  (1ull << 9)
+#define VMX_EPT_USER_EXECUTABLE_MASK   (1ull << 10)
 #define VMX_EPT_RWX_MASK               (VMX_EPT_READABLE_MASK |       \
                                         VMX_EPT_WRITABLE_MASK |       \
                                         VMX_EPT_EXECUTABLE_MASK)
@@ -556,13 +557,19 @@ enum vm_entry_failure_code {
 #define EPT_VIOLATION_ACC_READ_BIT       0
 #define EPT_VIOLATION_ACC_WRITE_BIT      1
 #define EPT_VIOLATION_ACC_INSTR_BIT      2
-#define EPT_VIOLATION_RWX_SHIFT          3
+#define EPT_VIOLATION_READ_BIT           3
+#define EPT_VIOLATION_WRITE_BIT          4
+#define EPT_VIOLATION_KERNEL_INSTR_BIT   5
+#define EPT_VIOLATION_USER_INSTR_BIT     6
 #define EPT_VIOLATION_GVA_IS_VALID_BIT   7
 #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
 #define EPT_VIOLATION_ACC_READ           (1 << EPT_VIOLATION_ACC_READ_BIT)
 #define EPT_VIOLATION_ACC_WRITE          (1 << EPT_VIOLATION_ACC_WRITE_BIT)
 #define EPT_VIOLATION_ACC_INSTR          (1 << EPT_VIOLATION_ACC_INSTR_BIT)
-#define EPT_V

[PATCH v1 8/9] KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC

2023-05-05 Thread Mickaël Salaün
This change makes it possible to enforce a deny-by-default execution
security policy for guest kernels, leveraged by the Heki implementation.

Create synthetic page faults when an access is denied by Heki.  This
kind of kernel page fault needs to be handled by guests, which is not
currently the case, making it try again and again, but we are working to
calm down such guests by teaching them how to handle such page faults.

The MMU tracepoints are updated to reflect the difference between kernel
and user space executions.

kvm_heki_fix_all_ept_exec_perm() walks through all guest memory pages to
set the configured default execution permissions (i.e. only allow
configured executable memory pages).

The struct heki_mem_range's attribute field now understands
HEKI_ATTR_MEM_EXEC, which allows the related kernel sections to be
executable, and denies any other kernel memory from being executable for
the whole lifetime of the guest.  This obviously can only work with
static kernels and we are exploring ways to handle authenticated and
dynamic kernel memory permission updates.

If the host doesn't have MBEC enabled, the KVM_HC_LOCK_MEM_PAGE_RANGES
hypercall will return -KVM_EOPNOTSUPP and might only apply the previous
ranges, if any.  This is useful to develop this RFC and make sure
execution restrictions are enforced (and not silently ignored), but this
behavior might change in a future patch series.  Guest kernels could
check for MBEC support to not use the HEKI_ATTR_MEM_EXEC attribute.

The number of configurable memory ranges per guest is 16 for now.  This
will change with a follow-up.
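
The deny-by-default rule can be pictured as a lookup over the configured
ranges.  This is only a sketch under stated assumptions: the struct layout, the
attribute values and the function name are hypothetical (the real definitions
live in include/linux/heki.h and arch/x86/kvm/x86.c, which are not shown here),
and it assumes at least one range has already been configured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values; the real attribute bits are not shown in this mail. */
#define SKETCH_ATTR_MEM_NOWRITE (1u << 0)
#define SKETCH_ATTR_MEM_EXEC    (1u << 1)
/* The commit message states a limit of 16 ranges per guest for now. */
#define SKETCH_MAX_RANGES       16

/* Hypothetical stand-in for struct heki_mem_range. */
struct sketch_mem_range {
    uint64_t gfn_start;
    uint64_t gfn_end; /* inclusive */
    unsigned int attributes;
};

/* Kernel execution is allowed only inside a range marked executable. */
static bool sketch_exec_allowed(const struct sketch_mem_range *ranges,
                                size_t count, uint64_t gfn)
{
    for (size_t i = 0; i < count && i < SKETCH_MAX_RANGES; i++) {
        if (!(ranges[i].attributes & SKETCH_ATTR_MEM_EXEC))
            continue;
        if (gfn >= ranges[i].gfn_start && gfn <= ranges[i].gfn_end)
            return true;
    }
    /* Deny by default: no executable range covers this frame. */
    return false;
}
```

A frame inside a NOWRITE-only range is still denied execution, which matches
the idea that .rodata must be neither writable nor executable.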

There are currently some pr_warn() calls to make it easy to test this
code.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-9-...@digikod.net
---
 Documentation/virt/kvm/x86/hypercalls.rst |  4 +-
 arch/x86/kvm/mmu/mmu.c                    | 35 -
 arch/x86/kvm/mmu/mmutrace.h               | 11 ++-
 arch/x86/kvm/mmu/spte.c                   | 19 -
 arch/x86/kvm/mmu/spte.h                   | 15 +++-
 arch/x86/kvm/mmu/tdp_mmu.c                | 73 ++
 arch/x86/kvm/mmu/tdp_mmu.h                |  4 +
 arch/x86/kvm/x86.c                        | 90 ++-
 arch/x86/kvm/x86.h                        |  7 ++
 include/linux/kvm_host.h                  |  4 +
 virt/kvm/kvm_main.c                       |  1 +
 11 files changed, 250 insertions(+), 13 deletions(-)

diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index 8aa5d28986e3..5accf5f6de13 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -204,7 +204,9 @@ must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
 
 The hypercall lets a guest request memory permissions to be removed for itself,
 identified with set of physical page ranges (GFNs).  The HEKI_ATTR_MEM_NOWRITE
-memory page range attribute forbids related modification to the guest.
+memory page range attribute forbids related modification to the guest.  The
+HEKI_ATTR_MEM_EXEC attribute allows execution on the specified pages while
+removing it for all the others.
 
 Returns 0 on success or a KVM error code otherwise.
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a47e63217eb8..56a8bcac1b82 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3313,7 +3313,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
 {
if (fault->exec)
-       return is_executable_pte(spte);
+       return is_executable_pte(spte, !fault->user);
 
if (fault->write)
return is_writable_pte(spte);
@@ -5602,6 +5602,39 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;
 
+   /* Skips real page faults if not needed. */
+   if ((error_code & PFERR_FETCH_MASK) &&
+   !kvm_heki_is_exec_allowed(vcpu, cr2_or_gpa)) {
+   /*
+* TODO: To avoid kvm_heki_is_exec_allowed() call, check
+* enable_mbec and EPT_VIOLATION_KERNEL_INSTR, see
+* handle_ept_violation().
+*/
+   struct x86_exception fault = {
+   .vector = PF_VECTOR,
+   .error_code_valid = true,
+   .error_code = error_code,
+   .nested_page_fault = false,
+   /*
+* TODO: This kind of kernel page fault needs to be handled by
+

[PATCH v1 9/9] virt: Add Heki KUnit tests

2023-05-05 Thread Mickaël Salaün
This adds a new CONFIG_HEKI_TEST option to run tests at boot.  Indeed,
because this patch series forbids the loading of kernel modules after
the boot, we need to make built-in tests.  Furthermore, because we use
some symbols not exported to modules (e.g., kernel_set_to_readonly) this
could not work as modules.

To run these tests, we need to boot the kernel with the heki_test=N boot
argument with N selecting a specific test:
1. heki_test_cr_disable_smep: Check CR pinning and try to disable SMEP.
2. heki_test_write_to_const: Check .rodata (const) protection.
3. heki_test_write_to_ro_after_init: Check __ro_after_init protection.
4. heki_test_exec: Check non-executable kernel memory.

This way of selecting tests should no longer be required once the kernel
properly handles the triggered synthetic page faults.  For now, these
page faults make the kernel loop.

All these tests temporarily disable the related kernel self-protections
and should then fail if Heki doesn't protect the kernel.  They are
verbose to make it easier to understand what is going on.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-10-...@digikod.net
---
 virt/heki/Kconfig |  12 +++
 virt/heki/heki.c  | 194 +-
 2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/virt/heki/Kconfig b/virt/heki/Kconfig
index 96f18ce03013..806981f2b22d 100644
--- a/virt/heki/Kconfig
+++ b/virt/heki/Kconfig
@@ -27,3 +27,15 @@ config HYPERVISOR_SUPPORTS_HEKI
  A hypervisor should select this when it can successfully build
  and run with CONFIG_HEKI. That is, it should provide all of the
  hypervisor support required for the Heki feature.
+
+config HEKI_TEST
+   bool "Tests for Heki" if !KUNIT_ALL_TESTS
+   depends on HEKI && KUNIT=y
+   default KUNIT_ALL_TESTS
+   help
+ Run Heki tests at runtime according to the heki_test=N boot
+ parameter, with N identifying the test to run (between 1 and 4).
+
+ Before launching the init process, the system might not respond
+ because of unhandled kernel page fault.  This will be fixed in a
+ next patch series.
diff --git a/virt/heki/heki.c b/virt/heki/heki.c
index 142b5dc98a2f..361e7734e950 100644
--- a/virt/heki/heki.c
+++ b/virt/heki/heki.c
@@ -5,11 +5,13 @@
  * Copyright © 2023 Microsoft Corporation
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -78,13 +80,201 @@ void __init heki_early_init(void)
heki_arch_init();
 }
 
+#ifdef CONFIG_HEKI_TEST
+
+/* Heki test data */
+
+/* Takes two pages to not change permission of other read-only pages. */
+const char heki_test_const_buf[PAGE_SIZE * 2] = {};
+char heki_test_ro_after_init_buf[PAGE_SIZE * 2] __ro_after_init = {};
+
+long heki_test_exec_data(long);
+void _test_exec_data_end(void);
+
+/* Used to test ROP execution against the .rodata section. */
+/* clang-format off */
+asm(
+".pushsection .rodata;" // NOT .text section
+".global heki_test_exec_data;"
+".type heki_test_exec_data, @function;"
+"heki_test_exec_data:"
+ASM_ENDBR
+"movq %rdi, %rax;"
+"inc %rax;"
+ASM_RET
+".size heki_test_exec_data, .-heki_test_exec_data;"
+"_test_exec_data_end:"
+".popsection");
+/* clang-format on */
+
+static void heki_test_cr_disable_smep(struct kunit *test)
+{
+   unsigned long cr4;
+
+   /* SMEP should be initially enabled. */
+   KUNIT_ASSERT_TRUE(test, __read_cr4() & X86_CR4_SMEP);
+
+   kunit_warn(test,
+  "Starting control register pinning tests with SMEP check\n");
+
+   /*
+* Trying to disable SMEP, bypassing kernel self-protection by not
+* using cr4_clear_bits(X86_CR4_SMEP).
+*/
+   cr4 = __read_cr4() & ~X86_CR4_SMEP;
+   asm volatile("mov %0,%%cr4" : "+r"(cr4) : : "memory");
+
+   /* SMEP should still be enabled. */
+   KUNIT_ASSERT_TRUE(test, __read_cr4() & X86_CR4_SMEP);
+}
+
+static inline void print_addr(struct kunit *test, const char *const buf_name,
+ void *const buf)
+{
+   const pte_t pte = *virt_to_kpte((unsigned long)buf);
+   const phys_addr_t paddr = slow_virt_to_phys(buf);
+   bool present = pte_flags(pte) & (_PAGE_PRESENT);
+   bool accessible = pte_accessible(&init_mm, pte);
+
+   kunit_warn(
+   test,
+   "%s vaddr:%llx paddr:%llx exec:%d write:%d present:%d accessible:%d\n",
+   buf_name, (unsigned long long)buf, paddr, !!pte_exec(pte),
+   !!pte_write(pte), present, acces