On 25.09.25 17:52, Roy, Patrick wrote:
On Thu, 2025-09-25 at 12:00 +0100, David Hildenbrand wrote:
On 24.09.25 17:22, Roy, Patrick wrote:
Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.
To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
guest_memfd itself being supported, but also on whether Linux supports
manipulating the direct map at page granularity at all (possible most of
the time, the outliers being arm64, where it is impossible if the direct
map has been set up using hugepages, as arm64 cannot break these apart
due to break-before-make semantics, and powerpc, which does not select
ARCH_HAS_SET_DIRECT_MAP, though it does not support guest_memfd anyway).
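As a rough illustration (a sketch only, not part of this patch; it assumes the
uapi names introduced by this series), userspace discovery and creation would
look something like:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int create_no_direct_map_gmem(int vm_fd, __u64 size)
  {
          struct kvm_create_guest_memfd gmem = {
                  .size  = size,
                  .flags = GUEST_MEMFD_FLAG_NO_DIRECT_MAP,
          };

          /* Probe the VM for support before requesting the flag. */
          if (!ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP))
                  return -1;

          return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
  }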
Note that this flag causes removal of direct map entries for all
guest_memfd folios independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
entries even for the shared parts of guest_memfd is a special type of
non-CoCo VM where host userspace is trusted to have access to all of
guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated. In this
setup, KVM retains access to guest memory via userspace mappings of
guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.
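For example (sketch only, error handling trimmed; it assumes the guest_memfd
was created with GUEST_MEMFD_FLAG_MMAP so that it can be mmap()ed, and that
the uapi bits from this series are available), such a memslot could be set up
as:

  #include <stddef.h>
  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  static int map_gmem_slot(int vm_fd, int gmem_fd, __u64 gpa, __u64 size)
  {
          struct kvm_userspace_memory_region2 region = {
                  .slot = 0,
                  .flags = KVM_MEM_GUEST_MEMFD,
                  .guest_phys_addr = gpa,
                  .memory_size = size,
                  .guest_memfd = gmem_fd,
                  .guest_memfd_offset = 0,
          };
          void *uaddr;

          /* KVM accesses this memory via userspace_addr, not the direct map. */
          uaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);
          if (uaddr == MAP_FAILED)
                  return -1;

          region.userspace_addr = (__u64)(unsigned long)uaddr;
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
  }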
Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
a gmem folio can be allocated without being mapped anywhere is
kvm_gmem_populate(), where handling potential failures of direct map
removal is not possible (by the time direct map removal is attempted,
the folio is already marked as prepared, meaning attempting to re-try
kvm_gmem_populate() would just result in -EEXIST without fixing up the
direct map state). These folios are then removed from the direct map
in kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
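Roughly, the zap itself then boils down to something like the following sketch
(illustrative only, not the exact code in this patch; everything apart from
set_direct_map_valid_noflush() and the folio/TLB helpers is a made-up name):

  /*
   * Sketch: drop a gmem folio's direct map entries and flush the TLB.
   * The inverse call (valid == true) runs when the folio is freed.
   */
  static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
  {
          unsigned long start = (unsigned long)folio_address(folio);
          int r = set_direct_map_valid_noflush(folio_page(folio, 0),
                                               folio_nr_pages(folio), false);

          if (!r)
                  flush_tlb_kernel_range(start, start + folio_size(folio));

          return r;
  }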
Signed-off-by: Patrick Roy <[email protected]>
---
Documentation/virt/kvm/api.rst | 5 +++
arch/arm64/include/asm/kvm_host.h | 12 ++++++
include/linux/kvm_host.h | 6 +++
include/uapi/linux/kvm.h | 2 +
virt/kvm/guest_memfd.c | 61 ++++++++++++++++++++++++++++++-
virt/kvm/kvm_main.c | 5 +++
6 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c17a87a0a5ac..b52c14d58798 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
supports GUEST_MEMFD_FLAG_MMAP. Setting this flag on guest_memfd creation
enables mmap() and faulting of guest_memfd memory to host userspace.
+When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
+supports GUEST_MEMFG_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
+instance behave similarly to memfd_secret, and unmaps the memory backing it from
+the kernel's address space after allocation.
+
Do we want to document what the implication of that is? Meaning,
limitations etc. I recall that we would need the user mapping for gmem
slots to be properly set up.
Is that still the case in this patch set?
The ->userspace_addr thing is the general requirement for non-CoCo VMs,
and not specific for direct map removal (e.g. I expect direct map
removal to just work out of the box for CoCo setups, where KVM already
cannot access guest memory, ignoring the question of whether direct map
removal is even useful for CoCo VMs). So I don't think it should be
documented as part of
KVM_CAP_GMEM_NO_DIRECT_MAP/GUEST_MEMFG_FLAG_NO_DIRECT_MAP (heh, there's
a typo I just noticed.
Okay, I was rather wondering whether this will be the first patch set
where it is actually required to be set. In the basic mmap series, I am
not sure yet whether we really depend on it (IIRC we did document it,
but we do no sanity checks etc.).
"MEMFG". Also "GMEM" needs to be "GUEST_MEMFD".
Will fix that), but rather as part of GUEST_MEMFD_FLAG_MMAP. I can add a
patch for it there (or maybe send it separately, since FLAG_MMAP is already
in -next?).
Yes, it's in kvm/next and will go upstream soon.
When the KVM MMU performs a PFN lookup to service a guest fault and the backing
guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
consumed from guest_memfd, regardless of whether it is a shared or a private
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2f2394cce24e..0bfd8e5fd9de 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -19,6 +19,7 @@
#include <linux/maple_tree.h>
#include <linux/percpu.h>
#include <linux/psci.h>
+#include <linux/set_memory.h>
#include <asm/arch_gicv3.h>
#include <asm/barrier.h>
#include <asm/cpufeature.h>
@@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0,
u64 *res1);
void check_feature_map(void);
+#ifdef CONFIG_KVM_GUEST_MEMFD
+static inline bool kvm_arch_gmem_supports_no_direct_map(void)
+{
+ /*
+ * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
+ * as it calls dcache_clean_inval_poc().
+ */
+ return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
+}
+#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
+#endif /* CONFIG_KVM_GUEST_MEMFD */
I strongly assume that the aarch64 support should be moved to a separate
patch -- if possible, see below.
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1d0585616aa3..73a15cade54a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
#endif
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
+#endif
Hm, wouldn't it be better to have an opt-in per arch, and really only
unlock the ones we know work (tested, etc.), explicitly in separate patches?
Ack, can definitely do that. Something like
#ifndef kvm_arch_gmem_supports_no_direct_map
static inline bool kvm_arch_gmem_supports_no_direct_map(void)
{
        return false;
}
#endif
and then actual definitions (in separate patches) in the arm64 and x86
headers?
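E.g. for x86, the override could then be as simple as (sketch, assuming x86
only needs can_set_direct_map()):

  #ifdef CONFIG_KVM_GUEST_MEMFD
  #define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
  #endif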
On a related note, maybe PATCH 2 should only export
set_direct_map_valid_noflush() for the architectures on which we
actually need it? Which would only be x86, since arm64 doesn't allow
building KVM as a module, and nothing else supports guest_memfd right
now.
Yes, that's probably best. Could be done in the same arch patch then.
--
Cheers
David / dhildenb