Hello all,

as previously announced here, we are working on an integration that
will expose the HyperV hypervisor to QEMU on Linux hosts. HyperV is a
Type 1 hypervisor with a layered architecture: a "root partition" runs
alongside the VMs, which are "child partitions". The root partition
interfaces with the hypervisor and has access to the hardware.
(https://aka.ms/hypervarch)

The effort to run Linux on such a root partition and expose HyperV to
such a management partition is called "MSHV". Sometimes we refer to the
root partition as "Dom0 Linux".

Today we are targeting nested virtualization, that is: the creation and
management of L2 VMs on an L1 VM (L0 would indicate bare metal).

+-------------+ +----------------+ +--------------+
|             | |                | |              |
| Azure Host  | | L1 Linux Dom0  | | L2 Guest VM  |
|             | |                | |              |
| OS          | |                | |              |
|             | | +------------+ | |              |
|             | | | Qemu VMM   | | |              |
|             | | +------------+ | |              |
|             | | +------------+ | |              |
|             | | | Kernel     | | |              |
|             | | +-----+------+ | |              |
|             | +-------|--------+ +--------------+
|             | +-------v-------------------------+
|             | | Microsoft Hypervisor (L1)       |
+-------------+ +-------+-------------------------+
                        |
+-----------------------v-------------------------+
| Microsoft Hypervisor (L0)                       |
+-------------------------------------------------+
+-------------------------------------------------+
|                                                 |
| Hardware                                        |
|                                                 |
+-------------------------------------------------+

This submission is a port of the existing MSHV integration that is
shipped in Cloud-Hypervisor and the MSHV-specific Rust crates in
rust-vmm. There are various products like AKS Pod Sandboxing and AKS
Confidential Pods built on MSHV and Cloud-Hypervisor.

We hope to achieve a seamless integration into the QEMU accelerator
framework, similar to existing integrations like KVM, HVF or WHPX.

The patch set has been split into chunks that should be applicable and
buildable individually, but only the full set of commits will allow
launching MSHV-accelerated guests on supported kernels and
environments.

The toggle to enable the feature at build time would be:
`./configure --enable-mshv`. When launching a VM, the accelerator
`mshv` can be enabled via `-accel mshv` or `-machine q35,accel=mshv`.

We concluded the porting, but we haven't performed any comprehensive
testing yet.
We opted to send our submission early to receive feedback about the
general structure and potential problems of our integration. Most
likely we will uncover problems during testing and address those in
upcoming revisions of the patch set.

The configuration we are using during development: machine q35 + OVMF +
various recent Linux distros as guests (Fedora 42, Ubuntu 22.04).

We would welcome any feedback around the structure and integration
points that we chose, so we can incorporate it into upcoming revisions.

Some notes/caveats about the initial submission:

- The relevant MSHV kernel code has been accepted for inclusion in the
  upcoming 6.15 release, which should be released shortly. To allow
  building on older kernels we vendored the kernel headers that define
  the MSHV ABI into the patch set. We might remove them in later
  revisions of the patch set, or put them behind a feature toggle. Once
  the kernel is released we plan to publish a preconfigured Azure
  image, which can be used to test the MSHV accelerator.

- QEMU is mapping regions into the guest that might partially overlap
  in their userspace_addr range (e.g. for ROMs in early boot).
  Currently MSHV will reject such overlaps. We are looking into whether
  we can/want to relax that restriction. To work around this we
  maintain a list of mapping references and swap regions in/out if
  there's a GPA fault and we find a valid candidate region in our list
  (see last commit). Maybe there are alternative, less invasive
  approaches; we'd be happy to hear suggestions.

- We noticed that when using SeaBIOS, in certain permutations of guest
  configuration (> 2GB RAM & > 1 virtio-blk-pci devices), we run into
  unmapped GPA errors. We suspect it has to do with SeaBIOS addressing
  memory in the 4GB+ region in those cases. We are investigating, and
  will hopefully be able to issue a fix soon.
  For the time being this can be worked around by using OVMF as
  firmware.

- Since the MSHV accelerator requires a HyperV hypervisor to be
  present, it would make sense to provide testing infrastructure for
  integration testing on Azure. We are looking into options for how to
  implement that.

best,

magnus

Magnus Kulke (25):
  accel: Add Meson and config support for MSHV accelerator
  target/i386/emulate: allow instruction decoding from stream
  target/i386/mshv: Add x86 decoder/emu implementation
  hw/intc: Generalize APIC helper names from kvm_* to accel_*
  include/hw/hyperv: Add MSHV ABI header definitions
  accel/mshv: Add accelerator skeleton
  accel/mshv: Register memory region listeners
  accel/mshv: Initialize VM partition
  accel/mshv: Register guest memory regions with hypervisor
  accel/mshv: Add ioeventfd support
  accel/mshv: Add basic interrupt injection support
  accel/mshv: Add vCPU creation and execution loop
  accel/mshv: Add vCPU signal handling
  target/i386/mshv: Add CPU create and remove logic
  target/i386/mshv: Implement mshv_store_regs()
  target/i386/mshv: Implement mshv_get_standard_regs()
  target/i386/mshv: Implement mshv_get_special_regs()
  target/i386/mshv: Implement mshv_arch_put_registers()
  target/i386/mshv: Set local interrupt controller state
  target/i386/mshv: Register CPUID entries with MSHV
  target/i386/mshv: Register MSRs with MSHV
  target/i386/mshv: Integrate x86 instruction decoder/emulator
  target/i386/mshv: Write MSRs to the hypervisor
  target/i386/mshv: Implement mshv_vcpu_run()
  accel/mshv: Add memory remapping workaround

 accel/Kconfig                    |    3 +
 accel/accel-irq.c                |   95 ++
 accel/meson.build                |    3 +-
 accel/mshv/irq.c                 |  370 +++++++
 accel/mshv/mem.c                 |  434 ++++++++
 accel/mshv/meson.build           |    9 +
 accel/mshv/mshv-all.c            |  731 ++++++++++++
 accel/mshv/msr.c                 |  375 +++++++
 accel/mshv/trace-events          |   20 +
 accel/mshv/trace.h               |    1 +
 hw/intc/apic.c                   |    9 +
 hw/intc/ioapic.c                 |   20 +-
 hw/virtio/virtio-pci.c           |   19 +-
 include/hw/hyperv/hvgdk.h        |   20 +
 include/hw/hyperv/hvhdk.h        |  165 +++
 include/hw/hyperv/hvhdk_mini.h   |  106 ++
 include/hw/hyperv/linux-mshv.h   | 1038 ++++++++++++++++++
 include/system/accel-irq.h       |   26 +
 include/system/mshv.h            |  237 ++++
 meson.build                      |   17 +
 meson_options.txt                |    2 +
 scripts/meson-buildoptions.sh    |    3 +
 target/i386/cpu.h                |    2 +-
 target/i386/emulate/meson.build  |    7 +-
 target/i386/emulate/x86_decode.c |   32 +-
 target/i386/emulate/x86_decode.h |   11 +
 target/i386/emulate/x86_emu.c    |    3 +-
 target/i386/emulate/x86_emu.h    |    1 +
 target/i386/meson.build          |    2 +
 target/i386/mshv/meson.build     |    8 +
 target/i386/mshv/mshv-cpu.c      | 1768 ++++++++++++++++++++++++++++++
 target/i386/mshv/x86.c           |  330 ++++++
 32 files changed, 5841 insertions(+), 26 deletions(-)
 create mode 100644 accel/accel-irq.c
 create mode 100644 accel/mshv/irq.c
 create mode 100644 accel/mshv/mem.c
 create mode 100644 accel/mshv/meson.build
 create mode 100644 accel/mshv/mshv-all.c
 create mode 100644 accel/mshv/msr.c
 create mode 100644 accel/mshv/trace-events
 create mode 100644 accel/mshv/trace.h
 create mode 100644 include/hw/hyperv/hvgdk.h
 create mode 100644 include/hw/hyperv/hvhdk.h
 create mode 100644 include/hw/hyperv/hvhdk_mini.h
 create mode 100644 include/hw/hyperv/linux-mshv.h
 create mode 100644 include/system/accel-irq.h
 create mode 100644 include/system/mshv.h
 create mode 100644 target/i386/mshv/meson.build
 create mode 100644 target/i386/mshv/mshv-cpu.c
 create mode 100644 target/i386/mshv/x86.c

-- 
2.34.1
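
Addendum: for readers who want a quick picture of the memory remapping
workaround mentioned in the caveats (keep a list of mapping references,
swap regions in/out on a GPA fault), here is a rough, simplified model
of the idea. It is not the code from the last commit: the `Slot`
structure and the `slot_add()` / `slot_handle_gpa_fault()` helpers are
hypothetical names made up for this sketch, and the actual map/unmap
calls into the hypervisor are reduced to flipping a `mapped` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping entry; not the structure used in the patches. */
typedef struct {
    uint64_t gpa;            /* guest physical base address */
    uint64_t userspace_addr; /* host virtual base address */
    uint64_t size;           /* length in bytes */
    bool mapped;             /* stands in for "registered with MSHV" */
} Slot;

#define MAX_SLOTS 16
static Slot slots[MAX_SLOTS];
static size_t n_slots;

/* True if the half-open ranges [a, a+alen) and [b, b+blen) intersect. */
static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/*
 * Track a new region. It is marked as mapped unless an already mapped
 * slot overlaps it in userspace_addr space; in that case it is parked
 * in the list until a fault asks for it.
 */
static Slot *slot_add(uint64_t gpa, uint64_t uaddr, uint64_t size)
{
    assert(n_slots < MAX_SLOTS);
    Slot *s = &slots[n_slots++];
    *s = (Slot){ gpa, uaddr, size, true };
    for (size_t i = 0; i + 1 < n_slots; i++) {
        if (slots[i].mapped &&
            ranges_overlap(slots[i].userspace_addr, slots[i].size,
                           uaddr, size)) {
            s->mapped = false;
            break;
        }
    }
    return s;
}

/*
 * Called on an unmapped-GPA fault. Looks for a parked slot covering
 * the faulting GPA, swaps out every mapped slot that overlaps it in
 * userspace_addr space, then swaps the candidate in. Returns true if
 * the faulting access can be retried.
 */
static bool slot_handle_gpa_fault(uint64_t gpa)
{
    Slot *cand = NULL;
    for (size_t i = 0; i < n_slots; i++) {
        Slot *s = &slots[i];
        if (!s->mapped && gpa >= s->gpa && gpa - s->gpa < s->size) {
            cand = s;
            break;
        }
    }
    if (!cand) {
        return false; /* genuinely unmapped GPA */
    }
    for (size_t i = 0; i < n_slots; i++) {
        Slot *s = &slots[i];
        if (s->mapped &&
            ranges_overlap(s->userspace_addr, s->size,
                           cand->userspace_addr, cand->size)) {
            s->mapped = false; /* real code would unmap via the hypervisor */
        }
    }
    cand->mapped = true; /* real code would map via the hypervisor */
    return true;
}
```

In this model, two regions whose host ranges overlap are never mapped
at the same time; whichever one the guest touches last wins, and the
other one stays parked until the next fault on its GPA range.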