Hello all,

as previously announced here, we are working on an integration that will
expose the HyperV hypervisor to QEMU on Linux hosts. HyperV is a Type 1
hypervisor with a layered architecture: a "root partition" interfaces
with the hypervisor and has access to the hardware, while VMs run
alongside it as "child partitions". (https://aka.ms/hypervarch)

The effort to run Linux on such a Root Partition and expose HyperV to
such a management partition is called "MSHV". Sometimes we refer to the
root partition as "Dom0 Linux". Today we are targeting nested
virtualization, that is: the creation + management of L2 VMs on an L1
VM (L0 would indicate bare metal).

+-------------+ +----------------+ +--------------+
|             | |                | |              |
| Azure Host  | | L1 Linux Dom0  | | L2 Guest VM  |
|             | |                | |              |
|     OS      | |                | |              |
|             | | +------------+ | |              |
|             | | |  QEMU VMM  | | |              |
|             | | +------------+ | |              |
|             | | +------------+ | |              |
|             | | |   Kernel   | | |              |
|             | | +-----+------+ | |              |
|             | +-------|--------+ +--------------+
|             | +-------v-------------------------+
|             | |    Microsoft Hypervisor (L1)    |
+-------------+ +-------+-------------------------+
                        |
+-----------------------v-------------------------+
|            Microsoft Hypervisor (L0)            |
+-------------------------------------------------+

+-------------------------------------------------+
|                                                 |
|                    Hardware                     |
|                                                 |
+-------------------------------------------------+

This submission is a port of the existing MSHV integration that is
shipped in Cloud-Hypervisor and MSHV-specific rust-crates in rust-vmm.
There are various products like AKS Pod Sandboxing and AKS Confidential
Pods built on MSHV and Cloud-Hypervisor. We hope to achieve a seamless
integration into the QEMU accelerator framework, similar to existing
integrations like KVM, HVF or WHPX.

The patch set has been split into chunks that should be applicable and
buildable individually, but only the full set of commits will allow
launching MSHV-accelerated guests on supported kernels and environments.

The toggle to enable the feature at build time would be: `./configure
--enable-mshv`.

When launching a VM, the accelerator `mshv` can be enabled via

`-accel mshv` or `-machine q35,accel=mshv`.

We concluded the porting, but we haven't performed any comprehensive
testing yet. We opted to send our submission early to receive feedback
about the general structure and potential problems of our integration.
Most likely we will uncover problems during testing and address those in
upcoming revisions of the patch set.

The configuration we are using during development:

machine q35 + OVMF + various recent Linux distros as guests (Fedora
42, Ubuntu 22.04)

We would welcome any feedback on the structure and integration points
that we chose, so we can incorporate it into upcoming revisions.

Some notes/caveats about the initial submission:

- The relevant MSHV kernel code has been accepted for inclusion in the
  upcoming 6.15 release, which should be released shortly. To allow
  building on older kernels we vendored the kernel headers that define
  the MSHV ABI into the patch set. We might remove them in later
  revisions of the patch set, or put them behind a feature toggle. Once
  the kernel is released we plan to publish a preconfigured Azure
  image, which can be used to test the MSHV accelerator.

- QEMU maps regions into the guest that might partially overlap in
  their userspace_addr range (e.g. for ROMs in early boot). Currently
  MSHV will reject such overlaps. We are looking into whether we
  can/want to relax that restriction. To work around this we maintain a
  list of mapping references and swap regions in/out if there's a GPA
  fault and we find a valid candidate region in our list (see the last
  commit). Maybe there are alternative, less invasive approaches. We'd
  be happy to hear suggestions.

- We noticed that when using SeaBIOS, in certain permutations of guest
  configuration (> 2 GB RAM and > 1 virtio-blk-pci device), we run into
  unmapped GPA errors. We suspect it has to do with SeaBIOS addressing
  memory in the 4GB+ region in those cases. We are investigating, and
  will hopefully be able to issue a fix soon. For the time being this
  can be worked around by using OVMF as firmware.

- Since the MSHV accelerator requires a HyperV hypervisor to be present,
  it would make sense to provide testing infrastructure for integration
  testing on Azure. We are looking into options for implementing that.
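
To make the discussion of the remapping workaround (second note above)
a bit more concrete, here is a minimal, self-contained sketch of the
idea in C. It is not the code from the last commit: the `Region` struct
and the helpers `contains_gpa()`, `uaddr_overlap()` and
`remap_on_gpa_fault()` are hypothetical names made up for illustration,
and the actual map/unmap requests to the hypervisor are reduced to
flipping a `mapped` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping entry, not the structure used in the series. */
typedef struct {
    uint64_t gpa;            /* guest physical start address */
    uint64_t userspace_addr; /* host virtual start address */
    uint64_t size;           /* length in bytes */
    bool mapped;             /* currently registered with the hypervisor? */
} Region;

/* True if gpa lies inside the guest physical range of r. */
static bool contains_gpa(const Region *r, uint64_t gpa)
{
    return gpa >= r->gpa && gpa - r->gpa < r->size;
}

/* True if the userspace_addr ranges of a and b intersect. */
static bool uaddr_overlap(const Region *a, const Region *b)
{
    return a->userspace_addr < b->userspace_addr + b->size &&
           b->userspace_addr < a->userspace_addr + a->size;
}

/*
 * On a GPA fault: look for a known but currently unmapped region that
 * covers the faulting address. If one exists, swap out every mapped
 * region whose userspace_addr range overlaps it, then swap it in.
 * Returns true if the fault was resolved and the access can be retried.
 */
static bool remap_on_gpa_fault(Region *regions, size_t n, uint64_t fault_gpa)
{
    Region *candidate = NULL;

    assert(regions != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!regions[i].mapped && contains_gpa(&regions[i], fault_gpa)) {
            candidate = &regions[i];
            break;
        }
    }
    if (candidate == NULL) {
        return false; /* genuinely unmapped GPA */
    }

    for (size_t i = 0; i < n; i++) {
        if (regions[i].mapped && uaddr_overlap(&regions[i], candidate)) {
            regions[i].mapped = false; /* real code: unmap in hypervisor */
        }
    }
    candidate->mapped = true; /* real code: map in hypervisor */
    return true;
}
```

In this sketch a fault on a GPA that no listed region covers is
reported as unresolved, which corresponds to the "unmapped GPA" error
case; everything else is handled by swapping the overlapping mappings.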

best,

magnus

Magnus Kulke (25):
  accel: Add Meson and config support for MSHV accelerator
  target/i386/emulate: allow instruction decoding from stream
  target/i386/mshv: Add x86 decoder/emu implementation
  hw/intc: Generalize APIC helper names from kvm_* to accel_*
  include/hw/hyperv: Add MSHV ABI header definitions
  accel/mshv: Add accelerator skeleton
  accel/mshv: Register memory region listeners
  accel/mshv: Initialize VM partition
  accel/mshv: Register guest memory regions with hypervisor
  accel/mshv: Add ioeventfd support
  accel/mshv: Add basic interrupt injection support
  accel/mshv: Add vCPU creation and execution loop
  accel/mshv: Add vCPU signal handling
  target/i386/mshv: Add CPU create and remove logic
  target/i386/mshv: Implement mshv_store_regs()
  target/i386/mshv: Implement mshv_get_standard_regs()
  target/i386/mshv: Implement mshv_get_special_regs()
  target/i386/mshv: Implement mshv_arch_put_registers()
  target/i386/mshv: Set local interrupt controller state
  target/i386/mshv: Register CPUID entries with MSHV
  target/i386/mshv: Register MSRs with MSHV
  target/i386/mshv: Integrate x86 instruction decoder/emulator
  target/i386/mshv: Write MSRs to the hypervisor
  target/i386/mshv: Implement mshv_vcpu_run()
  accel/mshv: Add memory remapping workaround

 accel/Kconfig                    |    3 +
 accel/accel-irq.c                |   95 ++
 accel/meson.build                |    3 +-
 accel/mshv/irq.c                 |  370 +++++++
 accel/mshv/mem.c                 |  434 ++++++++
 accel/mshv/meson.build           |    9 +
 accel/mshv/mshv-all.c            |  731 ++++++++++++
 accel/mshv/msr.c                 |  375 +++++++
 accel/mshv/trace-events          |   20 +
 accel/mshv/trace.h               |    1 +
 hw/intc/apic.c                   |    9 +
 hw/intc/ioapic.c                 |   20 +-
 hw/virtio/virtio-pci.c           |   19 +-
 include/hw/hyperv/hvgdk.h        |   20 +
 include/hw/hyperv/hvhdk.h        |  165 +++
 include/hw/hyperv/hvhdk_mini.h   |  106 ++
 include/hw/hyperv/linux-mshv.h   | 1038 ++++++++++++++++++
 include/system/accel-irq.h       |   26 +
 include/system/mshv.h            |  237 ++++
 meson.build                      |   17 +
 meson_options.txt                |    2 +
 scripts/meson-buildoptions.sh    |    3 +
 target/i386/cpu.h                |    2 +-
 target/i386/emulate/meson.build  |    7 +-
 target/i386/emulate/x86_decode.c |   32 +-
 target/i386/emulate/x86_decode.h |   11 +
 target/i386/emulate/x86_emu.c    |    3 +-
 target/i386/emulate/x86_emu.h    |    1 +
 target/i386/meson.build          |    2 +
 target/i386/mshv/meson.build     |    8 +
 target/i386/mshv/mshv-cpu.c      | 1768 ++++++++++++++++++++++++++++++
 target/i386/mshv/x86.c           |  330 ++++++
 32 files changed, 5841 insertions(+), 26 deletions(-)
 create mode 100644 accel/accel-irq.c
 create mode 100644 accel/mshv/irq.c
 create mode 100644 accel/mshv/mem.c
 create mode 100644 accel/mshv/meson.build
 create mode 100644 accel/mshv/mshv-all.c
 create mode 100644 accel/mshv/msr.c
 create mode 100644 accel/mshv/trace-events
 create mode 100644 accel/mshv/trace.h
 create mode 100644 include/hw/hyperv/hvgdk.h
 create mode 100644 include/hw/hyperv/hvhdk.h
 create mode 100644 include/hw/hyperv/hvhdk_mini.h
 create mode 100644 include/hw/hyperv/linux-mshv.h
 create mode 100644 include/system/accel-irq.h
 create mode 100644 include/system/mshv.h
 create mode 100644 target/i386/mshv/meson.build
 create mode 100644 target/i386/mshv/mshv-cpu.c
 create mode 100644 target/i386/mshv/x86.c

-- 
2.34.1