On 2025/8/22 14:40, Zhenzhong Duan wrote:
Hi,

For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table; instead the stage-1 page table is passed to the host
side to construct a nested domain. There was some effort to enable this
feature in the past, see [1] for details.

The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the stage-1 address translation. Along
with it, any modification of present mappings in the guest I/O page table must
be followed by an IOTLB invalidation (a minimal invalidation sketch follows
the diagram).

         .-------------.  .---------------------------.
         |   vIOMMU    |  | Guest I/O page table      |
         |             |  '---------------------------'
         .----------------/
         | PASID Entry |--- PASID cache flush --+
         '-------------'                        |
         |             |                        V
         |             |           I/O page table pointer in GPA
         '-------------'
     Guest
     ------| Shadow |---------------------------|--------
           v        v                           v
     Host
         .-------------.  .------------------------.
         |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
         |             |  '------------------------'
         .----------------/  |
         | PASID Entry |     V (Nested xlate)
         '----------------\.--------------------------------------.
         |             |   | Stage2 for GPA->HPA, unmanaged domain|
         |             |   '--------------------------------------'
         '-------------'
For historical reasons, different VT-d spec revisions use different namings,
where:
  - Stage1 = First stage = First level = flts
  - Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>
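
On the host side, forwarding a guest stage-1 invalidation maps to the
iommufd IOMMU_HWPT_INVALIDATE ioctl. Below is a minimal sketch, assuming
a Linux host with iommufd support; 'iommufd' and 'nested_hwpt_id' are
placeholders for the backend fd and the nested hwpt allocated for the
device:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Flush the host's cached copies of guest stage-1 mappings for a range. */
    static int flush_stage1_iotlb(int iommufd, uint32_t nested_hwpt_id,
                                  uint64_t addr, uint64_t npages)
    {
        struct iommu_hwpt_vtd_s1_invalidate inv = {
            .addr = addr,
            .npages = npages,
            .flags = 0,     /* or IOMMU_VTD_INV_FLAGS_LEAF */
        };
        struct iommu_hwpt_invalidate cmd = {
            .size = sizeof(cmd),
            .hwpt_id = nested_hwpt_id,
            .data_uptr = (uintptr_t)&inv,
            .data_type = IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
            .entry_len = sizeof(inv),
            .entry_num = 1,
        };

        return ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);
    }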

This series reuses the VFIO device's default hwpt as the nested parent
instead of creating a new one; the nested child is then a plain hwpt
allocation against that parent (see the sketch after the limitations
below). This avoids duplicating code in a new memory listener, and all
existing features of the VFIO listener can be shared, e.g., RAM discard,
dirty tracking, etc. There are two limitations: 1) a VFIO device under a
PCI bridge together with an emulated device is not supported, because the
emulated device wants the IOMMU address space while the VFIO device
sticks to the system address space;

should we document this limitation somewhere?

2) kexec or reboot from
"intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported,
because the VFIO device's default hwpt is created with the NEST_PARENT
flag, and the kernel inhibits RO mappings when switching to shadow mode.

how does the guest know about this limitation and hold off such attempts?
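
Reusing the default hwpt as the nested parent means the per-device nested
domain boils down to an IOMMU_HWPT_ALLOC with pt_id naming that parent. A
minimal sketch, assuming a Linux host with iommufd; the ids, pgtbl_gpa
and addr_width arguments are placeholders:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Allocate a nested (stage-1) hwpt whose parent is the VFIO device's
     * default hwpt (created with IOMMU_HWPT_ALLOC_NEST_PARENT). */
    static int alloc_nested_hwpt(int iommufd, uint32_t dev_id,
                                 uint32_t parent_hwpt_id,
                                 uint64_t pgtbl_gpa, uint32_t addr_width,
                                 uint32_t *out_hwpt_id)
    {
        struct iommu_hwpt_vtd_s1 vtd = {
            .pgtbl_addr = pgtbl_gpa,   /* guest stage-1 table root, in GPA */
            .addr_width = addr_width,  /* e.g. 48 for 4-level flts */
        };
        struct iommu_hwpt_alloc alloc = {
            .size = sizeof(alloc),
            .dev_id = dev_id,
            .pt_id = parent_hwpt_id,   /* default hwpt reused as parent */
            .data_type = IOMMU_HWPT_DATA_VTD_S1,
            .data_len = sizeof(vtd),
            .data_uptr = (uintptr_t)&vtd,
        };
        int ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc);

        if (!ret) {
            *out_hwpt_id = alloc.out_hwpt_id;
        }
        return ret;
    }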


This series is also prerequisite work for vSVA, i.e., sharing guest
application address space with passthrough devices.

There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
   subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
   instance with the vIOMMU at VFIO device realize time (see the sketch
   after this list).
* vIOMMU registers PCIIOMMUOps get_viommu_cap with the PCI subsystem.
   VFIO calls it to query the capabilities exposed by the vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
   to attach/detach the device to/from an IOMMUFD-backed domain, nested
   or not (see the ioctl sketch after the diagram).
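
As a rough sketch of the first two interactions (not the series' exact
code; PCIIOMMUOps is QEMU's interface from include/hw/pci/pci.h, and
get_viommu_cap is the new op this series proposes):

    /* Within QEMU's tree: the vIOMMU wires its callbacks once per PCI
     * bus.  Handler names mirror hw/i386/intel_iommu.c. */
    static PCIIOMMUOps vtd_iommu_ops = {
        .get_address_space  = vtd_host_dma_iommu,
        .set_iommu_device   = vtd_dev_set_iommu_device,
        .unset_iommu_device = vtd_dev_unset_iommu_device,
        .get_viommu_cap     = vtd_get_viommu_cap,  /* added by this series */
    };

    /* in the vIOMMU realize path: */
    pci_setup_iommu(bus, &vtd_iommu_ops, s);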

See below diagram:

         VFIO Device                                 Intel IOMMU
     .-----------------.                         .-------------------.
     |                 |                         |                   |
     |       .---------|PCIIOMMUOps              |.-------------.    |
     |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
     |       | Device  |------------------------>|| Device list |    |
     |       .---------|(get_viommu_cap)         |.-------------.    |
     |                 |                         |       |           |
     |                 |                         |       V           |
     |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
     |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
     |       | link    |<------------------------|  |   Device    |  |
     |       .---------|            (detach_hwpt)|  .-------------.  |
     |                 |                         |       |           |
     |                 |                         |       ...         |
     .-----------------.                         .-------------------.
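
On a Linux host, the [at|de]tach_hwpt calls ultimately land on the VFIO
cdev ioctls. A minimal sketch, assuming linux/vfio.h; 'devfd' is the VFIO
device fd and 'hwpt_id' may name either the nested hwpt or the default
parent hwpt:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Attach the device to a hwpt (nested or not). */
    static int attach_hwpt(int devfd, uint32_t hwpt_id)
    {
        struct vfio_device_attach_iommufd_pt attach = {
            .argsz = sizeof(attach),
            .pt_id = hwpt_id,
        };
        return ioctl(devfd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
    }

    /* Detach the device; the kernel puts it in a blocking DMA state. */
    static int detach_hwpt(int devfd)
    {
        struct vfio_device_detach_iommufd_pt detach = {
            .argsz = sizeof(detach),
        };
        return ioctl(devfd, VFIO_DEVICE_DETACH_IOMMUFD_PT, &detach);
    }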

Below is an example of enabling stage-1 translation for a passthrough device:

     -M q35,...
     -device intel-iommu,x-scalable-mode=on,x-flts=on...
     -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

PATCH1-7:   Some preparatory work
PATCH8-9:   Compatibility check between vIOMMU and host IOMMU
PATCH10-18: Implement stage-1 page table for passthrough device
PATCH19-20: Workaround for ERRATA_772415_SPR17
PATCH21:    Enable stage-1 translation for passthrough device

QEMU code can be found at [2].

Fault reporting isn't supported in this series; we presume the guest kernel
always constructs correct stage-1 page tables for passthrough devices. For
emulated devices, the emulation code already provides stage-1 fault injection.

just call out that this series is limited to gIOVA usage so far; vSVA
comes later. :)

Regards,
Yi Liu
