Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <[email protected]>
>Subject: Re: [PATCH v9 00/19] intel_iommu: Enable first stage translation for passthrough device
>
>Hello Zhenzhong
>
>On 12/15/25 07:50, Zhenzhong Duan wrote:
>> Hi,
>>
>> Based on Cédric's suggestions[1], The nesting series v8 is split to
>> "base nesting series" + "ERRATA_772415_SPR17 quirk series", this is the
>> base nesting series.
>>
>> For passthrough device with intel_iommu.x-flts=on, we don't do shadowing of
>> guest page table but pass first stage page table to host side to construct a
>> nested HWPT. There was some effort to enable this feature in old days, see
>> [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation (also known as
>> IOMMU nested translation) capability in host IOMMU. As the below diagram shows,
>> guest I/O page table pointer in GPA (guest physical address) is passed to host
>> and be used to perform the first stage address translation. Along with it,
>> modifications to present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>          .-------------.  .---------------------------.
>>          |   vIOMMU    |  | Guest I/O page table      |
>>          |             |  '---------------------------'
>>          .----------------/
>>          | PASID Entry |--- PASID cache flush --+
>>          '-------------'                        |
>>          |             |                        V
>>          |             |           I/O page table pointer in GPA
>>          '-------------'
>>      Guest
>>      ------| Shadow |---------------------------|--------
>>            v        v                           v
>>      Host
>>          .-------------.  .-----------------------------.
>>          |   pIOMMU    |  | First stage for GIOVA->GPA  |
>>          |             |  '-----------------------------'
>>          .----------------/  |
>>          | PASID Entry |     V (Nested xlate)
>>          '----------------\.--------------------------------------------.
>>          |             |   | Second stage for GPA->HPA, unmanaged domain|
>>          |             |   '--------------------------------------------'
>>          '-------------'
>> <Intel VT-d Nested translation>
>>
>> This series reuse VFIO device's default HWPT as nesting parent instead of
>> creating new one. This way avoids duplicate code of a new memory listener,
>> all existing feature from VFIO listener can be shared, e.g., ram discard,
>> dirty tracking, etc. Two limitations are: 1) not supporting VFIO device
>> under a PCI bridge with emulated device, because emulated device wants
>> IOMMU AS and VFIO device stick to system AS; 2) not supporting kexec or
>> reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" on platform
>> with ERRATA_772415_SPR17, because VFIO device's default HWPT is created
>> with NEST_PARENT flag, kernel inhibit RO mappings when switch to shadow
>> mode.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing guest
>> application address space with passthrough devices.
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>    instance to vIOMMU at vfio device realize stage.
>> * vIOMMU registers PCIIOMMUOps get_viommu_flags to PCI subsystem.
>>    VFIO calls it to get vIOMMU exposed flags.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>    to bind/unbind device to IOMMUFD backed domains, either nested
>>    domain or not.
>>
>> See below diagram:
>>
>>          VFIO Device                                 Intel IOMMU
>>      .-----------------.                         .-------------------.
>>      |                 |                         |                   |
>>      |       .---------|PCIIOMMUOps              |.-------------.    |
>>      |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>>      |       | Device  |------------------------>|| Device list |    |
>>      |       .---------|(get_viommu_flags)       |.-------------.    |
>>      |                 |                         |       |           |
>>      |                 |                         |       V           |
>>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>      |       | link    |<------------------------|  |   Device    |  |
>>      |       .---------|            (detach_hwpt)|  .-------------.  |
>>      |                 |                         |       |           |
>>      |                 |                         |       ...         |
>>      .-----------------.                         .-------------------.
>>
>> Below is an example to enable first stage translation for passthrough device:
>>
>>      -M q35,...
>>      -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>      -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>What about libvirt support ? There are patches to enable IOMMUFD
>support with device assignment but I don't see anything related
>to first stage translation. Is there a plan ?

I think IOMMUFD support in libvirt is non-trivial, so it is good to know there
is progress. However, I didn't find a match in the libvirt mailing list archive:
https://lists.libvirt.org/archives/search?q=iommufd
Do you have a link?

I think first stage support itself is trivial; it only needs a new property
(<...x-flts=on/off>) to be exposed, see the sketch below.
I can ask my manager for some time to work on it after this series is merged.
Anyone who is interested is also welcome to take it.
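For reference, what I have in mind on the libvirt side is roughly the
following. The 'flts' attribute name is purely hypothetical (it is not an
existing libvirt option) and would simply map to QEMU's x-flts property:

    <iommu model='intel'>
      <driver intremap='on' caching_mode='on' flts='on'/>
    </iommu>

which libvirt would then translate into something like:

    -device intel-iommu,intremap=on,caching-mode=on,x-scalable-mode=on,x-flts=on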

>
>This raises a question. Should flts support be automatically enabled
>based on the availability of an IOMMUFD backend ?

Yes. If the user doesn't force it off (e.g. <...iommufd='off'>) and an IOMMUFD
backend is available, we can enable it automatically; a rough sketch of that
policy is below.
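Roughly, the policy would look like this (plain illustrative C, none of these
names are existing libvirt or QEMU symbols):

    #include <stdbool.h>

    /* Decide whether to pass x-flts=on to QEMU.  Purely illustrative. */
    static bool want_flts(bool user_forced_off,   /* e.g. <...iommufd='off'> or flts explicitly off */
                          bool user_forced_on,    /* flts explicitly requested */
                          bool iommufd_available) /* an IOMMUFD backend can be used */
    {
        if (user_forced_off)
            return false;        /* explicit opt-out always wins */
        if (user_forced_on)
            return true;
        /* default: follow the availability of the IOMMUFD backend */
        return iommufd_available;
    }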

>
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>> - migration with QAT passthrough
>
>Did you do any experiments with active mlx5 VFs ?

No, only a few device drivers support VFIO migration so far, and QAT is the
only hardware we have. Let me know if you see issues with other devices.

Thanks
Zhenzhong
