Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <[email protected]>
>Subject: Re: [PATCH v9 00/19] intel_iommu: Enable first stage translation
>for passthrough device
>
>Hello Zhenzhong
>
>On 12/15/25 07:50, Zhenzhong Duan wrote:
>> Hi,
>>
>> Based on Cédric's suggestions[1], the nesting series v8 was split into
>> "base nesting series" + "ERRATA_772415_SPR17 quirk series"; this is the
>> base nesting series.
>>
>> For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
>> guest page table but instead pass the first stage page table to the host
>> side to construct a nested HWPT. There was some effort to enable this
>> feature in the past, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation (also known
>> as IOMMU nested translation) capability of the host IOMMU. As the diagram
>> below shows, the guest I/O page table pointer in GPA (guest physical
>> address) is passed to the host and used to perform the first stage
>> address translation. Along with it, modifications to present mappings in
>> the guest I/O page table must be followed by an IOTLB invalidation.
>>
>>     .-------------.  .---------------------------.
>>     |   vIOMMU    |  | Guest I/O page table      |
>>     |             |  '---------------------------'
>>     .----------------/
>>     | PASID Entry |--- PASID cache flush --+
>>     '-------------'                        |
>>     |             |                        V
>>     |             |           I/O page table pointer in GPA
>>     '-------------'
>> Guest
>> ------| Shadow |---------------------------|--------
>>       v        v                           v
>> Host
>>     .-------------.  .-----------------------------.
>>     |   pIOMMU    |  | First stage for GIOVA->GPA  |
>>     |             |  '-----------------------------'
>>     .----------------/  |
>>     | PASID Entry |     V (Nested xlate)
>>     '----------------\.--------------------------------------------.
>>     |             |  | Second stage for GPA->HPA, unmanaged domain |
>>     |             |  '--------------------------------------------'
>>     '-------------'
>>         <Intel VT-d Nested translation>
>>
>> This series reuses the VFIO device's default HWPT as the nesting parent
>> instead of creating a new one. This avoids duplicating the code of a new
>> memory listener, and all existing features of the VFIO listener can be
>> shared, e.g., ram discard, dirty tracking, etc. Two limitations are:
>> 1) no support for a VFIO device under a PCI bridge together with an
>> emulated device, because the emulated device wants the IOMMU AS while the
>> VFIO device sticks to the system AS; 2) no support for kexec or reboot
>> from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" on platforms with
>> ERRATA_772415_SPR17, because the VFIO device's default HWPT is created
>> with the NEST_PARENT flag and the kernel inhibits RO mappings when
>> switching to shadow mode.
>>
>> This series is also prerequisite work for vSVA, i.e. sharing a guest
>> application's address space with passthrough devices.
>>
>> There are some interactions between VFIO and the vIOMMU:
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
>>   subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>>   instance with the vIOMMU at vfio device realize stage.
>> * vIOMMU registers PCIIOMMUOps get_viommu_flags with the PCI subsystem.
>>   VFIO calls it to get the flags exposed by the vIOMMU.
>> * vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>   to bind/unbind a device to IOMMUFD backed domains, nested or not.
>>
>> See the diagram below:
>>
>>      VFIO Device                                 Intel IOMMU
>>  .-----------------.                        .-------------------.
>>  |                 |                        |                   |
>>  |       .---------|PCIIOMMUOps             |.-------------.    |
>>  |       | IOMMUFD |(set/unset_iommu_device)|| Host IOMMU  |    |
>>  |       | Device  |----------------------->|| Device list |    |
>>  |       .---------|(get_viommu_flags)      |.-------------.    |
>>  |                 |                        |       |           |
>>  |                 |                        |       V           |
>>  |       .---------|HostIOMMUDeviceIOMMUFD  |.-------------.    |
>>  |       | IOMMUFD |           (attach_hwpt)|| Host IOMMU  |    |
>>  |       | link    |<-----------------------||   Device    |    |
>>  |       .---------|           (detach_hwpt)|.-------------.    |
>>  |                 |                        |       |           |
>>  |                 |                        |      ...           |
>>  .-----------------.                        .-------------------.
>>
>> Below is an example of enabling first stage translation for a
>> passthrough device:
>>
>>   -M q35,...
>>   -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>   -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>What about libvirt support ? There are patches to enable IOMMUFD
>support with device assignment but I don't see anything related
>to first stage translation. Is there a plan ?
I think IOMMUFD support in libvirt is non-trivial; good to know there is
progress. But I didn't find a match on the libvirt mailing list:
https://lists.libvirt.org/archives/search?q=iommufd
Do you have a link?

I think first stage support is trivial, it only needs to support a new
property <...x-flts=on/off>. I can ask my manager for some time to work on
it after this series is merged. Anyone interested is also welcome to take
it.

>
>This raises a question. Should flts support be automatically enabled
>based on the availability of an IOMMUFD backend ?

Yes, if the user doesn't force it off, like <...iommufd='off'>, and an
IOMMUFD backend is available, we can enable it automatically.

>
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>> - migration with QAT passthrough
>
>Did you do any experiments with active mlx5 VFs ?

No, there are only a few device drivers supporting VFIO migration and we
only have QAT. Let me know if you see issues on other devices.

Thanks
Zhenzhong
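P.S. For readers less familiar with the series, the VFIO/vIOMMU handshake
described in the quoted cover letter can be modeled with a small,
self-contained C sketch. This is only an illustration, not the actual QEMU
code: the struct layouts, the VIOMMU_FLAG_WANT_NESTING_PARENT value, and
the vfio_realize()/iommufd_attach_hwpt() helpers are hypothetical
stand-ins; only the callback names (set/unset_iommu_device,
get_viommu_flags, attach_hwpt/detach_hwpt) come from the series
description.

```c
/* Toy model of the VFIO <-> vIOMMU handshake. All definitions here are
 * simplified stand-ins, not the real QEMU declarations. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag a vIOMMU could expose via get_viommu_flags(). */
#define VIOMMU_FLAG_WANT_NESTING_PARENT (1u << 0)

typedef struct HostIOMMUDevice {
    const char *name;
    bool attached_to_nested_hwpt;
} HostIOMMUDevice;

/* Stand-in for PCIIOMMUOps: callbacks the vIOMMU registers with the PCI
 * subsystem and VFIO invokes at device realize/unrealize time. */
typedef struct PCIIOMMUOps {
    int (*set_iommu_device)(HostIOMMUDevice *dev);
    void (*unset_iommu_device)(HostIOMMUDevice *dev);
    unsigned (*get_viommu_flags)(void);
} PCIIOMMUOps;

/* Toy vIOMMU state: a fixed-size "Host IOMMU Device list". */
#define DEVICE_LIST_LEN 8
static HostIOMMUDevice *g_device_list[DEVICE_LIST_LEN];

static int viommu_set_iommu_device(HostIOMMUDevice *dev)
{
    for (size_t i = 0; i < DEVICE_LIST_LEN; i++) {
        if (!g_device_list[i]) {
            g_device_list[i] = dev;   /* track the device in the list */
            return 0;
        }
    }
    return -1;                        /* list full */
}

static void viommu_unset_iommu_device(HostIOMMUDevice *dev)
{
    for (size_t i = 0; i < DEVICE_LIST_LEN; i++) {
        if (g_device_list[i] == dev) {
            g_device_list[i] = NULL;
        }
    }
}

static unsigned viommu_get_viommu_flags(void)
{
    /* Advertise that guest first stage page tables can be nested. */
    return VIOMMU_FLAG_WANT_NESTING_PARENT;
}

static const PCIIOMMUOps viommu_ops = {
    .set_iommu_device   = viommu_set_iommu_device,
    .unset_iommu_device = viommu_unset_iommu_device,
    .get_viommu_flags   = viommu_get_viommu_flags,
};

/* Stand-in for the HostIOMMUDeviceIOMMUFD attach_hwpt/detach_hwpt path:
 * moves the device onto (or off) a nested HWPT. */
static int iommufd_attach_hwpt(HostIOMMUDevice *dev, bool nested)
{
    dev->attached_to_nested_hwpt = nested;
    return 0;
}

/* VFIO device realize, reduced to the handshake order: register with the
 * vIOMMU, query its flags, then attach to a nested HWPT if wanted. */
static int vfio_realize(HostIOMMUDevice *dev)
{
    if (viommu_ops.set_iommu_device(dev) < 0) {
        return -1;
    }
    if (viommu_ops.get_viommu_flags() & VIOMMU_FLAG_WANT_NESTING_PARENT) {
        return iommufd_attach_hwpt(dev, true);
    }
    return 0;
}
```

The point of the callback indirection is that VFIO never needs to know
which vIOMMU model it is talking to; it only consumes the ops table the
vIOMMU registered, which is why the same handshake works for nested and
non-nested domains.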
