>-----Original Message-----
>From: Cédric Le Goater <c...@redhat.com>
>Subject: Re: [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>On 5/21/25 13:14, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is 2nd part focusing on passthrough device. We don't do
>> shadowing of guest page table for passthrough device but pass stage-1
>> page table to host side to construct a nested domain. There was some
>> effort to enable this feature in old days, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>         .-------------.  .---------------------------.
>>         |   vIOMMU    |  | Guest I/O page table      |
>>         |             |  '---------------------------'
>>         .----------------/
>>         | PASID Entry |--- PASID cache flush --+
>>         '-------------'                        |
>>         |             |                        V
>>         |             |           I/O page table pointer in GPA
>>         '-------------'
>>     Guest
>>     ------| Shadow |---------------------------|--------
>>           v        v                           v
>>     Host
>>         .-------------.  .------------------------.
>>         |   pIOMMU    |  |  Stage1 for GIOVA->GPA |
>>         |             |  '------------------------'
>>         .----------------/  |
>>         | PASID Entry |     V (Nested xlate)
>>         '----------------\.--------------------------------------.
>>         |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>         |             |   '--------------------------------------'
>>         '-------------'
>> For history reason, there are different namings in different VTD spec rev,
>> Where:
>>  - Stage1 = First stage = First level = flts
>>  - Stage2 = Second stage = Second level = slts
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>   instance to vIOMMU at vfio device realize stage.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>   to bind/unbind device to IOMMUFD backed domains, either nested
>>   domain or not.
>>
>> See below diagram:
>>
>>         VFIO Device                                 Intel IOMMU
>>     .-----------------.                         .-------------------.
>>     |                 |                         |                   |
>>     |       .---------|PCIIOMMUOps              |.-------------.    |
>>     |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>     |       | Device  |------------------------>|| Device list |    |
>>     |       .---------|(unset_iommu_device)     |.-------------.    |
>>     |                 |                         |       |           |
>>     |                 |                         |       V           |
>>     |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>     |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>     |       | link    |<------------------------|  |   Device    |  |
>>     |       .---------|            (detach_hwpt)|  .-------------.  |
>>     |                 |                         |       |           |
>>     |                 |                         |       ...         |
>>     .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
>> whenever possible and create new one on demand, also supports multiple
>> iommufd objects and ERRATA_772415.
>>
>> E.g., Under one guest's scope, Stage-2 page table could be shared by different
>> devices if there is no conflict and devices link to same iommufd object,
>> i.e. devices under same host IOMMU can share same stage-2 page table. If there
>> is conflict, i.e. there is one device under non cache coherency mode which is
>> different from others, it requires a separate stage-2 page table in non-CC
>> mode.
>>
>> SPR platform has ERRATA_772415 which requires no readonly mappings
>> in stage-2 page table. This series supports creating VTDIOASContainer
>> with no readonly mappings. If there is a rare case that some IOMMUs
>> on a multiple IOMMU host have ERRATA_772415 and others not, this
>> design can still survive.
>>
>> See below example diagram for a full view:
>>
>>       IntelIOMMUState
>>              |
>>              V
>>     .------------------.    .------------------.    .-------------------.
>>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
>>     .------------------.    .------------------.    .-------------------.
>>              |                       |                              |
>>              |                       .-->...                        |
>>              V                                                      V
>>       .-------------------.    .-------------------.          .---------------.
>>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>>       .-------------------.    .-------------------.          .---------------.
>>           |            |               |                            |
>>           |            |               |                            |
>>     .-----------.  .-----------.  .------------.              .------------.
>>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>     .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> Passthrough device should use iommufd backend to work with stage-1 translation.
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If host doesn't support nested translation, qemu will fail with an unsupported
>> report.
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> Fault report isn't supported in this series, we presume guest kernel always
>> construct correct S1 page table for passthrough device. For emulated devices,
>> the emulation code already provided S1 fault injection.
>>
>> PATCH1-6: Add HWPT-based nesting infrastructure support
>
>The first 6 patches are all VFIO or IOMMUFD related. They are
>mostly additions and I didn't see anything wrong. They could
>be merged in advance through the VFIO tree.

OK, I'll send a prerequisite series shortly, containing only the first 6 patches with the suggested changes.

Thanks
Zhenzhong