On Thu, May 15, 2025 at 01:36:38PM -0700, Nathan Chen via Devel wrote:
> Hi,
> 
> This is a follow up to the first RFC patchset [0] for supporting multiple
> vSMMU instances in a qemu VM. This patchset also introduces support for
> using iommufd to propagate DMA mappings to kernel for assigned devices.
> 
> This patchset implements support for specifying multiple <iommu> devices
> within the VM definition when smmuv3Dev IOMMU model is specified, and is
> tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
> 
> Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef,
> in order to represent the iommufd object in qemu command line. This
> patchset also implements new 'iommufdId' and 'iommufdFd' attributes for
> hostdev devices to be associated with the iommufd object.
> 
> For instance, specifying the iommufd object and associated hostdev in a
> VM definition with multiple IOMMUs, configured to be routed to
> pcie-expander-bus controllers in a way where VFIO device to SMMUv3
> associations are matched with the host (pcie-expander-bus and
> pcie-root-port controllers are no longer auto-added/auto-routed
> like in the first revision of this RFC, as the PCIe topology will be
> configured by management apps):
> 
>   <devices>
> ...
>     <controller type='pci' index='1' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='252'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
>     </controller>
>     <controller type='pci' index='2' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='248'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
>     </controller>
> ...
>     <controller type='pci' index='21' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='21' port='0x0'/>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
>     </controller>
>     <controller type='pci' index='22' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='22' port='0xa8'/>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
>     </controller>
> ...
>     <hostdev mode='subsystem' type='pci' managed='no'>
>       <source>
>         <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>       <iommufdId>iommufd0</iommufdId>
>       <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/>
>     </hostdev>
>     <hostdev mode='subsystem' type='pci' managed='no'>
>       <source>
>         <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>       <iommufdId>iommufd0</iommufdId>
>       <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/>
>     </hostdev>
>     <iommu model='smmuv3Dev'>
>       <iommufd>
>         <id>iommufd0</id>
>       </iommufd>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/>

IIUC, you're using <address> here to reference the earlier <controller>
pcie-expander-bus. This is a bit weird, as it makes it look like the
smmuv3Dev itself has a PCI address, when this is really just the PCI
address of the controller.

The smmuv3Dev also doesn't have an address on the pcie-expander-bus;
it is just an association IIUC.

So from this pov, I think I'd be inclined to say we should just
reference the <controller> based on its index, using an attribute:

  <iommu model='smmuv3Dev' controller='2'/>
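
Applied to the example above, and assuming such a 'controller'
attribute were added (it is only a suggestion here, nothing in
libvirt implements it today), the two <iommu> entries might look
roughly like this, each pointing at the index of one of the
pcie-expander-bus controllers:

  <iommu model='smmuv3Dev' controller='1'/>
  <iommu model='smmuv3Dev' controller='2'/>

These would presumably still map to "bus":"pci.1" and "bus":"pci.2"
on the two arm-smmuv3-accel devices, without implying that the SMMU
occupies a slot on the expander bus.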


>     </iommu>
>     <iommu model='smmuv3Dev'>
>       <iommufd>
>         <id>iommufd0</id>
>       </iommufd>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
>     </iommu>
>   </devices>
> 
> This would get translated to a qemu command line with the arguments below:
> 
>  -device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \
>  -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \
>  -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \
>  -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \
>  -object '{"qom-type":"iommufd","id":"iommufd0"}' \
>  -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \
>  -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \
>  -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci.21","addr":"0x0"}' \
>  -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci.22","addr":"0x0"}' \

The iommufd integration in the XML looks a bit weird too - we have
four different elements all referencing 'iommufd0', but nothing
is defining this. The iommu references the iommufd0, but nothing
actually uses this on the arm-smmuv3-accel command line.


I've not been paying much attention to iommufd in QEMU, but IIUC
it will apply to x86_64 too. So I'm wondering how iommufd integration
should work in libvirt more broadly.

> If users would like to leverage qemu's iommufd feature to open the VFIO
> cdev and /dev/iommu via an external management layer, the fd can be
> specified like so in the VM definition:
> 
>   <devices>
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/>
>       </source>
>       <iommufdId>iommufd0</iommufdId>
>       <iommufdFd>23</iommufdFd>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
>     </hostdev>
>     <iommu model='intel'>
>       <iommufd>
>         <id>iommufd0</id>
>         <fd>22</fd>
>       </iommufd>
>     </iommu>
>   </devices>
> 
> This would get translated to a qemu command line with the arguments below:
> 
> -object '{"qom-type":"iommufd","id":"iommufd0","fd":"22"}' \
> -device '{"driver":"vfio-pci","host":"0000:06:12.2","id":"hostdev1","iommufd":"iommufd0","fd":"23","bus":"pci.0","addr":"0x3"}' \

I'm not getting why we have multiple different FDs here, when
we only have a single iommufd for the VM?

> 
> Summary of changes:
> - Introduced support for specifying multiple <iommu> stanzas in the VM
> XML definition when using smmuv3Dev model.
> - Automating PCIe topology to populate VM definition with multiple vSMMUs
> routed to pcie-expander-bus controllers is excluded, in favor of
> deferring creation of PXBs and routing of VFIO devices to management apps.
> - Introduced iommufd support.
> 
> TODO:
> - I updated the namespace and cgroup configuration to allow access to iommufd
> paths at /dev/vfio/devices/vfio* and /dev/iommu. However, qemu needs to be
> launched with user and group set to 'root' in order for these paths to be
> accessible. A passthrough device represented by /dev/vfio/18 normally has
> 'root' user and group permissions, but in the mount namespace it's changed to
> 'libvirt-qemu' and 'kvm'. I wasn't able to discern where this is happening by
> looking at src/qemu/qemu_namespace.c and src/qemu/qemu_cgroup.c. Would you have
> any pointers on how to change the iommufd paths' user and group permissions in
> the libvirt mount namespace?

All permissions are handled by the security managers in src/security,
both DAC file permissions/ownership and SELinux labelling.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
