Hi,

On 7/16/25 8:26 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicol...@nvidia.com>
>> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>> accelerated SMMUv3 to vfio-pci endpoints with iommufd
>>
>> On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
>>>
>>>> -----Original Message-----
>>>> From: Shameer Kolothum <shameerali.kolothum.th...@huawei.com>
>>>> Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>>>> accelerated SMMUv3 to vfio-pci endpoints with iommufd
>>>>
>>>> Accelerated SMMUv3 is only useful when the device can take advantage of
>>>> the host's SMMUv3 in nested mode. To keep things simple and correct, we
>>>> only allow this feature for vfio-pci endpoint devices that use the iommufd
>>>> backend. We also allow non-endpoint emulated devices like PCI bridges and
>>>> root ports, so that users can plug in these vfio-pci devices.
>>>>
>>>> Another reason for this limit is to avoid problems with IOTLB
>>>> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
>>>> SID, making it difficult to trace the originating device. If we allowed
>>>> emulated endpoint devices, QEMU would have to invalidate both its own
>>>> software IOTLB and the host's hardware IOTLB, which could slow things
>>>> down.
>>>>
>>>> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>>>> translation (S1+S2), their get_address_space() callback must return the
>>>> system address space to enable correct S2 mappings of guest RAM.
>>>>
>>>> So in short:
>>>>  - vfio-pci devices return the system address space
>>>>  - bridges and root ports return the IOMMU address space
>>>>
>>>> Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
>>> So the translation result is a doorbell addr(gpa) for guest?
>>> IIUC, there should be a mapping between guest doorbell addr(gpa) to host
>>> doorbell addr(hpa) in stage2 page table? Where is this mapping setup?
>> Yes and yes.
>>
>> On ARM, MSI is behind IOMMU. When 2-stage translation is enabled,
>> it goes through two stages as you understood.
>>
>> There are a few ways to implement this, though the current kernel
>> only supports one solution, which is a hard-coded RMR (reserved
>> memory region).
>>
>> The solution sets up a RMR region in the ACPI's IORT, which maps
>> the stage1 linearly, i.e. gIOVA=gPA.
>>
>> The gPA=>hPA mappings in the stage-2 are done by the kernel that
>> polls an IOMMU_RESV_SW_MSI region defined in the kernel driver.
>>
>> It's not the ideal solution, but it's the simplest to implement.
>>
>> There are other ways to support this like a true 2-stage mapping
>> but they are still on the way.
>>
>> For more details, please refer to this:
>> https://lore.kernel.org/all/cover.1740014950.git.nicol...@nvidia.com/
> Thanks for the link, it helps much for understanding arm smmu arch.
>
>>>> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
>>>> +{
>>>> +
>>>> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>>>> +        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>>>> +        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>>>> +        return true;
>>>> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>>>> +                object_property_find(OBJECT(pdev), "iommufd"))) {
>>> Will this always return true?
>> It won't if a vfio-pci device doesn't have the "iommufd" property?
> IIUC, iommufd property is always there, just value not filled for legacy
> container case.
> What about checking VFIOPCIDevice.vbasedev.iommufd?
>
>>>> +        *vfio_pci = true;
>>>> +        return true;
>>>> +    }
>>>> +    return false;
>> Then, it returns "false" here.
>>
>>>>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>>>>                                                int devfn)
>>>>  {
>>>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>>>      SMMUState *bs = opaque;
>>>> +    bool vfio_pci = false;
>>>>      SMMUPciBus *sbus;
>>>>      SMMUv3AccelDevice *accel_dev;
>>>>      SMMUDevice *sdev;
>>>>
>>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>>> +        error_report("Device(%s) not allowed. Only PCIe root complex devices "
>>>> +                     "or PCI bridge devices or vfio-pci endpoint devices with "
>>>> +                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
>>>> +                     pdev->name);
>>>> +        exit(1);
>>> Seems aggressive for a hotplug, could we fail hotplug instead of kill QEMU?
>> Hotplug will unlikely be supported well, as it would introduce
>> too much complication.
>>
>> With iommufd, a vIOMMU object is allocated per device (vfio). If
>> the device fd (cdev) is not yet given to the QEMU. It isn't able
>> to allocate a vIOMMU object when creating a VM.
>>
>> While a vIOMMU object can be allocated at a later stage once the
>> device is hotplugged. But things like IORT mappings aren't able
>> to get refreshed since the OS is likely already booted. Even an
>> IOMMU capability sync via the hw_info ioctl will be difficult to
>> do at the runtime post the guest iommu driver's initialization.
>>
>> I am not 100% sure. But I think Intel model could have a similar
>> problem if the guest boots with zero cold-plugged device and then
>> hot-plugs a PASID-capable device at the runtime, when the guest-
>> level IOMMU driver is already inited?
> For vtd we define a property for each capability we care about.
> When hotplug a device, we get hw_info through ioctl and compare
> host's capability with virtual vtd's property setting, if incompatible,
> we fail the hotplug.
>
> In old implementation we sync host iommu caps into virtual vtd's cap,
> but that's Naked by maintainer. The suggested way is to define property
> for each capability we care and do compatibility check.
>
> There is a "pasid" property in virtual vtd, only when it's true, the
> PASID-capable device can work with pasid.
>
> Zhenzhong
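On the "always returns true" point above: instead of only testing that the
"iommufd" property exists, one option would be to read the value of the QOM
link, which stays NULL for the legacy container case. That avoids poking at
VFIOPCIDevice.vbasedev.iommufd directly from hw/arm code. A rough, untested
sketch; the header paths and the assumption that "iommufd" is a link property
on vfio-pci would need to be checked against the actual tree:

#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_bridge.h"   /* TYPE_PCI_BRIDGE */
#include "hw/vfio/pci.h"         /* TYPE_VFIO_PCI */

static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
{
    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
        return true;
    }

    if (object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
        object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
        /* The link is only non-NULL when an iommufd backend was supplied */
        *vfio_pci = true;
        return true;
    }
    return false;
}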
I don't think not supporting hotplug is an option. I agree with Zhenzhong:
we should try to align with the way it is done in intel-iommu and study
whether that approach also fits the needs of the accelerated SMMU.

Thanks

Eric
>
>> FWIW, Shameer's cover-letter has the following line:
>> "At least one vfio-pci device must currently be cold-plugged to
>> a PCIe root complex associated with arm-smmuv3,accel=on."
>>
>> Perhaps there should be a similar highlight in this smmuv3-accel
>> file as well (@Shameer).
>>
>> Nicolin
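To make the intel-iommu-style approach discussed above a bit more concrete,
here is a purely illustrative sketch. The helper, struct and property names
are invented; only struct iommu_hw_info_arm_smmuv3 (the iommufd
IOMMU_GET_HW_INFO uAPI data for SMMUv3) and error_setg() are existing
interfaces. The point is just that an incompatible capability fails the
(hot)plug with an Error instead of calling exit(1):

#include "qemu/osdep.h"
#include "qemu/bitops.h"
#include "qapi/error.h"
#include <linux/iommufd.h>

/* Hypothetical per-capability properties on the accelerated SMMU */
typedef struct SMMUv3AccelCaps {
    bool ats;
} SMMUv3AccelCaps;

/*
 * Would be called when a vfio-pci device is (hot)plugged behind the
 * accelerated SMMU, with the data returned by IOMMU_GET_HW_INFO for
 * that device.
 */
static bool smmuv3_accel_check_hw_compat(SMMUv3AccelCaps *caps,
                                         struct iommu_hw_info_arm_smmuv3 *info,
                                         Error **errp)
{
    bool host_ats = extract32(info->idr[0], 10, 1);   /* SMMU_IDR0.ATS */

    if (caps->ats && !host_ats) {
        /* Fail the plug instead of killing QEMU */
        error_setg(errp, "arm-smmuv3,accel=on: ats=on requested but the "
                   "host SMMUv3 does not support ATS");
        return false;
    }
    return true;
}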