Hi,

On 7/16/25 8:26 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicol...@nvidia.com>
>> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>> accelerated SMMUv3 to vfio-pci endpoints with iommufd
>>
>> On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
>>>
>>>> -----Original Message-----
>>>> From: Shameer Kolothum <shameerali.kolothum.th...@huawei.com>
>>>> Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated
>>>> SMMUv3 to vfio-pci endpoints with iommufd
>>>>
>>>> Accelerated SMMUv3 is only useful when the device can take advantage of
>>>> the host's SMMUv3 in nested mode. To keep things simple and correct, we
>>>> only allow this feature for vfio-pci endpoint devices that use the iommufd
>>>> backend. We also allow non-endpoint emulated devices like PCI bridges
>>>> and root ports, so that users can plug in these vfio-pci devices.
>>>>
>>>> Another reason for this limit is to avoid problems with IOTLB
>>>> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
>>>> SID, making it difficult to trace the originating device. If we allowed
>>>> emulated endpoint devices, QEMU would have to invalidate both its own
>>>> software IOTLB and the host's hardware IOTLB, which could slow things
>>>> down.
>>>>
>>>> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>>>> translation (S1+S2), their get_address_space() callback must return the
>>>> system address space to enable correct S2 mappings of guest RAM.
>>>>
>>>> So in short:
>>>> - vfio-pci devices return the system address space
>>>> - bridges and root ports return the IOMMU address space
>>>>
>>>> Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
>>> So the translation result is a doorbell addr (gPA) for the guest?
>>> IIUC, there should be a mapping from the guest doorbell addr (gPA) to the
>>> host doorbell addr (hPA) in the stage-2 page table? Where is this mapping
>>> set up?
>> Yes and yes.
>>
>> On ARM, MSI is behind the IOMMU. When 2-stage translation is enabled,
>> it goes through both stages, as you understood.
>>
>> There are a few ways to implement this, though the current kernel
>> only supports one solution, which is a hard-coded RMR (reserved
>> memory region).
>>
>> The solution sets up an RMR region in the ACPI IORT, which maps
>> stage 1 linearly, i.e. gIOVA=gPA.
>>
>> The gPA=>hPA mappings in stage 2 are set up by the kernel, using the
>> IOMMU_RESV_SW_MSI region defined in the kernel driver.
>>
>> It's not the ideal solution, but it's the simplest to implement.
>>
>> There are other ways to support this, like a true 2-stage mapping,
>> but they are still on the way.
>>
>> For more details, please refer to this:
>> https://lore.kernel.org/all/cover.1740014950.git.nicol...@nvidia.com/
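To make the two stages concrete, here is a tiny standalone sketch (not QEMU
or kernel code; the SW_MSI base and doorbell addresses below are invented)
of how an MSI write from a passthrough device gets resolved with the RMR
approach:

/* Conceptual sketch only: models the two translation stages for an MSI
 * doorbell write when the guest relies on the hard-coded RMR. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stage 1 (guest-owned): the IORT RMR tells the guest this window is
 * identity-mapped, so gIOVA == gPA for the MSI doorbell. */
static uint64_t stage1_rmr(uint64_t giova)
{
    return giova;
}

/* Stage 2 (host-kernel-owned): the kernel maps the IOMMU_RESV_SW_MSI
 * window, so the gPA resolves to the physical ITS doorbell page. */
static uint64_t stage2_sw_msi(uint64_t gpa)
{
    const uint64_t sw_msi_base_gpa = 0x8000000;    /* invented */
    const uint64_t its_doorbell_hpa = 0x10040000;  /* invented */

    return its_doorbell_hpa + (gpa - sw_msi_base_gpa);
}

int main(void)
{
    uint64_t giova = 0x8000000;  /* MSI address the guest programs */
    uint64_t gpa = stage1_rmr(giova);
    uint64_t hpa = stage2_sw_msi(gpa);

    printf("gIOVA 0x%" PRIx64 " -> gPA 0x%" PRIx64 " -> hPA 0x%" PRIx64 "\n",
           giova, gpa, hpa);
    return 0;
}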
> Thanks for the link, it helps a lot in understanding the ARM SMMU
> architecture.
>
>>>> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
>>>> +{
>>>> +
>>>> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>>>> +        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>>>> +        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>>>> +        return true;
>>>> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>>>> +        object_property_find(OBJECT(pdev), "iommufd"))) {
>>> Will this always return true?
>> It won't if a vfio-pci device doesn't have the "iommufd" property?
> IIUC, the iommufd property is always there; its value is just not filled
> in the legacy container case.
> What about checking VFIOPCIDevice.vbasedev.iommufd?
>
>>>> +        *vfio_pci = true;
>>>> +        return true;
>>>> +    }
>>>> +    return false;
>> Then, it returns "false" here.
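If checking the property's value (rather than its mere existence) is what
we want here, a minimal sketch could look like the following. It assumes
"iommufd" is a QOM link property on vfio-pci, so its target can be read
generically without exposing VFIOPCIDevice internals to this file:

/* Sketch only, not the patch's code: accept a vfio-pci endpoint only
 * when its "iommufd" link property actually points at a backend. */
static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
{
    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
        return true;
    }

    if (object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
        object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
        *vfio_pci = true;
        return true;
    }

    return false;
}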
>>
>>>> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>>>>                                               int devfn)
>>>> {
>>>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>>>     SMMUState *bs = opaque;
>>>> +    bool vfio_pci = false;
>>>>     SMMUPciBus *sbus;
>>>>     SMMUv3AccelDevice *accel_dev;
>>>>     SMMUDevice *sdev;
>>>>
>>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>>> +        error_report("Device(%s) not allowed. Only PCIe root complex devices "
>>>> +                     "or PCI bridge devices or vfio-pci endpoint devices with "
>>>> +                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
>>>> +                     pdev->name);
>>>> +        exit(1);
>>> Seems aggressive for a hotplug; could we fail the hotplug instead of
>>> killing QEMU?
>> Hotplug is unlikely to be supported well, as it would introduce
>> too much complication.
>>
>> With iommufd, a vIOMMU object is allocated per (vfio) device. If
>> the device fd (cdev) is not yet given to QEMU, it isn't able
>> to allocate a vIOMMU object when creating a VM.
>>
>> A vIOMMU object can still be allocated at a later stage once the
>> device is hotplugged, but things like IORT mappings can't be
>> refreshed since the OS is likely already booted. Even an IOMMU
>> capability sync via the hw_info ioctl will be difficult to do at
>> runtime, after the guest iommu driver's initialization.
>>
>> I am not 100% sure, but I think the Intel model could have a similar
>> problem if the guest boots with zero cold-plugged devices and then
>> hot-plugs a PASID-capable device at runtime, when the guest-level
>> IOMMU driver is already initialized?
> For vtd we define a property for each capability we care about.
> When hotplugging a device, we get hw_info through an ioctl and compare
> the host's capability with the virtual vtd's property setting; if they
> are incompatible, we fail the hotplug.
>
> In the old implementation we synced the host iommu caps into the virtual
> vtd's caps, but that was NAKed by the maintainer. The suggested way is to
> define a property for each capability we care about and do a
> compatibility check.
>
> There is a "pasid" property in the virtual vtd; only when it is true can
> a PASID-capable device work with PASID.
>
> Zhenzhong

I don't think not supporting hotplug is an option. I agree with
Zhenzhong: we should try to align with the way it is done on intel-iommu
and study whether it also fits the needs of the accelerated SMMU; see the
rough sketch below.
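For illustration only, something along these lines, where every name
(SMMUv3AccelCaps, accel_check_hw_compat, the "ats" property) is
hypothetical and not from this series, and "host" would be filled from
the hw_info ioctl at (hot)plug time:

#include "qapi/error.h"

/* Hypothetical sketch: the virtual SMMU exposes explicit capability
 * properties (like vtd's "pasid") and the device attach fails, rather
 * than exiting QEMU, when the host SMMUv3 cannot back them. */
typedef struct SMMUv3AccelCaps {
    bool ats;    /* user-visible property, analogous to vtd's "pasid" */
} SMMUv3AccelCaps;

static bool accel_check_hw_compat(const SMMUv3AccelCaps *want,
                                  const SMMUv3AccelCaps *host,
                                  Error **errp)
{
    if (want->ats && !host->ats) {
        error_setg(errp, "host SMMUv3 lacks ATS required by 'ats=on', "
                   "failing the device attach");
        return false;
    }
    return true;
}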

Thanks

Eric
>
>> FWIW, Shameer's cover-letter has the following line:
>> "At least one vfio-pci device must currently be cold-plugged to
>>  a PCIe root complex associated with arm-smmuv3,accel=on."
>>
>> Perhaps there should be a similar highlight in this smmuv3-accel
>> file as well (@Shameer).
>>
>> Nicolin

