OK. Let me clarify this at the top as I see the gap here now:

First, the vSMMU model is based on Zhenzhong's older series, which
keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, whereas
this RFCv3 series keeps only an hwpt_id there. This ioas_id is
allocated when a passthrough cdev attaches to a VFIO container.

Second, the vSMMU model reuses the default IOAS via that ioas_id.
Since the VFIO container doesn't allocate a nesting parent S2 HWPT
(maybe it could?), the vSMMU allocates another S2 HWPT in the
vIOMMU code.

Third, the vSMMU model, for invalidation efficiency and HW Queue
support, isolates all emulated devices out of the nesting-enabled
vSMMU instance, as suggested by Jason. So, only passthrough devices
would use the nesting-enabled vSMMU instance, meaning there is no
need for IOMMU_NOTIFIER_IOTLB_EVENTS:
 - MAP is not needed as there is no shadow page table. QEMU only
   traps the page table pointer and forwards it to host kernel.
 - UNMAP is not needed as QEMU only traps invalidation requests
   and forwards them to host kernel.
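
To make the two points above concrete, here is a toy sketch of the
trap-and-forward flow. It is not QEMU code: ToyHostKernel and both
helpers are hypothetical names that only model "forward to the host
kernel, shadow nothing".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for state that lives in the host kernel. */
typedef struct {
    uint64_t s1_table_ptr;             /* last guest S1 table pointer forwarded */
    unsigned forwarded_invalidations;  /* invalidations passed through */
    unsigned shadow_maps;              /* stays 0: QEMU shadows nothing */
} ToyHostKernel;

/* Guest programmed a page table pointer: forward it as-is (no MAP event). */
static void toy_trap_table_ptr(ToyHostKernel *host, uint64_t guest_ptr)
{
    assert(host);
    host->s1_table_ptr = guest_ptr;
}

/* Guest issued an invalidation: forward it (no UNMAP event). */
static void toy_trap_invalidation(ToyHostKernel *host)
{
    assert(host);
    host->forwarded_invalidations++;
}
```

Neither helper touches shadow_maps, which is the whole point: with
no shadow page table there is nothing for MAP/UNMAP notifiers to
maintain.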

(let's forget about the "address space switch" for MSI for now.)

So, in the vSMMU model, there is actually no need for the iommu
AS. And there is only one IOAS in the VM instance allocated by the
VFIO container. And this IOAS manages the GPA->PA mappings. So,
get_address_space() returns the system AS for passthrough devices.
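
Roughly, the selection I have in mind looks like the sketch below.
All types and names here (ToyAddressSpace, ToyVIOMMU, ToyDevice,
toy_get_address_space) are made up for illustration and are not the
real QEMU structures or callback signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not QEMU's AddressSpace or SMMU state. */
typedef struct { const char *name; } ToyAddressSpace;

typedef struct {
    bool nesting;                /* nesting-enabled vSMMU instance */
    ToyAddressSpace *system_as;  /* backed by the VFIO container's IOAS */
    ToyAddressSpace *iommu_as;   /* needs IOTLB notifiers */
} ToyVIOMMU;

typedef struct { bool passthrough; } ToyDevice;

/* Passthrough devices behind a nesting instance stay in the system AS. */
static ToyAddressSpace *toy_get_address_space(const ToyVIOMMU *v,
                                              const ToyDevice *d)
{
    assert(v && d);
    if (v->nesting && d->passthrough) {
        return v->system_as;
    }
    return v->iommu_as;
}
```

With this shape, VFIO's default listener keeps managing the GPA->PA
mappings in the one IOAS, and the iommu AS is only handed to devices
that really need the notifier path.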

On the other hand, the VT-d model is a bit different. It's a giant
vIOMMU for all devices (either passthrough or emulated). For all
emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
iommu address space returned via get_address_space().

That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
for passthrough devices, right?

IIUC, in the VT-d model, a passthrough device also gets attached
to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
But the model returns the iommu address space for it, treating it
like an emulated device, although the underlying MR of the returned
IOMMU AS is backed by a nodmar MR (that is essentially a system AS).

This seems to completely ignore the default IOAS owned by the VFIO
container, because it needs to bypass those RO mappings(?)

Then for passthrough devices, the VT-d model allocates an internal
IOAS that further requires an internal S2 listener, which seems like
a large duplication of what the VFIO container already does.

So, here are things that I want us to conclude:
 1) Since the VFIO container already has an IOAS for a passthrough
    device, and IOMMU_NOTIFIER_IOTLB_EVENTS doesn't seem to be
    needed, why not set up this default IOAS to manage gPA=>PA
    mappings by returning the system AS via get_address_space() for
    passthrough devices?

    I get that the VT-d model might have some concerns about this,
    as the default listener would map those RO regions. Yet, maybe
    the right approach is to figure out a way to bypass RO regions
    in the core vs. duplicating another ioas_alloc()/map() and S2
    listener?

 2) If (1) makes sense, I think we can further simplify the routine
    by allocating a nesting parent HWPT in iommufd_cdev_attach(),
    as long as the attaching device is identified as "passthrough"
    and there is "iommufd" in its "-device" string?

    After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
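
A toy sketch of what (1) and (2) could boil down to, just to anchor
the discussion. Nothing below is an existing QEMU interface: the
TOY_ flag is a local stand-in for IOMMU_HWPT_ALLOC_NEST_PARENT, and
ToyAttachInfo, ToySection and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the uAPI flag; the value here is illustrative. */
#define TOY_HWPT_ALLOC_NEST_PARENT (1u << 0)

/* Hypothetical summary of what the attach path knows about a device. */
typedef struct {
    bool passthrough;   /* device is identified as "passthrough" */
    bool has_iommufd;   /* "iommufd" is present in its -device string */
} ToyAttachInfo;

/* (2): request a nesting parent HWPT only for iommufd passthrough devices. */
static uint32_t toy_hwpt_alloc_flags(const ToyAttachInfo *info)
{
    assert(info);
    if (info->passthrough && info->has_iommufd) {
        return TOY_HWPT_ALLOC_NEST_PARENT;
    }
    return 0;
}

/* Hypothetical view of a memory section seen by the core listener. */
typedef struct { bool readonly; } ToySection;

/* (1): let the core skip RO sections instead of adding a second listener. */
static bool toy_should_map(const ToySection *s, bool skip_readonly)
{
    assert(s);
    return !(skip_readonly && s->readonly);
}
```

Whether skip_readonly would be a per-container or a per-vIOMMU knob
is left open here; the sketch only shows that the decision can sit
in one place.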

On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:
> vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:
> 40000, vaddr: 7fb314200000, RO
> vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:
> 20000, vaddr: 7fb206c00000, RO
..
> vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:
> 1a000, vaddr: 7fb207ece000, RO

OK. They look like memory carveouts for FWs. "iova" is gPA, right?

And they can be within the guest RAM range.

Mind elaborating why they shouldn't be mapped onto nesting parent
S2?

> IMHO. At least for vfio devices, I can see only one get_address_space()
> call. So even there are two ASs, how should the vfio be notified when the
> AS changed? Since vIOMMU is the source of map/umap requests, it looks fine
> to always return iommu AS and handle the AS switch by switching the enabled
> subregions according to the guest vIOMMU translation types.

No, VFIO doesn't get notified when the AS changes.

The vSMMU model wants VFIO to stay in the system AS since the VFIO
container manages the S2 mappings for guest PA.

The "switch" in the vSMMU model is only needed by KVM for MSI
doorbell translation. Thinking about it carefully, maybe it
shouldn't switch AS, because VFIO might be confused if it somehow
calls get_address_space() again in the future.

Thanks
Nic
