Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host

Yi Liu Mon, 26 May 2025 00:21:07 -0700

On 2025/5/24 05:12, Nicolin Chen wrote:

Hey,


Thanks for the reply.

Just want to say that I am asking a lot to understand why VT-d is
different than ARM, so as to decide whether ARM should follow VT-d
implementing a separate listener or just use the VFIO listener.

On Fri, May 23, 2025 at 02:22:15PM +0800, Yi Liu wrote:

Hey Nic,

On 2025/5/22 06:49, Nicolin Chen wrote:

On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:

+static const MemoryListener iommufd_s2domain_memory_listener = {
+    .name = "iommufd_s2domain",
+    .priority = 1000,
+    .region_add = iommufd_listener_region_add_s2domain,
+    .region_del = iommufd_listener_region_del_s2domain,
+};


Would you mind elaborating on when and how VT-d does all the S2 mappings?

On ARM, the default vfio_memory_listener could capture the entire
guest RAM and add to the address space. So what we do is basically
reusing the vfio_memory_listener:
https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.th...@huawei.com/


In concept yes, all the guest RAM. But due to an erratum, we need
to skip the RO mappings.


Mind elaborating on what the RO mappings are? Can they occur within
the range of the RAM?

Below are the RO regions when booting a Q35 machine (this is the PCIe-capable
platform and also vIOMMU capable) with 4GB memory. For the bios and rom
regions, it looks reasonable. I'm not quite sure why there is RO RAM yet.
But it seems to be the fact we need to face.

vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:40000, vaddr: 7fb314200000, RO
vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:20000, vaddr: 7fb206c00000, RO
vfio_listener_region_add, section->mr->name: pc.bios, iova: e0000, size:20000, vaddr: 7fb314220000, RO
vfio_listener_region_add, section->mr->name: pc.rom, iova: d8000, size:8000, vaddr: 7fb206c18000, RO
vfio_listener_region_add, section->mr->name: pc.bios, iova: e0000, size:10000, vaddr: 7fb314220000, RO
vfio_listener_region_add, section->mr->name: vga.rom, iova: febc0000, size:10000, vaddr: 7fb205800000, RO
vfio_listener_region_add, section->mr->name: virtio-net-pci.rom, iova:feb80000, size: 40000, vaddr: 7fb205600000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: c0000, size:b000, vaddr: 7fb207ec0000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:a000, vaddr: 7fb207ece000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: f0000, size:10000, vaddr: 7fb207ef0000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:1a000, vaddr: 7fb207ece000, RO

The thing is that when a VFIO device is attached to the container
upon a nesting configuration, the ->get_address_space op should
return the system address space as S1 nested HWPT isn't allocated
yet. Then all the iommu as routines in vfio_listener_region_add()
would be skipped, ending up with mapping the guest RAM in S2 HWPT
correctly. Not until the S1 nested HWPT is allocated by the guest
OS (after guest boots), can the ->get_address_space op return the
iommu address space.


This seems a bit different between ARM and VT-d emulation. The VT-d
emulation code returns the iommu address space regardless of what
translation mode guest configured. But the MR of the address space
has two overlapped subregions, one is nodmar, another one is iommu.
As the naming shows, the nodmar is aliased to the system MR.


OK. But why two overlapped subregions vs. two separate ASs?


TBH, I don't know the exact reason for it. +Cc Peter in case he
remembers.

IMHO, at least for vfio devices, I can see only one get_address_space()
call. So even if there were two ASs, how would vfio be notified when the
AS changes? Since the vIOMMU is the source of map/unmap requests, it looks
fine to always return the iommu AS and handle the AS switch by switching the
enabled subregions according to the guest vIOMMU translation type.

And before
the guest enables the iommu and sets PGTT to a non-PT mode (e.g. S1 or S2),
the effective MR alias is the nodmar, hence the mappings this address
space holds are the GPA mappings in the beginning.


I think this is the same on ARM, where get_address_space() may return
the system address space. And for VT-d, it actually returns the range
of the system address space (just through a sub-MR of an iommu AS),
right?


Hmm, I don't quite get why it is similar. As I replied, the VT-d
emulation code returns the iommu AS in get_address_space(). I didn't see
where it returns address_space_memory (the system address space).

If the guest sets PGTT to S2,
then the iommu MR is enabled, hence the mappings are gIOVA mappings
accordingly. So in VT-d emulation, the address space switch is really an MR
alias switch.


Zhenzhong said that there is no shadow page table for the nesting
setup, i.e. gIOVA=>gPA mappings are entirely done by the guest OS.

Then, why does VT-d need to switch to the iommu MR here?


What I described in the prior email is the general idea of the AS switching
before this series. Nesting for sure does not need this switching, just like
PT.
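
To make the alias switching described in this thread concrete, here is a minimal standalone sketch. It is not the intel_iommu code: the demo_* names are hypothetical, and it only encodes what was said above (nodmar stays effective until the guest enables translation with PGTT set to S2; PT and, with this series, S1 need no switch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guest translation types; not the VT-d register encoding. */
enum demo_pgtt { DEMO_PGTT_PT, DEMO_PGTT_S1, DEMO_PGTT_S2 };

/* Hypothetical model of one address space with two overlapped subregions. */
struct demo_vtd_as {
    bool nodmar_enabled; /* alias of the system MR, holds GPA mappings */
    bool iommu_enabled;  /* iommu MR, holds gIOVA mappings */
};

/* Enable exactly one of the two subregions for the current guest setup. */
static void demo_switch_alias(struct demo_vtd_as *as, bool dmar_on,
                              enum demo_pgtt pgtt)
{
    bool use_iommu;

    assert(as != NULL);
    /* Only guest S2 translation needs the iommu MR in this sketch. */
    use_iommu = dmar_on && pgtt == DEMO_PGTT_S2;
    as->iommu_enabled = use_iommu;
    as->nodmar_enabled = !use_iommu;
}
```

The address space object returned to vfio never changes in this model; only the enabled subregion does.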

In this series, we mainly want to support the S1 translation type for the
guest. It is based on nested translation, which needs an S2 domain that holds
the GPA mappings. Besides the S1 translation type, PT is also supported. Both
types need an S2 domain which already holds the GPA mappings. So we have
this internal listener.


Hmm, the reasoning to the last "so" doesn't sound enough. The VFIO
listener could do the same...


yes. I just realized that RO mappings should be allowed for the normal
S2 domains. Only the nested parent S2 domain should skip the RO mappings.
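
A rough sketch of that filtering rule, written as a standalone predicate rather than as QEMU code. The struct, the enum and demo_s2_should_map() are hypothetical names for illustration; the real listener callbacks operate on MemoryRegionSection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the section attributes that matter here. */
struct demo_section {
    const char *name; /* e.g. "pc.bios" or "pc.ram" */
    bool readonly;    /* reported as RO by the listener */
};

/* Hypothetical kinds of stage-2 domain. */
enum demo_s2_kind {
    DEMO_S2_NORMAL,        /* plain S2 domain: RO mappings are allowed */
    DEMO_S2_NESTED_PARENT, /* parent of a nested S1 hwpt: skip RO mappings */
};

/* Return true if the section should be mapped into the given S2 domain. */
static bool demo_s2_should_map(const struct demo_section *sec,
                               enum demo_s2_kind kind)
{
    assert(sec != NULL);
    if (sec->readonly && kind == DEMO_S2_NESTED_PARENT) {
        return false;
    }
    return true;
}
```

With the RO regions listed earlier, pc.bios would be skipped for the nested parent domain but still mapped for a normal S2 domain.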

Also, we want to skip RO mappings on S2, so that's
another reason for it.  @Zhenzhong, perhaps, it can be described in the
commit message why an internal listener is introduced.


OK. I think that can be a good reason to have an internal listener,
only if VFIO can't skip the RO mappings.

So the second question is:
Does vtd have to own this iommufd_s2domain_memory_listener? IOW,


Yes, based on the current design. When guest PGTT==PT, attach the device
to the S2 hwpt; when it goes to S1, attach it to an S1 hwpt whose
parent is the aforementioned S2 hwpt. This S2 hwpt is always there
for use.
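
The attach rule above could be sketched roughly as below. This is an illustration only: demo_hwpt, demo_guest_mode and demo_pick_hwpt() are hypothetical names, not iommufd or QEMU APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hwpt handle; parent is NULL for the S2 hwpt. */
struct demo_hwpt {
    const char *label;
    const struct demo_hwpt *parent;
};

/* Hypothetical guest modes covered by the description above. */
enum demo_guest_mode { DEMO_GUEST_PT, DEMO_GUEST_S1 };

/*
 * Pick the hwpt a device should be attached to. The S2 hwpt always exists;
 * the S1 hwpt is only needed for guest S1 and must be nested on that S2 hwpt.
 */
static const struct demo_hwpt *demo_pick_hwpt(enum demo_guest_mode mode,
                                              const struct demo_hwpt *s2,
                                              const struct demo_hwpt *s1)
{
    assert(s2 != NULL);
    if (mode == DEMO_GUEST_PT) {
        return s2;
    }
    assert(s1 != NULL && s1->parent == s2);
    return s1;
}
```

In both modes the same S2 hwpt stays in place, either as the attach target or as the parent of the attach target.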


ARM is doing the same thing. And the exact point "this S2 hwpt is
always there for use" has been telling me that the device can just
stay at the S2 address space (system), since the guest kernel will
take care of the S1 address space (iommu).

Overall, the questions here have been two-fold:

1. Why does VT-d need an internal listener?

   I can see the (only) reason is for the RO mappings.

   Yet, is there anything that we can do to the VFIO listener to
   bypass these RO mappings?

2. Why not return the system AS all the time when nesting is on?
   Why switch to the iommu AS when the device attaches to the S1 HWPT?


No switch if we are going to set up nesting.

Just got a question on the ARM side. IIUC, the ARM emulation code returns
the system address space in the get_address_space() op before the guest
enables the vIOMMU. Hence the IOAS on the vfio side is the GPA IOAS. When the
guest enables the vIOMMU, the emulation returns the iommu address space.
Hence, does the vfio side need to switch to a gIOVA IOAS? My question is: if
the guest is setting S1 translation, and the emulation code figures out it is
going to set up nested translation, will the get_address_space() op return
the iommu address space as well? If so, where is the GPA IOAS located? In
this series, the VT-d emulation code actually has an internal GPA IOAS which
skips RO mappings.

--
Regards,
Yi Liu

Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
