On 22/1/26 04:09, Jason Gunthorpe wrote:
On Wed, Jan 21, 2026 at 12:08:19PM +1100, Alexey Kardashevskiy wrote:
I mean that the HW requires multiple SW controlled tables to all
have matching page sizes. Instead the HW should read all the tables
and compute the appropriate smallest size automatically.

Not sure I follow. IOMMU table matches the QEMU page table, it is
two tables already and IOMMU cannot just blindly use 2M PTEs if the
guest is backed with 4K pages.

That is just because AMD HW can't handle it.

For example if you look at the CPU when the guest S1 page table has a
1G PTE and the KVM S2 has a 2M PTE the CPU doesn't explode, it walks
the S1, walks the S2 and loads a 2M PTE into the TLB.
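
As a rough illustration of the CPU behaviour described above, here is a minimal sketch of how a nested walk could pick the TLB entry size. The function name and the shift constants are hypothetical and only model the "smallest size wins" rule; they are not taken from any real kernel or hardware interface.

```c
/* Page sizes expressed as log2 of the size in bytes (illustrative names). */
enum { SHIFT_4K = 12, SHIFT_2M = 21, SHIFT_1G = 30 };

/*
 * Hypothetical model: a combined TLB entry cannot cover more than the
 * smaller of the stage-1 and stage-2 leaf mappings, so the walker
 * installs an entry of the smaller size.
 */
static unsigned int tlb_entry_shift(unsigned int s1_shift,
				    unsigned int s2_shift)
{
	return s1_shift < s2_shift ? s1_shift : s2_shift;
}
```

With a 1G S1 PTE and a 2M S2 PTE, tlb_entry_shift(SHIFT_1G, SHIFT_2M) yields SHIFT_2M, which matches the example in the paragraph above.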

This issue with the RMP is no different, if you get a 2M IOPTE then
the HW should check the RMP and load in a 4K IOPTE to the IOTLB if
that is what the RMP requires.
That the HW doesn't do that means you have all these difficult
problems.

Got it. Interestingly the HW actually does that, almost. Say, for >=2MB IO
pages it checks that RMP==2MB and, if so, puts a 2MB IO TLB entry, and for
4KB..1MB IO pages it puts a 4K IO TLB entry and checks that RMP==4K. But it
does not cross the 2MB boundary in the RMP. Uff :-/
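
The rule above can be written down as a small model. This is only a sketch of one reading of the description: the type and function names are hypothetical, and treating a size mismatch as a rejected DMA is an assumption based on the "DMA will fail" remark later in this thread, not on any specification.

```c
#include <stdbool.h>

enum rmp_size { RMP_4K, RMP_2M };	/* page size recorded in the RMP entry */

struct iotlb_result {
	bool ok;		/* false: RMP check failed (assumed: DMA rejected) */
	unsigned int shift;	/* log2 size of the IOTLB entry when ok */
};

/*
 * Hypothetical model of the check as described in the email:
 *  - IO page >= 2MB: requires RMP==2MB, installs a 2MB IOTLB entry
 *  - IO page 4KB..1MB: requires RMP==4K, installs a 4K IOTLB entry
 * The HW never splits a 2MB IO page to fit a 4K RMP entry.
 */
static struct iotlb_result iommu_rmp_check(unsigned int iopte_shift,
					   enum rmp_size rmp)
{
	struct iotlb_result r = { false, 0 };

	if (iopte_shift >= 21) {
		if (rmp == RMP_2M) {
			r.ok = true;
			r.shift = 21;
		}
	} else if (rmp == RMP_4K) {
		r.ok = true;
		r.shift = 12;
	}
	return r;
}
```

In this model a 2MB IOPTE over a 4K RMP entry, or a 4K IOPTE over a 2MB RMP entry, both come back with ok == false, which is the mismatch the rest of the thread is about.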

I don't think you need hitless here, if the guest is doing
encrypted/decrypted conversions then it can be expected to not do DMA
at the same time, or at least it is OK if DMA during this period
fails.

The guest converts only a handful of 4Ks (say, the guest userspace
wants to read certificates from guest-os->host-os->fw) and only that
converted part is not expected to see DMA, while the rest of the 2MB
page is still DMA-able.

Yes, that's very true!


On the other hand, without swiotlb, dma_map() in the guest for an untrusted
device is likely to be a lot less than 2MB and is going to share another
handful of pages, but this activity is not that rare compared to my
certificates example. If only there was a way to somehow bundle such
allocations/mappings... :-/


So long as the VMM gets a chance to fix the iommu before the guest
understands the RMP change is completed it would be OK.

The IOMMU HW needs to understand the change too. After I smash an IO
PDE, there is a small window before smashing the RMP entry when
incoming traffic may hit the not-yet-converted part of a 2MB page and
the RMP check in the IOMMU will fail. The HW+FW engine mentioned above
can stall DMA for a few ms while it is smashing things.

oh but I can :) It is a FW call which takes a pointer to a 2MB
IOPDE, a new table of 4K PTEs filled with the old PDE's pfn plus
offsets and then the FW exchanges the old IOPDE with a new table and
smashes the corresponding RMP, and it suspends the DMA while doing
so.
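
For illustration, the "new table of 4K PTEs filled with the old PDE's pfn plus offsets" could be prepared roughly as below. This is a sketch only: the PTE layout (pfn shifted by 12, ORed with some attribute bits), the prot_bits parameter and the function name are assumptions made for the example and do not reflect the real AMD IOMMU page table format or the FW call's actual interface.

```c
#include <stdint.h>

#define PTES_PER_2M 512u	/* 2MB / 4KB */

/*
 * Hypothetical helper: fill a 512-entry table so that entry i maps the
 * i-th 4K page of the range the 2MB PDE used to map. pde_pfn is the
 * 4K-granular frame number of the start of that 2MB range, prot_bits
 * stands in for whatever attribute bits the real PTE format carries.
 */
static void fill_split_table(uint64_t pde_pfn, uint64_t prot_bits,
			     uint64_t table[PTES_PER_2M])
{
	for (unsigned int i = 0; i < PTES_PER_2M; i++)
		table[i] = ((pde_pfn + i) << 12) | prot_bits;
}
```

The table built this way maps exactly the same physical range as the old PDE, so swapping it in changes only the mapping granularity, not the translation result.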

That's a completely grotesque solution!

It violates all of our software layers. The IOMMU and RMP are not
controlled by the same software entity and you propose to have a FW
call that edits *both* together somehow? How is that even going to
work safely?

Can't you do things in a sequence?

Change the iommu from 2M to 4K, flush, then change the RMP from 2M to
4K?

Sure we could, unless there is ongoing DMA between "flush" and "then change";
that DMA will fail because of mismatching page sizes (the 2MB crossing thing
above).

If I get it right, for other platforms, the entire IOMMU table is
going to live in a secure space so there will be similar FW calls so
it is not that different.

At least on ARM the iommu S2 table is in secure memory and the secure FW
keeps it 1:1 with the KVM S2 table. So edits to the KVM automatically
make matching edits to the IOMMU. Only one software layer is
responsible for things.
Does KVM talk to the host IOMMU code for that (and then the IOMMU code calls
the secure world)?
Or does KVM go straight to that secure world?

Is the host IOMMU code aware of the content of the secure IOMMU table?

Does 2MB->4K smashing exist on ARM at all?

(I'll ask these on the IOMMUFD community call tomorrow too).

That is *very* different from saying that kvm or iommu has to go and
reach into the other subsystem and edit their in-memory structures.

Currently kvm has no idea about the iommu.

So if you want to make use of that you have to solve this fundamental
issue that we can't issue the FW call without some security
synchronization and locking between KVM and iommu.

This is true. Thanks,



Jason

--
Alexey

