On 20/1/26 04:37, Jason Gunthorpe wrote:
On Mon, Jan 19, 2026 at 12:00:47PM +1100, Alexey Kardashevskiy wrote:
On 18/1/26 02:43, Jason Gunthorpe wrote:
On Sat, Jan 17, 2026 at 03:54:52PM +1100, Alexey Kardashevskiy wrote:
I am trying this with TEE-IO on AMD SEV and hitting problems.
My understanding is that if you want to use SEV today you also have to
use the kernel command line parameter to force 4k IOMMU pages?
No, not only 4K. I do not enforce any page size by default, so it is
"everything but 512G". Only when the device is "accepted" do I unmap
everything in QEMU, "accept" the device, then map everything again,
but this time the IOMMU uses the (4K|2M) pagemask and takes RMP entry
sizes into account.
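Roughly, as a sketch (all helper names made up, not actual QEMU code):

	/* drop every existing DMA mapping for the device's IOAS */
	unmap_all(ioas);
	/* "accept" the TDI so it is bound into the TEE */
	tdi_accept(dev);
	/* remap with the page sizes capped to what RMP entries allow */
	remap_all(ioas, SZ_4K | SZ_2M);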
I mean, I'm telling you how things work in upstream right now. If you
want this to work you set the 4k only cmdline option and it
works. None of what you are describing is upstream. Upstream does not
support > 4K IOPTEs if RMP is used.
Ah, that. Well, even now, if you force swiotlb, the IOMMU should be able to
use huge pages. But ok, point taken.
Now, from time to time the guest will share 4K pages, which makes the
host OS smash the NPT's 2MB PDEs to 4K PTEs, and 2M RMP entries to 4K
RMP entries, and since the IOMMU performs RMP checks, the IOMMU PDEs
have to use the same granularity as the NPT and RMP.
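In other words, the usable IOMMU page size at any GPA is capped by the
NPT and RMP granularity there; illustratively (not a real kernel API):

	/* the IOPTE must not cover more than the smallest of the
	 * matching NPT and RMP entries, or the IOMMU's RMP check fails */
	static unsigned long iommu_pgsize_for_gpa(unsigned long npt_pgsize,
						  unsigned long rmp_pgsize)
	{
		return npt_pgsize < rmp_pgsize ? npt_pgsize : rmp_pgsize;
	}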
IMHO this is a bad hardware choice, it is going to make some very
troublesome software, so sigh.
afaik the Other OS is still not using 2MB pages (or does but not much?) and
runs on the same hw :)
Sure we can force some rules in Linux to make the sw simpler though.
I mean that the HW requires the sizes in multiple SW-controlled tables
to all match. Instead the HW should read all the tables and
compute the appropriate smallest size automatically.
Not sure I follow. The IOMMU table matches the QEMU page table, that is two
tables already, and the IOMMU cannot just blindly use 2M PTEs if the guest is
backed with 4K pages.
Doing it at mapping time doesn't seem right to me, AFAICT the RMP can
change dynamically whenever the guest decides to change the
private/shared status of memory?
The guest requests a page state conversion, which makes KVM change RMPs
and potentially smash huge pages; the guest only (in)validates the
RMP entry but does not change ASID+GPA+other bits, the host does. But
yeah, a race is possible here.
It is not even a "race", it is just something the VMM has to deal with
whenever the RMP changes.
My expectation for AMD was that the VMM would be monitoring the RMP
granularity and use cut or "increase/decrease page size" through
iommupt to adjust the S2 mapping so it works with these RMP
limitations.
Those don't fully exist yet, but they are in the plans.
I remember the talks about hitless smashing, but in the case of RMPs an
atomic xchg is not enough (we have a HW engine for that).
I don't think you need hitless here, if the guest is doing
encrypted/decrypted conversions then it can be expected to not do DMA
at the same time, or at least it is OK if DMA during this period
fails.
The guest converts only a handful of 4K pages (say, the guest userspace wants
to read certificates via guest-os->host-os->fw) and only that converted part
is not expected to see DMA; the rest of the 2MB page is DMA-able.
So long as the VMM gets a chance to fix the iommu before the guest
understands the RMP change is completed it would be OK.
The IOMMU HW needs to understand the change too. After I smash an IO PDE,
there is a small window before smashing the RMP entry when incoming traffic
may hit the not-yet-converted part of a 2MB page and the RMP check in the
IOMMU will fail. The HW+FW engine mentioned above can stall DMA for a few ms
while it is smashing things.
I'm assuming there is a VMM call involved here?
Yes.
It assumes that the VMM is continually aware of what all the RMP PTEs
look like and when they are changing so it can make the required
adjustments.
The flow would be something like (sketch below):
1) Create an IOAS
2) Create a HWPT. If there is some known upper bound on RMP/etc page
size then limit the HWPT page size to the upper bound
3) Map stuff into the ioas
4) Build the RMP/etc and map ranges of page granularity
5) Call iommufd to adjust the page size within ranges
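Steps 1 and 3 map onto today's uAPI; a minimal userspace sketch (error
handling elided; the page size cap in step 2 and the per-range adjust
in step 5 have no uAPI yet, so they appear only as comments):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	static int setup_ioas(int iommufd, void *buf, __u64 length, __u64 iova)
	{
		struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
		struct iommu_ioas_map map = {
			.size = sizeof(map),
			.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
				 IOMMU_IOAS_MAP_READABLE |
				 IOMMU_IOAS_MAP_WRITEABLE,
			.user_va = (__u64)(uintptr_t)buf,
			.length = length,
			.iova = iova,
		};

		ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);	/* 1) create an IOAS */
		map.ioas_id = alloc.out_ioas_id;
		ioctl(iommufd, IOMMU_IOAS_MAP, &map);		/* 3) map stuff */
		/* 2) would be IOMMU_HWPT_ALLOC against this IOAS, ideally
		 * with a page size upper bound; 5) would be a future call
		 * to shrink/grow IOPTEs within a range as the RMP changes */
		return alloc.out_ioas_id;
	}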
Say, I hotplug a device into a VM with a mix of 4K and 2M RMPs. QEMU
will ask iommufd to map everything (and that would be 2M/1G); should
QEMU then ask KVM to walk through the ranges and call iommufd directly
to make the IO PDEs/PTEs match the RMPs?
Yes, assuming it isn't already tracking it on its own.
I mean, I have to do the KVM->iommufd part anyway when 2M->4K
smashing happens at runtime, but the initial mapping could be simpler
if iommufd could check RMP.
Yeah, but then we have to implement two completely different
flows. You can't do without the above since you have to deal with
dynamic changes to the RMP by the guest.
Making it so map can happen right the first time is an
optimization. Let's get the basics and then think about optimizing. I
think optimizing hot plug is not important, nor do I know how good an
optimization this would even be.
Got it.
For the time being I bypass the IOMMU and make KVM call another FW+HW DMA
engine to smash IOPDEs.
I don't even want to know what that means :\ You can't change the
IOMMU page tables owned by linux from FW or you are creating bugs.
Oh but I can :) It is a FW call which takes a pointer to a 2MB IOPDE and a
new table of 4K PTEs filled with the old PDE's pfn plus offsets; the FW then
exchanges the old IOPDE for the new table and smashes the corresponding RMP
entry, suspending DMA while doing so.
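For illustration, the table handed to that FW call looks roughly like
this (make_iopte() is a stand-in, not a real helper):

	/* build the replacement table of 512 4K PTEs covering the old
	 * 2M IOPDE's range: the same pfns, just split */
	static void fill_split_table(__u64 *new_table, __u64 pde_pfn, __u64 prot)
	{
		for (int i = 0; i < 512; i++)
			new_table[i] = make_iopte(pde_pfn + i, prot);
		/* the FW then atomically exchanges the 2M IOPDE for this
		 * table, smashes the matching RMP entry and resumes DMA */
	}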
If I get it right, on other platforms the entire IOMMU table is going to live
in a secure space, so there will be similar FW calls; it is not that different.
ps. I am still curious about:
"btw just realized - does the code check that the folio_size
matches the IO pagesize? Or is batch_to_domain() expected to start a
new batch if the next page size is not the same as the previous? With
THP, we can have a mix of page sizes"
The batch has a linear chunk of consecutive physical addresses. It has
nothing to do with folios. The batch can start and end on any physical
address so long as all addresses within the range are contiguously
mapped.
The iommu mapping logic accepts contiguous physical ranges and breaks
them back down into IOPTEs. There is no direct relationship between
folio size and IOPTE construction.
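Something like this sketch of the mapping loop, which picks the largest
supported IOPTE size the alignment allows (map_one_iopte() is made up;
this is roughly what the iommu core's page size selection does):

	#include <stdint.h>

	static void map_range(uint64_t iova, uint64_t paddr, uint64_t len,
			      uint64_t pgsize_bitmap)
	{
		while (len) {
			/* largest power of two dividing iova, paddr and len */
			uint64_t align = 1ULL << __builtin_ctzll(iova | paddr | len);
			/* biggest supported page size not exceeding it */
			uint64_t sizes = pgsize_bitmap & (align | (align - 1));
			uint64_t pgsize = 1ULL << (63 - __builtin_clzll(sizes));

			map_one_iopte(iova, paddr, pgsize);
			iova += pgsize;
			paddr += pgsize;
			len -= pgsize;
		}
	}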
Ah right, pfn_reader_first/next take care of that contiguity. Never mind.
Thanks,
For example the iommufd selftest often has a scenario where it lucks
into maybe 16k of consecutive PFNs because that is just what the MM
does on a fresh boot. Even though they are actually 4k folios they
will be mapped into AMDv1's 16k IOPTE encoding.
Jason
--
Alexey