On 23/1/26 01:12, Jason Gunthorpe wrote:
On Thu, Jan 22, 2026 at 09:58:04PM +1100, Alexey Kardashevskiy wrote:
This issue with the RMP is no different: if you get a 2M IOPTE then
the HW should check the RMP and load a 4K IOPTE into the IOTLB if
that is what the RMP requires.
That the HW doesn't do that is why you have all these difficult
problems.
Got it. Interestingly, the HW actually does that, almost. For 2MB
(and larger) IO pages it checks whether RMP==2MB and, if so, installs
a 2MB IO TLB entry; for 4KB..1MB IO pages it installs a 4K IO TLB
entry with an RMP==4K check. But it does not cross the 2MB boundary
in the RMP. Uff :-/
Not sure I understand this limitation: how does any aligned size cross
a 2MB boundary?
Sorry, probably the wrong wording. SNP allows a guest page to be backed
only by a 4K or a 2M host page, and the IOMMU always rounds the IO page
size down to the nearest of the two. So 4M IO pages can work with a 2M
RMP but not with a 4K RMP.
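
To spell out how I read that rule, here is a purely illustrative
sketch (not real code; the names and constants are made up only for
the example):

#include <stdbool.h>
#include <stddef.h>

#define SZ_4K	0x1000UL
#define SZ_2M	0x200000UL

enum rmp_page_size { RMP_4K, RMP_2M };

/* Does a single IOTLB entry of io_page_size agree with the RMP state? */
static bool iotlb_entry_ok(size_t io_page_size, enum rmp_page_size rmp)
{
	/* The IOMMU rounds the IO page size down to 4K or 2M for the check */
	size_t checked = io_page_size >= SZ_2M ? SZ_2M : SZ_4K;

	if (checked == SZ_2M)
		return rmp == RMP_2M;	/* e.g. a 4M IO page + 2M RMP: ok */

	return rmp == RMP_4K;		/* 4K..1M IO pages expect RMP==4K */
}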
Sounds like it was thought about; is it a HW bug that some cases don't
work?
Nah, this is intentional, I just do not understand all the consequences
of allowing a 4K RMP to work with an 8MB IO page :)
On the other hand, without swiotlb, a dma_map() in the guest for an
untrusted device is likely to be a lot less than 2MB and is going to
share another handful of pages, and this activity is not that rare
compared to my certificates example. If only there were a way to
somehow bundle such allocations/mappings... :-/
ARM is pushing a thing where encrypt/decrypt has to work on certain
aligned granule sizes > PAGE_SIZE; you could use that mechanism to
select a 2M size for AMD too and avoid this.
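
If I understand the idea, on the DMA side it would look roughly like
this (completely hypothetical helper name, just to illustrate rounding
every shared allocation up to a platform conversion granule):

#include <stddef.h>

/*
 * Hypothetical: returns the minimum shared<->private conversion granule,
 * e.g. PAGE_SIZE on most platforms, 2M if AMD chose that to avoid RMP
 * smashing. Not a real kernel API.
 */
extern size_t mem_conversion_granule(void);

static size_t shared_alloc_len(size_t len)
{
	size_t g = mem_conversion_granule();

	/* Round the shared buffer up to the conversion granule */
	return (len + g - 1) & ~(g - 1);
}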
2M minimum on every DMA map?
That's a completely grotesque solution!
It violates all of our software layers. The IOMMU and RMP are not
controlled by the same software entity and you propose to have a FW
call that edits *both* together somehow? How is that even going to
work safely?
Can't you do things in a sequence?
Change the iommu from 2M to 4K, flush, then change the RMP from 2M to
4K?
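
Something like this ordering, with made-up helper names (the split and
RMP-smash wrappers below do not exist, this is only a sketch of the
sequence):

#include <linux/iommu.h>	/* struct iommu_domain, iommu_flush_iotlb_all() */
#include <linux/sizes.h>	/* SZ_4K, SZ_2M */

/* Hypothetical helpers, named only for this sketch */
int iommu_split_huge_mapping(struct iommu_domain *dom, unsigned long iova,
			     size_t from, size_t to);
int rmp_smash_2m(unsigned long paddr);

static int demote_2m_range(struct iommu_domain *dom, unsigned long iova,
			   unsigned long paddr)
{
	/* 1. Replace the 2M IOPTE with 512 x 4K IOPTEs */
	int ret = iommu_split_huge_mapping(dom, iova, SZ_2M, SZ_4K);

	if (ret)
		return ret;

	/* 2. Flush so no cached 2M IOTLB entry survives */
	iommu_flush_iotlb_all(dom);

	/* 3. Only now smash the 2M RMP entry into 4K (e.g. via PSMASH) */
	return rmp_smash_2m(paddr);
}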
Sure, we could, unless there is ongoing DMA between "flush" and "then
change", in which case that DMA will fail because of the mismatching
page sizes (that 2MB-crossing thing above).
I'm confused: if the IOMMU has 4K and the RMP has 2M, it doesn't work?
I have not tried this; IOMMU pages are usually the biggest on the AMD
platform, often 8MB.
Then why was I told the 4k page size kernel parameter fixes
everything?
Because the IOMMU becomes 4K-only and there is no huge page support in
confidential KVM yet (well, in upstream Linux), so a page size mismatch
cannot occur.
What happens if the guest puts 4K pages into its AMDv2 table and the
RMP is 2M?
Is this AMDv2 an NPT (then it is going to fail)? Or a nested IOMMU
(never tried it, it is in the works, I suspect failure)?
If I get it right, on other platforms the entire IOMMU table is going
to live in a secure space, so there will be similar FW calls, so it is
not that different.
At least on ARM, the IOMMU S2 table is in secure memory and the secure
FW keeps it 1:1 with the KVM S2 table. So edits to the KVM table
automatically make matching edits to the IOMMU table. Only one software
layer is responsible for things.
?
Does KVM talk to the host IOMMU code for that (and then the IOMMU code
calls the secure world)? Or does KVM go straight to that secure world?
Straight to the secure world; there is no host IOMMU driver for the
secure IOMMU.
Will QEMU try mapping all guest memory and call the host for this, or
won't it, on ARM? No IOMMUFD in this case? Always a guest-visible
IOMMU? Thanks,
Is the host IOMMU code aware of the content of the secure IOMMU table?
No, it isn't even aware it exists.
Does 2MB->4K smashing exist on ARM at all?
Every arch has cases where larger mappings need to be reduced to
smaller ones, but ARM doesn't require synchronized coordination
between multiple tables.
Jason
--
Alexey