On Mon, Jan 19, 2026 at 12:00:47PM +1100, Alexey Kardashevskiy wrote:
> On 18/1/26 02:43, Jason Gunthorpe wrote:
> > On Sat, Jan 17, 2026 at 03:54:52PM +1100, Alexey Kardashevskiy wrote:
> >
> > > I am trying this with TEE-IO on AMD SEV and hitting problems.
> >
> > My understanding is that if you want to use SEV today you also have to
> > use the kernel command line parameter to force 4k IOMMU pages?
>
> No, not only 4K. I do not enforce any page size by default so it is
> "everything but 512G", only when the device is "accepted" - I unmap
> everything in QEMU, "accept" the device, then map everything again
> but this time IOMMU uses the (4K|2M) pagemask and takes RMP entry
> sizes into account.
I mean, I'm telling you how things work in upstream right now. If you
want this to work you set the 4k only cmdline option and it works.

None of what you are describing is upstream. Upstream does not
support > 4K IOPTEs if RMP is used.

> > > Now, from time to time the guest will share 4K pages which makes the
> > > host OS smash NPT's 2MB PDEs to 4K PTEs, and 2M RMP entries to 4K
> > > RMP entries, and since the IOMMU performs RMP checks - IOMMU PDEs
> > > have to use the same granularity as NPT and RMP.
> > IMHO this is a bad hardware choice, it is going to make some very
> > troublesome software, so sigh.
> afaik the Other OS is still not using 2MB pages (or does but not much?) and
> runs on the same hw :)
>
> Sure we can force some rules in Linux to make the sw simpler though.

I mean that the HW requires the sizes in multiple SW controlled tables
to all be matched. Instead the HW should read all the tables and
compute the appropriate smallest size automatically.

> > Doing it at mapping time doesn't seem right to me, AFAICT the RMP can
> > change dynamically whenever the guest decides to change the
> > private/shared status of memory?
>
> The guest requests page state conversion which makes KVM change RMPs
> and potentially smash huge pages, the guest only (in)validates the
> RMP entry but does not change ASID+GPA+otherbits, the host does. But
> yeah a race is possible here.

It is not even a "race", it is just something the VMM has to deal with
whenever the RMP changes.

> > My expectation for AMD was that the VMM would be monitoring the RMP
> > granularity and use cut or "increase/decrease page size" through
> > iommupt to adjust the S2 mapping so it works with these RMP
> > limitations.
> >
> > Those don't fully exist yet, but they are in the plans.
> I remember the talks about hitless smashing but in case of RMPs atomic xchg
> is not enough (we have a HW engine for that).

I don't think you need hitless here, if the guest is doing
encrypted/decrypted conversions then it can be expected to not do DMA
at the same time, or at least it is OK if DMA during this period
fails. So long as the VMM gets a chance to fix the iommu before the
guest understands the RMP change is completed it would be OK. I'm
assuming there is a VMM call involved here?

> > It assumes that the VMM is continually aware of what all the RMP PTEs
> > look like and when they are changing so it can make the required
> > adjustments.
> >
> > The flow would be something like..
> > 1) Create an IOAS
> > 2) Create a HWPT. If there is some known upper bound on RMP/etc page
> > size then limit the HWPT page size to the upper bound
> > 3) Map stuff into the ioas
> > 4) Build the RMP/etc and map ranges of page granularity
> > 5) Call iommufd to adjust the page size within ranges
> Say, I hotplug a device into a VM with a mix of 4K and 2M RMPs. QEMU
> will ask iommufd to map everything (and that would be 2M/1G), should
> then QEMU ask KVM to walk through ranges and call iommufd directly
> to make IO PDEs/PTEs match RMPs?

Yes, assuming it isn't already tracking it on its own.

> I mean, I have to do the KVM->iommufd part anyway when 2M->4K
> smashing happens in runtime but the initial mapping could be simpler
> if iommufd could check RMP.

Yeah, but then we have to implement two completely different flows.
You can't do without the above since you have to deal with dynamic
changes to the RMP by the guest. Making it so map can happen right the
first time is an optimization. Let's get the basics and then think
about optimizing.
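Very roughly, and skipping over the device attach entirely, I'd
picture the VMM side of that 1-5 flow like the sketch below. Treat it
as pseudo-code: vmm_setup_s2 is a made-up name, the struct layouts
are written from memory (check the current <linux/iommufd.h>), and
the step 5 ioctl is exactly the piece that does not exist yet.

/*
 * Sketch of steps 1-5 from the VMM side.  Not drop-in code: field
 * names are from memory and the step 5 ioctl does not exist yet.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int vmm_setup_s2(int iommufd, void *guest_ram, uint64_t ram_size)
{
	/* 1) Create an IOAS to hold the guest physical address map */
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };

	if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc))
		return -1;

	/*
	 * 2) The HWPT is allocated when the device is attached
	 *    (IOMMU_HWPT_ALLOC).  If there is a known upper bound on the
	 *    RMP granularity this is where the HWPT page size would be
	 *    capped - there is no uAPI for such a cap today.
	 */

	/* 3) Map all of guest RAM, GPA == IOVA */
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
			 IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = alloc.out_ioas_id,
		.user_va = (uintptr_t)guest_ram,
		.length = ram_size,
		.iova = 0,
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		return -1;

	/* 4) RMP setup is done through KVM/PSP, iommufd is not involved */

	/*
	 * 5) Whenever KVM reports that an RMP entry changed granularity
	 *    (eg 2M smashed to 4K), call the future page-size adjust/cut
	 *    ioctl on the affected IOVA range so the IOPTEs never exceed
	 *    the RMP size.
	 */
	return 0;
}

The point being that iommufd never has to know why the page size
changed, the VMM just tells it the new upper bound for the affected
range.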
I think optimizing hot plug is not important, nor do I know how good
an optimization this would even be.

> For the time being I do bypass IOMMU and make KVM call another FW+HW DMA
> engine to smash IOPDEs.

I don't even want to know what that means :\
You can't change the IOMMU page tables owned by linux from FW or you
are creating bugs.

> ps. I am still curious about:
>
> > btw just realized - does the code check that the folio_size
> > matches IO pagesize? Or batch_to_domain() is expected to start a
> > new batch if the next page size is not the same as previous? With
> > THP, we can have a mix of page sizes"

The batch has a linear chunk of consecutive physical addresses. It has
nothing to do with folios. The batch can start and end on any physical
address so long as all addresses within the range are contiguously
mapped.

The iommu mapping logic accepts contiguous physical ranges and breaks
them back down into IOPTEs. There is no direct relationship between
folio size and IOPTE construction.

For example the iommufd selftest often has a scenario where it lucks
into maybe 16k of consecutive PFNs because that is just what the MM
does on a fresh boot. Even though they are actually 4k folios they
will be mapped into AMDv1's 16k IOPTE encoding.
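If it helps, the IOPTE size selection is purely a function of
alignment, remaining length and the page sizes the page table format
supports. Something like this little user-space rendition (an
illustration only, pick_iopte_size is a made-up name, not the actual
mapping code):

/*
 * Illustration only: how a physically contiguous range is chopped
 * into IOPTEs using just alignment, length and the domain's supported
 * page size bitmap.  Folio sizes never enter into it.
 */
#include <stdio.h>

/* Biggest supported IOPTE that fits at this address for this length */
static unsigned long long pick_iopte_size(unsigned long long addr,
					  unsigned long long len,
					  unsigned long long pgsize_bitmap)
{
	/* All power-of-two sizes <= the remaining length */
	unsigned long long allowed = (2ULL << (63 - __builtin_clzll(len))) - 1;

	/* ... and <= the alignment of the address */
	if (addr)
		allowed &= (addr & -addr) * 2 - 1;

	/* ... and supported by the page table format */
	allowed &= pgsize_bitmap;

	/* Pick the biggest remaining size */
	return 1ULL << (63 - __builtin_clzll(allowed));
}

int main(void)
{
	/* eg an AMDv1-like format: every power of two from 4k up */
	unsigned long long bitmap = ~0xFFFULL;

	/* 16k of consecutive PFNs made of 4k folios, 16k aligned */
	unsigned long long addr = 0x104000, len = 0x4000;

	while (len) {
		unsigned long long sz = pick_iopte_size(addr, len, bitmap);

		printf("IOPTE at 0x%llx size 0x%llx\n", addr, sz);
		addr += sz;
		len -= sz;
	}
	return 0;
}

Feed it the 16k-of-4k-folios case from the selftest and it picks a
single 16k IOPTE, the folio size never enters into it.

Jason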
