On 2025/6/17 20:37, Jason Gunthorpe wrote:
On Mon, Jun 16, 2025 at 08:14:27PM -0700, Nicolin Chen wrote:
On Mon, Jun 16, 2025 at 08:15:11AM +0000, Duan, Zhenzhong wrote:
IIUC, the guest kernel cmdline can switch the mode between
stage1 (nesting) and stage2 (legacy/emulated VT-d), right?
Right. E.g., kexec from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off":
the first kernel will run in scalable mode and use stage1 (nesting), and
the second kernel will run in legacy mode and use stage2.
In scalable mode, guest kernel has a stage1 (nested) domain and
host kernel has a stage2 (nesting parent) domain. In this case,
the VFIO container IOAS could be the system AS corresponding to
the kernel-managed stage2 domain.
In legacy mode, guest kernel has a stage2 (normal) domain while
host kernel has a stage2 (shadow) domain? In this case, the VFIO
container IOAS should be the iommu AS corresponding to the kernel
guest-level stage2 domain (or should it be shadow)?
What you want is to disable HW support for legacy mode in qemu so the
kernel rejects sm_off operation.
That can be done in the future. :)
The HW spec is really goofy, we get an ecap_slts but it only applies
to a PASID table entry (scalable mode). So the HW has to support
second stage for legacy always but can turn it off for PASID?
Yes. Legacy mode (page tables following the second-stage format) is
supported anyhow.
IMHO the intention was to allow the VMM to not support shadowing, but
it seems the execution was mangled.
I suggest fixing the Linux driver to refuse to run in sm_on mode if
the HW supports scalable mode and ecap_slts = false. That may not be
100% spec compliant but it seems like a reasonable approach.
Running sm_on with only ecap_flts==true is what we want here. We want
the guest to use a stage-1 page table so that it can be used by hw under
the nested translation mode, and this page table is only available in
sm_on mode.
If we want to drop the legacy mode usage in virtualization environments, we
might let the linux iommu driver refuse to run in legacy mode when ecap_slts
is false. I suppose HW is going to advertise both ecap_slts and ecap_flts, so
this will only stop guests from using legacy mode.
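
To make the rule concrete, here is a rough sketch of the mode selection
described above. It is only an illustration: struct vtd_caps, enum vtd_mode
and vtd_pick_mode() are hypothetical names, not the actual intel-iommu driver
code, and the three booleans just stand for the scalable-mode, ecap_flts and
ecap_slts capability bits as they are discussed in this thread.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the capability bits discussed in this thread. */
struct vtd_caps {
    bool smts; /* scalable mode is supported */
    bool flts; /* ecap_flts: first-stage translation */
    bool slts; /* ecap_slts: second-stage translation */
};

enum vtd_mode { VTD_MODE_REFUSE, VTD_MODE_LEGACY, VTD_MODE_SCALABLE };

/*
 * sm_on reflects the guest cmdline ("intel_iommu=on,sm_on" vs. "sm_off").
 * Assumed policy: on scalable-mode capable HW that does not advertise
 * ecap_slts, legacy mode is refused instead of silently used.
 */
static enum vtd_mode vtd_pick_mode(const struct vtd_caps *caps, bool sm_on)
{
    if (sm_on && caps->smts) {
        /* Scalable mode needs at least one usable translation stage. */
        return (caps->flts || caps->slts) ? VTD_MODE_SCALABLE
                                          : VTD_MODE_REFUSE;
    }

    /* Legacy mode was requested, or scalable mode is not available. */
    if (caps->smts && !caps->slts)
        return VTD_MODE_REFUSE;

    return VTD_MODE_LEGACY;
}
```

With a vIOMMU that advertises only ecap_flts, an sm_on guest would get
scalable mode, while an sm_off guest (e.g. after the kexec case above) would
be refused rather than falling back to legacy mode.
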
But this is not necessary so far. As the discussion is going here, we intend
to reuse the GPA HWPT allocated by the VFIO container as well.[1] This is now
aligned with Nic and Shameer.
[1]
https://lore.kernel.org/qemu-devel/b3d31287-4de5-4e0e-a81b-99f82edd5...@intel.com/
The ARM model that Shameer is proposing only allows a nested SMMU
when such a legacy mode is off. This simplifies a lot of things.
But the difficulty of the VT-d model is that it has to rely on a
guest bootcmd during runtime..
ARM is cleaner because it doesn't have these driver issues. qemu can
reliably say not to use the S2 and all the existing guest kernels will
obey that.
Out of curiosity, does SMMU have a legacy mode, or does a given version
of SMMU support only one of legacy mode and the newer mode?
AMD has the same issues, BTW, arguably even worse as I didn't notice
any way to specify if the v1 page table is supported :\
Jason
--
Regards,
Yi Liu