On Tue, 12 May 2026 12:25:45 -0500
Tushar Dave <[email protected]> wrote:

> On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> > Hello Tushar,
> > 
> > On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:  
> >> This RFC introduces a mechanism to specify Guest Physical Addresses
> >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> >> addresses to match host physical addresses for assigned devices.
> >>
> >> On some platforms, P2P DMA is performed between devices within the same
> >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> >> without going through the host bridge in order to achieve the required
> >> performance.
> >>
> >> To support this multi-device IOMMU group P2P scenario in virtualization,
> >> the VM may need to use the same MMIO BAR addresses as the host physical
> >> address layout.
> >>  
> > 
> > Did you consider implementing this using Enhanced Allocation (EA)? If so,
> > could you explain why it is not suitable here?  
> 
> I have not evaluated EA for this design. When I looked at EDK2, I
> chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
> BAR programming established by the hypervisor — at the cost of QEMU
> performing PCI bus number and resource assignment.
> 
> I did a quick search and do not see EA support in EDK2. Any pointers
> to EA being used in a similar fashion to achieve fixed BAR placement
> would be appreciated.

EA wasn't on my radar either, but I did some research and chatted with
Tushar and I think it could work.  I'll sketch out a rough idea of what
it might look like.

EA describes BAR equivalents (fixed base address, size, and type) in a
separate capability while the corresponding device BAR registers appear
unimplemented.  Linux already consumes endpoint EA capabilities and
marks the resulting resources IORESOURCE_PCI_FIXED.  EDK2 doesn't know
about EA (cap 0x14 isn't defined anywhere in MdePkg, and PciBusDxe
never consults it afaict), but that turns out to be useful here rather
than a problem.
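
For anyone who hasn't looked at the capability, here is a rough Python
sketch of decoding a single EA entry from raw config-space bytes.  The
field masks follow my reading of the EA entry layout and of what
Linux's EA parsing does; the function and constant names are mine,
invented for illustration, and it skips the capability header and the
entry-count handling.

```python
import struct

PCI_CAP_ID_EA = 0x14        # Enhanced Allocation capability ID
EA_ES_MASK = 0x7            # entry size: DWORDs following the first one
EA_ENABLE = 1 << 31         # entry enable bit
EA_IS_64 = 0x2              # bit 1 of the low base / max-offset DWORD
EA_FIELD_MASK = 0xFFFFFFFC  # bits 31:2 carry the address bits


def decode_ea_entry(raw: bytes) -> dict:
    """Decode one EA entry (hypothetical helper, little-endian input)."""
    (hdr,) = struct.unpack_from("<I", raw, 0)
    ndw = hdr & EA_ES_MASK
    if ndw < 2:
        raise ValueError("entry too short to hold base and max offset")
    dws = struct.unpack_from("<%dI" % ndw, raw, 4)

    base = dws[0] & EA_FIELD_MASK
    # The low two bits of the max offset are treated as ones.
    max_off = dws[1] | 0x3
    idx = 2
    if dws[0] & EA_IS_64:       # upper 32 bits of the base follow
        base |= dws[idx] << 32
        idx += 1
    if dws[1] & EA_IS_64:       # upper 32 bits of the max offset follow
        max_off |= dws[idx] << 32

    return {
        "bei": (hdr >> 4) & 0xF,     # BAR Equivalent Indicator
        "props": (hdr >> 8) & 0xFF,  # primary properties
        "enabled": bool(hdr & EA_ENABLE),
        "base": base,
        "size": max_off + 1,
    }


# A 16 MiB region at 0x2_0000_0000 standing in for BAR 0.
entry = struct.pack("<4I", 0x80000003, 0x2, 0x00FFFFFC, 0x2)
print(decode_ea_entry(entry))
```

The point is only that base and size come straight out of the
capability, with nothing to size or program.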

Starting at the QEMU device, for a vfio-pci device we'd need to
virtualize the real BARs as unimplemented and surface that information
via a synthesized EA capability instead.  Still open for debate:
whether this is a generic PCI mechanism or vfio-pci specific, whether
the HPA is automatically used as the base address for vfio-pci devices
or is user-specified, and where the capability sits in config space.
None of those fundamentally changes the shape of the flow.

For the absolute bare-minimum level of support (EA device on the root
complex, EA resources don't overlap the VM address space or MMIO range,
EDK2 firmware, Linux guest booted with pci=nocrs) I think this actually
works with just adding the EA capability above.  Let's walk through
those constraints and how we relax them.

At the firmware level we lean on the real BAR registers being
unimplemented for EA devices, so EDK2 allocates no MMIO or IO resources
for them.  Only bus numbers get assigned if the EA device sits in a PCI
hierarchy.  That's exactly what we want: EDK2 does conventional bus
assignment but stays out of the EA resource flow entirely.

Instead of firmware EA enlightenment we lean on the guest OS.  Linux
reads endpoint EA today, but the bridge aperture sizing path ignores
those fixed resources.  As Tushar's series demonstrates, generically
handling mixed "fixed-BAR" and programmable-BAR devices in one
hierarchy is hard.  An incremental Linux enhancement that greatly
simplifies the problem space would be to program bridge apertures only
for hierarchies consisting entirely of fixed resources.  The math
becomes trivial (window spans min..max of fixed children, aligned to
bridge granularity), and there's no regression risk because these
hierarchies currently fail silently: the sizer ignores fixed children
and the fixed-claim walk-up finds no containing parent.  This enhancement,
plus the homogeneous-hierarchy constraint, removes the root-complex
constraint and lets us mirror the bare-metal topologies we need.
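
To show how small the window math gets for an all-fixed hierarchy,
here is a minimal Python sketch.  It is not kernel code; the function
name is hypothetical, it assumes the 1 MiB granularity of conventional
bridge memory windows, and it ignores the split between prefetchable
and non-prefetchable windows.

```python
MEM_WINDOW_ALIGN = 1 << 20  # assumed bridge memory window granularity


def fixed_bridge_window(children, align=MEM_WINDOW_ALIGN):
    """Return the inclusive (base, limit) window covering fixed children.

    children is an iterable of inclusive (start, end) ranges taken from
    the EA entries below the bridge.  Returns None if there are none.
    """
    children = list(children)
    if not children:
        return None
    lowest = min(start for start, _ in children)
    highest = max(end for _, end in children)
    base = lowest & ~(align - 1)   # round the base down
    limit = highest | (align - 1)  # round the limit up
    return base, limit


window = fixed_bridge_window([
    (0x80180000, 0x8018FFFF),
    (0x80480000, 0x804FFFFF),
])
print([hex(x) for x in window])
```

Applying this bottom-up, with each bridge's resulting window treated as
a fixed child of its parent, would cover a multi-level hierarchy.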

Resource ranges are a bit messier.  The extent of the EA device ranges
could be determined in QEMU and the VM address map adjusted to prevent
overlap.  Tushar already has a similar user-specified machine option in
this series.  That range also needs to reach the guest as a CRS (to
avoid pci=nocrs) but needs to stay distinct from the DT range passed to
EDK2 for programmable BAR devices so EDK2 won't place a programmable
BAR or bridge window into the EA region.  So long as we keep EA and
programmable devices in separate hierarchies, EDK2 only needs the
programmable range via DT and we can add the EA range as additional CRS
ranges visible only to the guest.
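
As a sketch of the bookkeeping only (not QEMU code; the function name
and example addresses are made up), carving the EA ranges out of a
candidate window could look like this.  Whatever remains is what could
still be used for RAM or handed to EDK2 as the programmable range,
while the holes themselves would become the extra CRS ranges.

```python
def carve_holes(window, holes):
    """Subtract inclusive hole ranges from an inclusive window.

    Returns the list of inclusive ranges that remain, in address order.
    Holes may overlap each other or extend outside the window.
    """
    start, end = window
    remaining = []
    cursor = start
    for h_start, h_end in sorted(holes):
        if h_end < cursor or h_start > end:
            continue  # hole is entirely outside what is left
        if h_start > cursor:
            remaining.append((cursor, h_start - 1))
        cursor = max(cursor, h_end + 1)
        if cursor > end:
            break
    if cursor <= end:
        remaining.append((cursor, end))
    return remaining


ea_ranges = [(0x1000, 0x1FFF), (0x8000, 0x8FFF)]
for lo, hi in carve_holes((0x0, 0xFFFF), ea_ranges):
    print(hex(lo), hex(hi))
```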

In practice, EDK2 programs all the programmable devices and the EA
devices live entirely in the additional CRS.  A possibly cleaner
alternative is additional PXB host bridges for the EA devices, each
with its own CRS.  That sidesteps the DT/CRS split entirely since the
EA PXB has nothing for EDK2 to allocate anyway.

If we agree that homogeneous hierarchies (no mixing of EA and
programmable BARs) are a reasonable constraint, and possibly extend that
to homogeneous per host bridge to simplify the CRS mapping, we have the
following work items:

 * Extend Linux EA support to program bridge apertures for subordinate
   homogeneous EA hierarchies.

 * Develop options to virtualize programmable BARs as EA for vfio-pci
   devices, or generically for all devices to ease testing.

 * Implement a way to poke holes in the VM address space and plumb
   through to account for addresses used by EA devices.

 * Provide those same ranges to the guest via CRS (but not via DT to
   EDK2), or alternatively expose them through additional PXB host
   bridges.
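
Since all of the items above depend on the homogeneity constraint,
here is a rough Python sketch of how a hierarchy could be classified.
The tree representation (dicts with "ea" and "children" keys) is
purely hypothetical, and it assumes bridges carry no BARs of their own.

```python
def classify(node):
    """Classify the hierarchy at node as 'ea', 'programmable',
    'mixed', or 'empty' (a bridge with nothing below it)."""
    if "children" not in node:  # endpoint
        return "ea" if node["ea"] else "programmable"
    kinds = {classify(child) for child in node["children"]} - {"empty"}
    if not kinds:
        return "empty"
    if len(kinds) == 1:
        return kinds.pop()  # propagates 'mixed' from below as well
    return "mixed"


ea_port = {"children": [{"ea": True}, {"children": [{"ea": True}]}]}
mixed_port = {"children": [{"ea": True}, {"ea": False}]}
print(classify(ea_port), classify(mixed_port))
```

Only hierarchies classified as 'ea' would get the fixed aperture
treatment and the additional CRS range; 'mixed' would be rejected or
left to today's behavior.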

Does that shape roughly seem accurate?  Are there additional gaps I've
missed?  Thanks,

Alex

