On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
> On Tue, 12 May 2026 12:25:45 -0500
> Tushar Dave <[email protected]> wrote:
> 
> > On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> > > Hello Tushar,
> > > 
> > > On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:  
> > >> This RFC introduces a mechanism to specify Guest Physical Addresses
> > >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> > >> addresses to match host physical addresses for assigned devices.
> > >>
> > >> On some platforms, P2P DMA is performed between devices within the same
> > >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> > >> without going through the host bridge in order to achieve the required
> > >> performance.
> > >>
> > >> To support this multi-device IOMMU group P2P scenario in virtualization,
> > >> the VM may need to use the same MMIO BAR addresses as the host physical
> > >> address layout.
> > >>  
> > > 
> > > Did you consider implementing this using Enhanced Allocation (EA)? If so,
> > > could you explain why it is not suitable here?  
> > 
> > I have not evaluated EA for this design. When I looked at EDK2, I
> > chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
> > BAR programming established by the hypervisor — at the cost of QEMU
> > performing PCI bus number and resource assignment.
> > 
> > I did a quick search and do not see EA support in EDK2. Any pointers
> > to EA being used in a similar fashion to achieve fixed BAR placement
> > would be appreciated.
> 
> EA wasn't on my radar either, but I did some research and chatted with
> Tushar and I think it could work.  I'll sketch out a rough idea of what
> it might look like.
> 
> EA describes BAR equivalents (fixed base address, size, and type) in a
> separate capability while the corresponding device BAR registers appear
> unimplemented.  Linux already consumes endpoint EA capabilities and
> marks the resulting resources IORESOURCE_PCI_FIXED.  EDK2 doesn't know
> about EA (cap 0x14 isn't defined anywhere in MdePkg, and PciBusDxe
> never consults it afaict), but that turns out to be useful here rather
> than a problem.
> 
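For readers not familiar with EA, here is a minimal sketch of decoding the
first dword of one EA entry. The mask values are intended to mirror the
PCI_EA_* definitions in Linux's pci_regs.h; the struct and function names
are hypothetical and only for illustration. A BEI of 0 to 5 identifies the
BAR slot that the entry stands in for.

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks for the first dword of an EA entry (assumed to match the
 * PCI_EA_* values in Linux's include/uapi/linux/pci_regs.h). */
#define EA_ES       0x00000007u /* dwords following this one */
#define EA_BEI      0x000000f0u /* BAR Equivalent Indicator */
#define EA_PP       0x0000ff00u /* primary properties (resource type) */
#define EA_WRITABLE 0x40000000u /* base/offset fields are writable */
#define EA_ENABLE   0x80000000u /* entry is enabled */

/* Hypothetical container for the decoded header fields. */
struct ea_entry_hdr {
    unsigned entry_bytes;   /* total entry length, header included */
    unsigned bei;           /* 0..5 correspond to BAR0..BAR5 */
    unsigned primary_prop;
    bool writable;
    bool enabled;
};

static struct ea_entry_hdr ea_decode_hdr(uint32_t dw0)
{
    struct ea_entry_hdr h;

    /* ES counts the dwords after the header, so add one for the header. */
    h.entry_bytes = ((dw0 & EA_ES) + 1) * 4;
    h.bei = (dw0 & EA_BEI) >> 4;
    h.primary_prop = (dw0 & EA_PP) >> 8;
    h.writable = (dw0 & EA_WRITABLE) != 0;
    h.enabled = (dw0 & EA_ENABLE) != 0;
    return h;
}
```

A synthesized capability for a vfio-pci device would presumably emit one
such entry per real BAR, with the writable bit clear so the guest treats
the base as fixed.
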
> Starting at the QEMU device, for a vfio-pci device we'd need to
> virtualize the real BARs as unimplemented and surface that information
> via a synthesized EA capability instead.  It's debatable whether this
> is a generic PCI mechanism or vfio-pci specific, whether HPA is
> automatically used as the base address for vfio-pci devices or
> user-specified, and the capability offset in config space.  None of
> those fundamentally change the shape of the flow.
> 
> For the absolute bare-minimum level of support (EA device on the root
> complex, EA resources don't overlap the VM address space or MMIO range,
> EDK2 firmware, Linux guest booted with pci=nocrs) I think this actually
> works with just adding the EA capability above.  Let's walk through
> those constraints and how we relax them.
> 
> At the firmware level we lean on the real BAR registers being
> unimplemented for EA devices, so EDK2 allocates no MMIO or IO resources
> for them.  Only bus numbers get assigned if the EA device sits in a PCI
> hierarchy.  That's exactly what we want, EDK2 doing conventional bus
> assignment but staying out of the EA resource flow entirely.
> 
> Instead of firmware EA enlightenment we lean on the guest OS.  Linux
> reads endpoint EA today, but the bridge aperture sizing path ignores
> those fixed resources.  As Tushar's series demonstrates, generically
> handling mixed "fixed-BAR" and programmable-BAR devices in one
> hierarchy is hard.  An incremental Linux enhancement that greatly
> simplifies the problem space would be to program bridge apertures only
> for hierarchies consisting entirely of fixed resources.  The math
> becomes trivial (window spans min..max of fixed children, aligned to
> bridge granularity), and there's no regression risk, since these
> hierarchies currently fail silently: the sizer ignores fixed children
> and the fixed-claim walk-up finds no containing parent.  This enhancement,
> plus the homogeneous-hierarchy constraint, removes the root-complex
> constraint and lets us mirror the bare-metal topologies we need.
> 
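The "min..max of fixed children, aligned to bridge granularity" math can be
sketched as below. This is only an illustration and not the Linux resource
code: the struct, the function name, and the use of inclusive end addresses
are assumptions, and the granularity is passed in by the caller (1 MiB for a
conventional bridge memory window).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range with an inclusive end address. */
struct fixed_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Compute the smallest bridge window that covers every fixed child,
 * rounded outward to the bridge granularity (must be a power of two).
 * Returns false if there are no children or the granularity is invalid.
 */
static bool fixed_bridge_window(const struct fixed_range *child, size_t n,
                                uint64_t gran, struct fixed_range *win)
{
    uint64_t lo, hi;
    size_t i;

    if (n == 0 || gran == 0 || (gran & (gran - 1)) != 0)
        return false;

    lo = child[0].start;
    hi = child[0].end;
    for (i = 1; i < n; i++) {
        if (child[i].start < lo)
            lo = child[i].start;
        if (child[i].end > hi)
            hi = child[i].end;
    }

    win->start = lo & ~(gran - 1);  /* round the base down */
    win->end = hi | (gran - 1);     /* round the limit up */
    return true;
}
```

Since every child is fixed, there is nothing to size or reorder; the window
either fits inside the parent's window or the hierarchy cannot be claimed.
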
> Resource ranges are a bit messier.  The extent of the EA device ranges
> could be determined in QEMU and the VM address map adjusted to prevent
> overlap.  Tushar already has a similar user-specified machine option in
> this series.  That range also needs to reach the guest as a CRS (to
> avoid pci=nocrs) but needs to stay distinct from the DT range passed to
> EDK2 for programmable BAR devices so EDK2 won't place a programmable
> BAR or bridge window into the EA region.  So long as we keep EA and
> programmable devices in separate hierarchies, EDK2 only needs the
> programmable range via DT and we can add the EA range as additional CRS
> ranges visible only to the guest.
> 
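The overlap requirement above amounts to a check like the following sketch,
run against whatever the VM address map already contains (RAM, the
programmable MMIO window handed to EDK2, and so on). The names are
hypothetical and this is not existing QEMU code; ranges use inclusive end
addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest physical address range with an inclusive end. */
struct gpa_range {
    uint64_t start;
    uint64_t end;
};

/* Two inclusive ranges overlap if each one starts before the other ends. */
static int gpa_ranges_overlap(struct gpa_range a, struct gpa_range b)
{
    return a.start <= b.end && b.start <= a.end;
}

/*
 * Return the index of the first existing map entry that collides with
 * the EA range, or -1 if the EA range is free of conflicts.
 */
static int ea_first_conflict(struct gpa_range ea,
                             const struct gpa_range *map, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (gpa_ranges_overlap(ea, map[i]))
            return (int)i;
    }
    return -1;
}
```

A conflict would mean either moving the colliding region or punching a hole
in it before the EA range is advertised to the guest.
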
> In practice, EDK2 programs all the programmable devices and the EA
> devices live entirely in the additional CRS.  A possibly cleaner
> alternative is additional PXB host bridges for the EA devices, each
> with its own CRS.  That sidesteps the DT/CRS split entirely since the
> EA PXB has nothing for EDK2 to allocate anyway.
> 
> If we agree that homogeneous hierarchies (no mixing of EA and
> programmable BARs) is a reasonable constraint, and possibly extend that
> to homogeneous per host bridge to simplify the CRS mapping, we have the
> following work items:
> 
>  * Extend Linux EA support to program bridge apertures for subordinate
>    homogeneous EA hierarchies.
> 
>  * Develop options to virtualize programmable BARs as EA for vfio-pci
>    devices, if not generically for the benefit of testing.
> 
>  * Implement a way to poke holes in the VM address space and plumb
>    through to account for addresses used by EA devices.
> 
>  * Provide those same ranges to the guest via CRS (but not via DT to
>    EDK2), or alternatively expose them through additional PXB host
>    bridges.
> 
> Does that shape roughly seem accurate?  Are there additional gaps I've
> missed?  Thanks,
> 
> Alex


Just one question: why not do this in firmware, so that Windows is
conceivably handled as well?




