Marcel,
I see two requirements from your mail:
1. Non-contiguous resources: root bridge #1 uses [2G, 2.4G) and [2.8G, 3G)
while root bridge #2 uses [2.4G, 2.8G).
2. Sharable resources among root bridges: all root bridges in the same PCI
segment can share one common range of resources.
Requirement #1 is not supported by the MdeModulePkg/PciBus driver, and I guess
it is not an urgent requirement and does not block the OVMF PciHostBridge porting.
Requirement #2 can be interpreted as: it is valid for the resources claimed by
different root bridges to overlap, no matter which segment they belong to.
For example, root bridge #1 claims [2G, 2.4G) while root bridge #2 claims
[2.2G, 2.6G), so [2.2G, 2.4G) is shared by both root bridges.
In such a case, PCI devices under root bridge #1 can only use resources in
[2G, 2.4G) and devices under root bridge #2 can only use [2.2G, 2.6G). The GCD
services guarantee there is no resource conflict: if [2.2G, 2.3G) is used by a
device under root bridge #1, it will not be used by a device under root
bridge #2.
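Just to illustrate that guarantee, here is a toy sketch (not OVMF or
MdeModulePkg code; the function name and addresses are made up, only the gDS
GCD services are real). It assumes the shared window has already been added to
GCD as MMIO space, and shows that the same sub-range cannot be allocated twice:

  #include <PiDxe.h>
  #include <Library/DxeServicesTableLib.h>       // gDS
  #include <Library/UefiBootServicesTableLib.h>  // gImageHandle
  #include <Library/DebugLib.h>

  VOID
  GcdOverlapIllustration (
    VOID
    )
  {
    EFI_STATUS            Status;
    EFI_PHYSICAL_ADDRESS  Base;

    //
    // Say a 16MB BAR for a device under root bridge #1 is placed at 2.25G
    // (0x90000000), inside the shared [2.2G, 2.4G) window.
    //
    Base   = 0x90000000;
    Status = gDS->AllocateMemorySpace (
                    EfiGcdAllocateAddress,
                    EfiGcdMemoryTypeMemoryMappedIo,
                    0,
                    SIZE_16MB,
                    &Base,
                    gImageHandle,
                    NULL
                    );
    ASSERT_EFI_ERROR (Status);      // first claim succeeds

    //
    // A second claim of the same range, e.g. on behalf of root bridge #2,
    // fails, so overlapping apertures never produce overlapping assignments.
    //
    Status = gDS->AllocateMemorySpace (
                    EfiGcdAllocateAddress,
                    EfiGcdMemoryTypeMemoryMappedIo,
                    0,
                    SIZE_16MB,
                    &Base,
                    gImageHandle,
                    NULL
                    );
    ASSERT (EFI_ERROR (Status));    // same range cannot be claimed twice
  }

In the real driver the allocation would of course be search-based within each
root bridge's aperture rather than at a hard-coded address; the point is only
that GCD is the single arbiter.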
An extreme case is when both root bridges claim [2G, 3G), which is the OVMF case.
So the change to PciHostBridgeDxe can be (a rough sketch of both steps follows
this list):
1. Check whether the resources claimed by the root bridges are already added,
and call AddMemorySpace/AddIoSpace for those resource ranges which haven't
been added yet.
2. Call AllocateMemorySpace/AllocateIoSpace to occupy these resources in GCD.
The allocation shouldn't fail; otherwise it's a fatal error and the
PciHostBridgeDxe driver will assert and exit.
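For reference, here is a minimal sketch of what I have in mind for one MMIO
range (the helper name and the per-range handling are invented for
illustration, not existing driver code; only the gDS calls are the standard
DXE GCD services; the I/O path would mirror this with AddIoSpace and
AllocateIoSpace):

  #include <PiDxe.h>
  #include <Library/DxeServicesTableLib.h>  // gDS
  #include <Library/DebugLib.h>

  //
  // Hypothetical helper: make sure one root bridge MMIO aperture is present
  // and owned in GCD. A full implementation would walk the whole range with
  // GetMemorySpaceMap(), since the aperture may span several descriptors and
  // may already be partially added/allocated for another root bridge that
  // shares it.
  //
  EFI_STATUS
  ClaimRootBridgeMmioAperture (
    IN EFI_HANDLE            DriverImageHandle,
    IN EFI_PHYSICAL_ADDRESS  Base,
    IN UINT64                Length
    )
  {
    EFI_STATUS                       Status;
    EFI_GCD_MEMORY_SPACE_DESCRIPTOR  Descriptor;

    Status = gDS->GetMemorySpaceDescriptor (Base, &Descriptor);
    if (EFI_ERROR (Status)) {
      return Status;
    }

    if (Descriptor.GcdMemoryType == EfiGcdMemoryTypeNonExistent) {
      //
      // Step 1: the range was never added, so add it as MMIO space.
      // (0 = no capabilities; a real driver would pass the attributes the
      // range supports, e.g. EFI_MEMORY_UC.)
      //
      Status = gDS->AddMemorySpace (
                      EfiGcdMemoryTypeMemoryMappedIo,
                      Base,
                      Length,
                      0
                      );
      if (EFI_ERROR (Status)) {
        return Status;
      }
    } else if (Descriptor.ImageHandle == DriverImageHandle) {
      //
      // Already added and allocated by us for a previous root bridge that
      // shares this range: nothing more to do.
      //
      return EFI_SUCCESS;
    }

    //
    // Step 2: occupy the range in GCD. Failure here is fatal.
    //
    Status = gDS->AllocateMemorySpace (
                    EfiGcdAllocateAddress,
                    EfiGcdMemoryTypeMemoryMappedIo,
                    0,
                    Length,
                    &Base,
                    DriverImageHandle,
                    NULL
                    );
    ASSERT_EFI_ERROR (Status);
    return Status;
  }

PciHostBridgeDxe would run something like this for every MMIO and I/O range it
gets from PciHostBridgeLib before installing the root bridge IO protocol
instances.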
Regards,
Ray
>-----Original Message-----
>From: edk2-devel [mailto:[email protected]] On Behalf Of
>Marcel Apfelbaum
>Sent: Monday, February 22, 2016 7:02 PM
>To: Ni, Ruiyu <[email protected]>; Laszlo Ersek <[email protected]>
>Cc: Justen, Jordan L <[email protected]>; [email protected];
>Tian, Feng <[email protected]>; Fan, Jeff <[email protected]>
>Subject: Re: [edk2] [Patch V4 4/4] MdeModulePkg: Add generic
>PciHostBridgeDxe driver.
>
>Hi,
>I am sorry again for the noise, I resend the mail from the appropriate mail
>address.
>
>
>On 02/22/2016 09:58 AM, Ni, Ruiyu wrote:
> > Marcel, Laszlo,
>
>Hi,
>
> > I went back to read the PciHostBridgeDxe driver in OvmfPkg and
> > below is my understanding of this driver's behavior:
> > The driver reads the QEMU config "etc/extra-pci-roots" and promotes
> > buses #1 through #extra-pci-roots to root bridges. Supposing there are
> > 10 buses and extra-pci-roots is 3, buses #1, #2 and #3 are promoted to
> > root bridges #1, #2 and #3 while the other buses are still behind main
> > bus #0.
>
>Laszlo implemented it and he can provide more information, but I can say
>the other buses will not always be behind the main bus #0.
>
>The way it works is:
> - scans bus #0 and all the buses behind it (by searching for PCI bridges)
> - once the first PCI hierarchy is finished, if extra-pci-roots > 0, continues
> to search for other PCI roots (until it finds all extra-pci-roots)
> - for every extra PCI root, scans again all the buses behind it.
>
>So we can actually have secondary buses behind the other PCI root buses as well.
>
>
> >
> > I am thinking: if we change the PciHostBridgeDxe driver to only
> > expose one root bridge (main bus), what will it break?
> >
> > Whether PciHostBridgeDxe installs multiple root bridges or a
> > single root bridge doesn't impact OS behavior.
> > The OS doesn't query the DXE core protocol database to find
> > all the root bridge IO instances. So why don't we just simplify the
> > driver to expose one root bridge covering the main bus?
> >
>
>I'll try to rephrase the question in order to be sure I understand it.
>"Why do we need the extra PCI roots at all if they are in the same PCI domain
> and share the same resources?"
>
>The short answer is that one PCI root can be associated by the OSes
>with only one NUMA node.
>
>Now to the long answer:
>What happens if we have a VM with memory/cpus from multiple host NUMA nodes
>and we want to assign a PCI device from one of the host NUMA nodes?
>The only way we can associate this device with the correct NUMA node is by
>putting it behind a PCI root bridge in the proximity of that NUMA node,
>otherwise the performance will greatly suffer.
>
>The above is also true for bare metal machines; I looked again and found this
>machine having this kind of configuration:
>
>System:
> IBM System x3550 M4 Server
>
>lspci -vt:
> -+-[0000:ff]-+-08.0 Intel Corporation Xeon E5/Core i7 QPI Link 0
> | +-08.2 Intel Corporation Device 3c41
> [...]
> | +-13.5 Intel Corporation Xeon E5/Core i7 Ring to
>QuickPath Interconnect Link 0 Performance Monitor
> | \-13.6 Intel Corporation Xeon E5/Core i7 Ring to
>QuickPath Interconnect Link 1 Performance Monitor
> +-[0000:80]-+-00.0-[81-85]--
> | +-02.0-[86-8a]--
> | [...]
> | +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
>VTd_Misc, System Management
> | \-05.2 Intel Corporation Xeon E5/Core i7 Control Status
>and Global Errors
> +-[0000:7f]-+-08.0 Intel Corporation Xeon E5/Core i7 QPI Link 0
> | +-08.2 Intel Corporation Device 3c41
> | +-08.3 Intel Corporation Xeon E5/Core i7 QPI Link Reut 0
> | [...]
> | +-13.5 Intel Corporation Xeon E5/Core i7 Ring to
>QuickPath Interconnect Link 0 Performance Monitor
> | \-13.6 Intel Corporation Xeon E5/Core i7 Ring to
>QuickPath Interconnect Link 1 Performance Monitor
> \-[0000:00]-+-00.0 Intel Corporation Xeon E5/Core i7 DMI2
> +-01.0-[0c-10]--
> +-02.0-[11-15]--+-00.0 Intel Corporation 82599ES
>10-Gigabit SFI/SFP+ Network Connection
> | \-00.1 Intel Corporation 82599ES
>10-Gigabit SFI/SFP+ Network Connection
> [...]
>
>
>iasl DSDT:
>
>
>[...]
> Name (\BBI0, 0x00000000)
> Name (\BBI1, 0x00000080)
>[...]
>
> Scope (\_SB)
> {
> [...]
> Device (IOH0)
> {
> Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */) //
>_HID: Hardware ID
> Name (_CID, EisaId ("PNP0A03") /* PCI Bus */) // _CID:
>Compatible ID
> Name (_UID, 0x00) // _UID: Unique ID
> Method (_BBN, 0, NotSerialized) // _BBN: BIOS Bus Number
> {
> Return (BBI0) /* \BBI0 */
> }
> [...]
> Name (PBR0, ResourceTemplate ()
> {
> WordBusNumber (ResourceProducer, MinFixed,
>MaxFixed, PosDecode,
> 0x0000, // Granularity
> 0x0000, // Range Minimum
> 0x007F, // Range Maximum
> 0x0000, // Translation Offset
> 0x0080, // Length
> ,, )
> IO (Decode16,
> 0x0CF8, // Range Minimum
> 0x0CF8, // Range Maximum
> 0x01, // Alignment
> 0x08, // Length
> )
> WordIO (ResourceProducer, MinFixed, MaxFixed,
>PosDecode, EntireRange,
> 0x0000, // Granularity
> 0x0000, // Range Minimum
> 0x0CF7, // Range Maximum
> 0x0000, // Translation Offset
> 0x0CF8, // Length
> ,, , TypeStatic)
> WordIO (ResourceProducer, MinFixed, MaxFixed,
>PosDecode, EntireRange,
> 0x0000, // Granularity
> 0x1000, // Range Minimum
> 0xBFFF, // Range Maximum
> 0x0000, // Translation Offset
> 0xB000, // Length
> ,, , TypeStatic)
> [...]
> }
> /* the above range will be part of CRS after some logic */
> [...]
> }
> Device (IOH1)
> {
> Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */) //
>_HID: Hardware ID
> Name (_CID, EisaId ("PNP0A03") /* PCI Bus */) // _CID:
>Compatible ID
> Name (_UID, 0x01) // _UID: Unique ID
> Method (_BBN, 0, NotSerialized) // _BBN: BIOS Bus Number
> {
> Return (BBI1) /* \BBI1 */
> }
> [...]
> Name (PBR0, ResourceTemplate ()
> {
> WordBusNumber (ResourceProducer, MinFixed,
>MaxFixed, PosDecode,
> 0x0000, // Granularity
> 0x0080, // Range Minimum
> 0x00FF, // Range Maximum
> 0x0000, // Translation Offset
> 0x0080, // Length
> ,, )
> WordIO (ResourceProducer, MinFixed, MaxFixed,
>PosDecode, EntireRange,
> 0x0000, // Granularity
> 0xC000, // Range Minimum
> 0xFFFF, // Range Maximum
> 0x0000, // Translation Offset
> 0x4000, // Length
> ,, , TypeStatic)
> }
>[...]
>
>As you can see, we have multiple PCI roots sharing the PCI domain 0
>resources. I found this configuration quite common in the machines I work
>with. Those machines have a BIOS and not UEFI firmware, but I really think
>edk2 will benefit from being compatible with the above.
>
>I hope this helps in understanding the issue,
>Marcel
>
>
>
> >
> > Regards,
> > Ray
> >
> >
> >> -----Original Message-----
> >> From: Marcel Apfelbaum [mailto:[email protected]]
> >> Sent: Monday, February 8, 2016 6:56 PM
> >> To: Ni, Ruiyu <[email protected]>; Laszlo Ersek <[email protected]>
> >> Cc: Justen, Jordan L <[email protected]>;
>[email protected];
> >> Tian, Feng <[email protected]>; Fan, Jeff <[email protected]>
> >> Subject: Re: [edk2] [Patch V4 4/4] MdeModulePkg: Add generic
> >> PciHostBridgeDxe driver.
> >>
> >> Hi,
> >>
> >> I am sorry for the noise, I am re-sending this mail from an e-mail address
> >> subscribed to the list.
> >>
> >> Thanks,
> >> Marcel
> >>
> >> On 02/08/2016 12:41 PM, Marcel Apfelbaum wrote:
> >>> On 02/06/2016 09:09 AM, Ni, Ruiyu wrote:
> >>>> Marcel,
> >>>> Please see my reply embedded below.
> >>>>
> >>>> On 2016-02-02 19:07, Laszlo Ersek wrote:
> >>>>> On 02/01/16 16:07, Marcel Apfelbaum wrote:
> >>>>>> On 01/26/2016 07:17 AM, Ni, Ruiyu wrote:
> >>>>>>> Laszlo,
> >>>>>>> I now understand your problem.
> >>>>>>> Can you tell me why OVMF needs multiple root bridges support?
> >>>>>>> My understanding of OVMF is that it's a firmware which can be used
> >>>>>>> in a guest VM environment to boot an OS.
> >>>>>>> Multiple root bridges requirement currently mainly comes from
> >>>>>>> high-end servers.
> >>>>>>> Do you mean that the VM guest needs to be like a high-end server?
> >>>>>>> This may help me to think about the possible solution to your
> >>>>>>> problem.
> >>>>>> Hi Ray,
> >>>>>>
> >>>>>> Laszlo's explanation is very good, this is not exactly about
> >>>>>> high-end VMs,
> >>>>>> we need the extra root bridges to match assigned devices to their
> >>>>>> corresponding NUMA node.
> >>>>>>
> >>>>>> Regarding the OVMF issue, the main problem is that the extra root
> >>>>>> bridges are created dynamically for the VMs (command line parameter)
> >>>>>> and their resources are computed on the fly.
> >>>>>>
> >>>>>> Not directly related to the above, the optimal way to allocate
> >>>>>> resources for PCI root bridges sharing the same PCI domain is to sort
> >>>>>> devices' MEM/IO ranges from the biggest to the smallest and use this
> >>>>>> order during allocation.
> >>>>>>
> >>>>>> After the resource allocation is finished we can build the CRS for
> >>>>>> each PCI root bridge and pass it back to the firmware/OS.
> >>>>>>
> >>>>>> While for "real" machines we can hard-code the root bridge resources
> >>>>>> in some ROM and have them extracted early in the boot process, for the
> >>>>>> VM world this would not be possible. Also, any effort to divide the
> >>>>>> resource range before the resource allocation would be odd and far
> >>>>>> from optimal.
> >>
> >> Hi Ray,
> >> Thank you for your response,
> >>
> >>>> A real machine uses hard-coded resources for root bridges. But when the
> >>>> resources cannot meet a certain root bridge's requirement, firmware can
> >>>> save the real resource requirement per root bridge to NV storage and
> >>>> divide the resources to each root bridge in the next boot according to
> >>>> the NV settings.
> >>>> The MMIO/IO routing in the real machine I mentioned above needs to be
> >>>> fixed in a very early phase before the PciHostBridgeDxe driver runs.
> >>>> That's to say, if [2G, 2.8G) is configured to route to root bridge #1,
> >>>> only [2G, 2.8G) is allowed to be assigned to root bridge #1. And the
> >>>> routing cannot be changed unless a platform reset is performed.
> >>
> >> I understand.
> >>
> >>>>
> >>>> Based on your description, it sounds like all the root bridges in OVMF
> >>>> share the same range of resources and any MMIO/IO in the range can be
> >>>> routed to any root bridge. For example, every root bridge can use
> >>>> [2G, 3G) MMIO.
> >>>
> >>> Exactly. This is true for "snooping" host-bridges which do not have
> >>> their own configuration registers (or MMConfig region). They are
> >>> sniffing host-bridge 0 for configuration cycles, and if they are meant
> >>> for a device on a bus number owned by them, they will forward the
> >>> transaction to their primary root bus.
> >>>
> >>>> Until, in the allocation phase, root bridge #1 is assigned to
> >>>> [2G, 2.8G), #2 is assigned to [2.8G, 2.9G), and #3 is assigned to
> >>>> [2.9G, 3G).
> >>
> >> Correct, but the regions do not have to be disjoint in the above scenario.
> >> Root bridge #1 can have [2G, 2.4G) and [2.8G, 3G) while root bridge #2
> >> can have [2.4G, 2.8G).
> >>
> >> This is so the firmware can distribute the resources in an optimal way. An
> >> example can be:
> >> - root bridge #1 has a PCI device A with a huge BAR and a PCI device B
> >> with a little BAR.
> >> - root bridge #2 has a PCI device C with a medium BAR.
> >> The best way to distribute resources over [2G, 3G) is A's BAR, C's BAR,
> >> and only then B's BAR.
> >>
> >>>> So it seems that we need a way to tell the PciHostBridgeDxe driver from
> >>>> the PciHostBridgeLib that all resources are sharable among all root
> >>>> bridges.
> >>
> >> This is exactly what we need, indeed.
> >>
> >>>>
> >>>> The real platform case is allocation per root bridge and the OVMF case
> >>>> is allocation per PCI domain.
> >>
> >> Indeed, bare metal servers use a different PCI domain per host bridge, but
> >> I've actually seen real servers that have multiple root bridges sharing
> >> the same PCI domain, 0.
> >>
> >>
> >>>> Is my understanding correct?
> >>
> >> It is, and thank you for taking your time to understand the issue,
> >> Marcel
> >>
> >>>>
> >>> [...]
>
>
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel