On 02/22/16 11:57, Marcel Apfelbaum wrote:
> On 02/22/2016 09:58 AM, Ni, Ruiyu wrote:
>> Marcel, Laszlo,
>
> Hi,
>
>> I went back to read the PciHostBridgeDxe driver in OvmfPkg and
>> below is my understanding of this driver's behavior:
>> The driver reads the QEMU config "etc/extra-pci-roots" and promotes
>> buses #1 through #extra-pci-roots to root bridges. Supposing there are
>> 10 buses and extra-pci-roots is 3, buses #1, #2 and #3 are promoted to
>> root bridges #1, #2 and #3, while the other buses are still behind main
>> bus #0.
>
> Laszlo implemented it and he can provide more information, but I can say
> the other buses will not always be behind the main bus #0.
>
> The way it works is:
> - it scans bus #0 and all the buses behind it (by searching for PCI
>   bridges);
> - once the first PCI hierarchy is finished, if extra-pci-roots > 0, it
>   continues to search for other PCI roots (until it finds all
>   extra-pci-roots);
> - for every extra PCI root it again scans all the buses behind it.
>
> So we can actually have secondary buses behind the other PCI root buses
> as well.
This is a correct logical description, but the layering in UEFI is
different.
* First the root bridges are exposed with their resource apertures (bus
number range, IO port range, MMIO range). The UEFI abstractions for this
are the EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL and the
EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.
These protocols are supposed to work together. More precisely, the same
DXE driver is expected to produce the above protocols. The driver should
install:
- one EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL instance per root bridge that
exists, and
- one EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL in total (*)
that knows about all of the root bridges, and manages the allocations.
(*) This holds for the single host bridge case, which does apply in our
case. The point is, it should be an (n:1) multiplicity for
(root bridge IO proto : host bridge alloc proto).
(Of course, the specialty in our case is that the MMIO and IO port
ranges are not distinct from each other, and that we detect the root
bridges themselves at runtime.)
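(Purely for illustration, here is a minimal sketch in C of what the
producer side of that (n:1) multiplicity can look like. It is not the
actual OvmfPkg code: MY_ROOT_BRIDGE, InstallHostAndRootBridgeProtocols
and the other non-protocol identifiers are made-up placeholders; only
the protocol types and GUIDs come from MdePkg.)

  #include <Uefi.h>
  #include <Library/UefiBootServicesTableLib.h>
  #include <Protocol/DevicePath.h>
  #include <Protocol/PciRootBridgeIo.h>
  #include <Protocol/PciHostBridgeResourceAllocation.h>

  //
  // Placeholder driver-private state; not the real OvmfPkg structures.
  //
  typedef struct {
    EFI_HANDLE                       Handle;
    EFI_DEVICE_PATH_PROTOCOL         *DevicePath;
    EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL  Io;
  } MY_ROOT_BRIDGE;

  EFI_STATUS
  InstallHostAndRootBridgeProtocols (
    IN EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL  *ResAlloc,
    IN MY_ROOT_BRIDGE                                    *RootBridges,
    IN UINTN                                             RootBridgeCount
    )
  {
    EFI_HANDLE  HostBridgeHandle;
    EFI_STATUS  Status;
    UINTN       Index;

    //
    // One handle carries the single resource allocation protocol
    // instance for the whole host bridge ...
    //
    HostBridgeHandle = NULL;
    Status = gBS->InstallMultipleProtocolInterfaces (
                    &HostBridgeHandle,
                    &gEfiPciHostBridgeResourceAllocationProtocolGuid,
                    ResAlloc,
                    NULL
                    );
    if (EFI_ERROR (Status)) {
      return Status;
    }

    //
    // ... while each detected root bridge gets its own handle, with its
    // own EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL (plus device path) instance.
    //
    for (Index = 0; Index < RootBridgeCount; Index++) {
      RootBridges[Index].Handle = NULL;
      Status = gBS->InstallMultipleProtocolInterfaces (
                      &RootBridges[Index].Handle,
                      &gEfiDevicePathProtocolGuid,
                      RootBridges[Index].DevicePath,
                      &gEfiPciRootBridgeIoProtocolGuid,
                      &RootBridges[Index].Io,
                      NULL
                      );
      if (EFI_ERROR (Status)) {
        return Status;
      }
    }
    return EFI_SUCCESS;
  }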
* Second, *consuming* the (n + 1) protocol instances described above, it
is the job of the *generic* PCI bus driver to enumerate devices and
bridges; to assign secondary bus numbers to bridges, from the bus number
aperture of each root bridge; and to set the subordinate bus number for
each bridge (incl. root bridges) after the enumeration is complete.
This driver is generic -- ISA and platform independent. This allows the
platform vendor to worry about the root bridge / host bridge driver
only. Once those abstractions are exposed, the generic PCI bus driver
can work.
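(Again just an illustrative sketch, not the PciBusDxe source: the
generic bus driver can learn each root bridge's bus number aperture by
walking the ACPI address space descriptors returned by
EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL.Configuration(). PrintBusAperture is a
made-up name; the protocol member, the descriptor type and the constants
are from MdePkg.)

  #include <Uefi.h>
  #include <Library/DebugLib.h>
  #include <IndustryStandard/Acpi.h>
  #include <Protocol/PciRootBridgeIo.h>

  //
  // Dump the bus number aperture of one root bridge, as reported through
  // the Configuration() member (a list of ACPI address space
  // descriptors, terminated by an end tag).
  //
  VOID
  PrintBusAperture (
    IN EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL  *RootBridgeIo
    )
  {
    EFI_ACPI_ADDRESS_SPACE_DESCRIPTOR  *Desc;
    VOID                               *Resources;
    EFI_STATUS                         Status;

    Status = RootBridgeIo->Configuration (RootBridgeIo, &Resources);
    if (EFI_ERROR (Status)) {
      return;
    }

    for (Desc = (EFI_ACPI_ADDRESS_SPACE_DESCRIPTOR *)Resources;
         Desc->Desc == ACPI_ADDRESS_SPACE_DESCRIPTOR;
         Desc++) {
      if (Desc->ResType == ACPI_ADDRESS_SPACE_TYPE_BUS) {
        DEBUG ((DEBUG_INFO, "bus aperture: [0x%Lx, 0x%Lx]\n",
          Desc->AddrRangeMin, Desc->AddrRangeMax));
      }
    }
  }

(The same walk also yields the IO and MMIO apertures, via
ACPI_ADDRESS_SPACE_TYPE_IO and ACPI_ADDRESS_SPACE_TYPE_MEM.)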
>
>
>>
>> I am thinking: if we change the PciHostBridgeDxe driver to only
>> expose one root bridge (the main bus), what will it break?
>>
>> Whether PciHostBridgeDxe installs multiple root bridges or a single
>> root bridge doesn't impact OS behavior. The OS doesn't query the DXE
>> core protocol database to find all the root bridge IO instances. So
>> why don't we just simplify the driver to expose one root bridge
>> covering the main bus?
>>
>
> I'll try to rephrase the question in order to be sure I understand it.
> "Why do we need the extra PCI roots at all if they are in the same PCI
> domain
> and share the same resources?"
>
> The short answer is that one PCI root can be associated by the OSes
> with only one NUMA node.
Okay, with this you are answering the question "why do you guys need
this" -- my answer focused more on the "why this way exactly" question,
I think. It's good to have a full top-to-bottom explanation.
Thanks!
Laszlo
> Now to the long answer:
> What happens if we have a VM with memory/CPUs from multiple host NUMA
> nodes and we want to assign a PCI device from one of the host NUMA nodes?
> The only way we can associate this device with the correct NUMA node is
> by putting it behind a PCI root bridge in the proximity of that NUMA
> node; otherwise the performance will greatly suffer.
>
> The above is also true for bare metal machines. I looked again and
> found a machine having this kind of configuration:
>
> System:
> IBM System x3550 M4 Server
>
> lspci -vt:
> -+-[0000:ff]-+-08.0 Intel Corporation Xeon E5/Core i7 QPI Link 0
> | +-08.2 Intel Corporation Device 3c41
> [...]
> | +-13.5 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
> | \-13.6 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
> +-[0000:80]-+-00.0-[81-85]--
> | +-02.0-[86-8a]--
> | [...]
> | +-05.0 Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
> | \-05.2 Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
> +-[0000:7f]-+-08.0 Intel Corporation Xeon E5/Core i7 QPI Link 0
> | +-08.2 Intel Corporation Device 3c41
> | +-08.3 Intel Corporation Xeon E5/Core i7 QPI Link Reut 0
> | [...]
> | +-13.5 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
> | \-13.6 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
> \-[0000:00]-+-00.0 Intel Corporation Xeon E5/Core i7 DMI2
> +-01.0-[0c-10]--
> +-02.0-[11-15]--+-00.0 Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
> | \-00.1 Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
> [...]
>
>
> iasl DSDT:
>
>
> [...]
> Name (\BBI0, 0x00000000)
> Name (\BBI1, 0x00000080)
> [...]
>
> Scope (\_SB)
> {
> [...]
> Device (IOH0)
> {
> Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
> Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)          // _CID: Compatible ID
> Name (_UID, 0x00) // _UID: Unique ID
> Method (_BBN, 0, NotSerialized) // _BBN: BIOS Bus Number
> {
> Return (BBI0) /* \BBI0 */
> }
> [...]
> Name (PBR0, ResourceTemplate ()
> {
> WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
> 0x0000, // Granularity
> 0x0000, // Range Minimum
> 0x007F, // Range Maximum
> 0x0000, // Translation Offset
> 0x0080, // Length
> ,, )
> IO (Decode16,
> 0x0CF8, // Range Minimum
> 0x0CF8, // Range Maximum
> 0x01, // Alignment
> 0x08, // Length
> )
> WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode,
> EntireRange,
> 0x0000, // Granularity
> 0x0000, // Range Minimum
> 0x0CF7, // Range Maximum
> 0x0000, // Translation Offset
> 0x0CF8, // Length
> ,, , TypeStatic)
> WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode,
> EntireRange,
> 0x0000, // Granularity
> 0x1000, // Range Minimum
> 0xBFFF, // Range Maximum
> 0x0000, // Translation Offset
> 0xB000, // Length
> ,, , TypeStatic)
> [...]
> }
> /* the above range will be part of CRS after some logic */
> [...]
> }
> Device (IOH1)
> {
> Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
> Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)          // _CID: Compatible ID
> Name (_UID, 0x01) // _UID: Unique ID
> Method (_BBN, 0, NotSerialized) // _BBN: BIOS Bus Number
> {
> Return (BBI1) /* \BBI1 */
> }
> [...]
> Name (PBR0, ResourceTemplate ()
> {
> WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
> 0x0000, // Granularity
> 0x0080, // Range Minimum
> 0x00FF, // Range Maximum
> 0x0000, // Translation Offset
> 0x0080, // Length
> ,, )
> WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode,
> EntireRange,
> 0x0000, // Granularity
> 0xC000, // Range Minimum
> 0xFFFF, // Range Maximum
> 0x0000, // Translation Offset
> 0x4000, // Length
> ,, , TypeStatic)
> }
> [...]
>
> As you can see, we have multiple PCI roots sharing the PCI domain 0
> resources.
> I found this configuration quite common in the machines I work with.
> Those machines have a legacy BIOS and not UEFI firmware, but I really
> think edk2 will benefit from being compatible with the above.
>
> I hope this helps in understanding the issue,
> Marcel
>
>
>
>>
>> Regards,
>> Ray
>>
>>
>>> -----Original Message-----
>>> From: Marcel Apfelbaum [mailto:[email protected]]
>>> Sent: Monday, February 8, 2016 6:56 PM
>>> To: Ni, Ruiyu <[email protected]>; Laszlo Ersek <[email protected]>
>>> Cc: Justen, Jordan L <[email protected]>;
>>> [email protected];
>>> Tian, Feng <[email protected]>; Fan, Jeff <[email protected]>
>>> Subject: Re: [edk2] [Patch V4 4/4] MdeModulePkg: Add generic
>>> PciHostBridgeDxe driver.
>>>
>>> Hi,
>>>
>>> I am sorry for the noise, I am re-sending this mail from an e-mail
>>> address
>>> subscribed to the list.
>>>
>>> Thanks,
>>> Marcel
>>>
>>> On 02/08/2016 12:41 PM, Marcel Apfelbaum wrote:
>>>> On 02/06/2016 09:09 AM, Ni, Ruiyu wrote:
>>>>> Marcel,
>>>>> Please see my reply embedded below.
>>>>>
>>>>> On 2016-02-02 19:07, Laszlo Ersek wrote:
>>>>>> On 02/01/16 16:07, Marcel Apfelbaum wrote:
>>>>>>> On 01/26/2016 07:17 AM, Ni, Ruiyu wrote:
>>>>>>>> Laszlo,
>>>>>>>> I now understand your problem.
>>>>>>>> Can you tell me why OVMF needs multiple root bridges support?
>>>>>>>> My understanding of OVMF is that it's a firmware which can be
>>>>>>>> used in a guest VM environment to boot an OS.
>>>>>>>> The multiple root bridges requirement currently mainly comes
>>>>>>>> from high-end servers.
>>>>>>>> Do you mean that the VM guest needs to be like a high-end server?
>>>>>>>> This may help me to think about the possible solution to your
>>>>>>>> problem.
>>>>>>> Hi Ray,
>>>>>>>
>>>>>>> Laszlo's explanation is very good. This is not exactly about
>>>>>>> high-end VMs; we need the extra root bridges to match assigned
>>>>>>> devices to their corresponding NUMA node.
>>>>>>>
>>>>>>> Regarding the OVMF issue, the main problem is that the extra root
>>>>>>> bridges are created dynamically for the VMs (command line
>>>>>>> parameter) and their resources are computed on the fly.
>>>>>>>
>>>>>>> Not directly related to the above, the optimal way to allocate
>>>>>>> resources for PCI root bridges sharing the same PCI domain is to
>>>>>>> sort the devices' MEM/IO ranges from the biggest to the smallest
>>>>>>> and use this order during allocation.
>>>>>>>
>>>>>>> After the resource allocation is finished, we can build the CRS
>>>>>>> for each PCI root bridge and pass it back to the firmware/OS.
>>>>>>>
>>>>>>> While for "real" machines we can hard-code the root bridge
>>>>>>> resources in some ROM and have them extracted early in the boot
>>>>>>> process, for the VM world this would not be possible. Also, any
>>>>>>> effort to divide the resource range before the resource
>>>>>>> allocation would be odd and far from optimal.
>>>
>>> Hi Ray,
>>> Thank you for your response,
>>>
>>>>> A real machine uses hard-coded resources for root bridges. But when
>>>>> the resources cannot meet certain root bridges' requirements,
>>>>> firmware can save the real resource requirement per root bridge to
>>>>> NV storage and divide the resources to each root bridge in the next
>>>>> boot according to the NV settings.
>>>>> The MMIO/IO routing in the real machine I mentioned above needs to
>>>>> be fixed in a very early phase, before the PciHostBridgeDxe driver
>>>>> runs. That's to say, if [2G, 2.8G) is configured to route to root
>>>>> bridge #1, only [2G, 2.8G) is allowed to be assigned to root bridge
>>>>> #1. And the routing cannot be changed unless a platform reset is
>>>>> performed.
>>>
>>> I understand.
>>>
>>>>>
>>>>> Based on your description, it sounds like all the root bridges in
>>>>> OVMF share the same range of resources, and any MMIO/IO in the
>>>>> range can be routed to any root bridge. For example, every root
>>>>> bridge can use [2G, 3G) MMIO.
>>>>
>>>> Exactly. This is true for "snooping" host-bridges which do not have
>>>> their own configuration registers (or MMConfig region). They are
>>>> sniffing host-bridge 0 for configuration cycles, and if they are
>>>> meant for a device on a bus number owned by them, they will forward
>>>> the transaction to their primary root bus.
>>>>
>>>>> Until, in the allocation phase, root bridge #1 is assigned
>>>>> [2G, 2.8G), #2 is assigned [2.8G, 2.9G), and #3 is assigned
>>>>> [2.9G, 3G).
>>>
>>> Correct, but the regions do not have to be disjoint in the above
>>> scenario. Root bridge #1 can have [2G, 2.4G) and [2.8G, 3G) while
>>> root bridge #2 can have [2.4G, 2.8G).
>>>
>>> This is so the firmware can distribute the resources in an optimal
>>> way. An example can be:
>>> - root bridge #1 has a PCI device A with a huge BAR and a PCI device B
>>>   with a little BAR.
>>> - root bridge #2 has a PCI device C with a medium BAR.
>>> The best way to distribute resources over [2G, 3G) is A's BAR, then
>>> C's BAR, and only then B's BAR.
>>>
>>>>> So it seems that we need a way to tell the PciHostBridgeDxe driver,
>>>>> from the PciHostBridgeLib, that all resources are sharable among
>>>>> all root bridges.
>>>
>>> This is exactly what we need, indeed.
>>>
>>>>>
>>>>> The real platform case is the allocation per root bridge, and the
>>>>> OVMF case is the allocation per PCI domain.
>>>
>>> Indeed, bare metal servers use a different PCI domain per host
>>> bridge, but I've actually seen real servers that have multiple root
>>> bridges sharing the same PCI domain, 0.
>>>
>>>
>>>>> Is my understanding correct?
>>>
>>> It is, and thank you for taking your time to understand the issue,
>>> Marcel
>>>
>>>>>
>>>> [...]
>