On 02/22/16 08:58, Ni, Ruiyu wrote:
> Marcel, Laszlo,
> I went back to read the PciHostBridgeDxe driver in OvmfPkg and
> below is my understanding of this driver's behavior:
> The driver reads the QEMU config "etc/extra-pci-roots" and promotes
> buses #1 through #extra-pci-roots to root bridges. Supposing there are
> 10 buses and extra-pci-roots is 3, buses #1, #2, and #3 are promoted to
> root bridges #1, #2, and #3 while the other buses are still behind main
> bus #0.
No, it doesn't work like this.
(1) If QEMU provides exactly one root bridge, then the driver creates
exactly one PCI root bridge protocol instance.
The bus number of the bridge is 0: PciRoot(0x0).
The subordinate bus number (i.e., the highest bus number that can be
assigned, as a secondary bus number, to any bridge that is recursively
behind this root bus) is 255.
So it will produce a bus number "aperture" of 0..255, inclusive.
(2) The fw_cfg file called "etc/extra-pci-roots", if present, exposes
the number of *extra* (that is, additional) root buses. This is a flat
number (a scalar), and it does not imply the bus number ranges
(apertures) that will be assigned to the individual root buses. This
number is just a performance optimization for the loop described below.
PciRoot(0x0) is always assumed to be there.
Then the loop probes function #0 of all 32 devices of each bus in the
1..255 range (inclusive), reading the Vendor ID register from PCI config
space.
Because bridges have not been configured yet when this loop runs (that
is, secondary bus numbers have not been assigned yet -- that's done by
the PCI bus driver, later), only devices that sit directly on a root bus
can respond to this probing.
So, we equate the existence of an "extra" root bus with the fact that at
least one device sitting on it "responds" (with its function #0).
The number read from "etc/extra-pci-roots" is only used for terminating
the scan as soon as we find the last extra root bus, without having to
scan up to 255 in all cases.
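
To make that loop concrete, here is a minimal sketch of the probing
described in (2). This is *not* the driver's verbatim code: it assumes
OvmfPkg's QemuFwCfgLib and MdePkg's PciLib, omits error handling, and the
ScanForRootBuses() name and array-based interface are made up for the
example. It simply records the root bus numbers it finds in a
caller-provided array.

#include <Uefi.h>
#include <Library/PciLib.h>
#include <Library/QemuFwCfgLib.h>

//
// Sketch only -- not the driver's verbatim code. Returns the number of root
// buses found (including the always-present bus 0) and stores their bus
// numbers in the caller-provided RootBuses array (at least 256 elements).
//
STATIC
UINTN
ScanForRootBuses (
  OUT UINT8  *RootBuses
  )
{
  FIRMWARE_CONFIG_ITEM  FwCfgItem;
  UINTN                 FwCfgSize;
  UINT64                ExtraRootBridges;
  UINTN                 Count;
  UINTN                 Bus;
  UINTN                 Device;

  RootBuses[0] = 0;        // PciRoot(0x0) is always assumed to be there
  Count        = 1;

  //
  // "etc/extra-pci-roots" only tells us how many extra root buses exist, so
  // that the scan below can stop early; it says nothing about bus numbers.
  //
  ExtraRootBridges = 0;
  if (!RETURN_ERROR (
         QemuFwCfgFindFile ("etc/extra-pci-roots", &FwCfgItem, &FwCfgSize))) {
    QemuFwCfgSelectItem (FwCfgItem);
    ExtraRootBridges = QemuFwCfgRead64 ();
  }

  for (Bus = 1; Bus <= 255 && Count <= ExtraRootBridges; Bus++) {
    for (Device = 0; Device <= 31; Device++) {
      //
      // Read the Vendor ID of function #0. Bridges have no secondary bus
      // numbers yet, so only a device sitting directly on a root bus can
      // respond here (anything other than all-bits-one).
      //
      if (PciRead16 (PCI_LIB_ADDRESS (Bus, Device, 0, 0)) != 0xFFFF) {
        RootBuses[Count++] = (UINT8)Bus;   // this bus hosts an extra root bridge
        break;
      }
    }
  }
  return Count;
}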
(3) Now, assume QEMU has been configured to provide 3 additional (=
extra) root buses, and that the probing in (2) will detect the
following bus numbers as alive: 7, 11, 15 (decimal).
In this case, the driver will produce four (4) PCI root bridge protocol
instances, with the following resource apertures:
- PciRoot(0x0):
bus aperture: 0x00 through 0x06, inclusive
IO port aperture: 0x0000 through 0xFFFF, inclusive
MMIO aperture: 0x8000_0000 through 0xFFFF_FFFF, inclusive
- PciRoot(0x7):
bus aperture: 0x07 through 0x0A, inclusive
IO port aperture: 0x0000 through 0xFFFF, inclusive
(same as above)
MMIO aperture: 0x8000_0000 through 0xFFFF_FFFF, inclusive
(same as above)
- PciRoot(0xB):
bus aperture: 0x0B through 0x0E, inclusive
IO port aperture: 0x0000 through 0xFFFF, inclusive
(same as above)
MMIO aperture: 0x8000_0000 through 0xFFFF_FFFF, inclusive
(same as above)
- PciRoot(0xF):
bus aperture: 0x0F through 0xFF, inclusive
IO port aperture: 0x0000 through 0xFFFF, inclusive
(same as above)
MMIO aperture: 0x8000_0000 through 0xFFFF_FFFF, inclusive
(same as above)
In short, the 0x00 .. 0xFF interval is simply divided up consecutively
between the root buses that the loop finds. The last extra root bus (in
the example: 15 decimal) will get a bus number aperture that extends to
255 (decimal).
The IO port aperture for each root bus is 0..65535, inclusive.
The MMIO aperture for each root bus is 2GB..4GB-1, inclusive.
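
Continuing the sketch above (again, illustration only: the ROOT_APERTURES
structure and the DivideApertures() helper are made up, not the driver's
actual types), dividing up the ranges then looks roughly like this:

//
// Illustrative sketch: divide the 0..255 bus range consecutively among the
// root buses found by ScanForRootBuses(), and give every root bridge the
// same IO and MMIO apertures.
//
typedef struct {
  UINT8   BusBase,  BusLimit;      // bus number aperture, inclusive
  UINT64  IoBase,   IoLimit;       // IO port aperture, inclusive
  UINT64  MmioBase, MmioLimit;     // 32-bit MMIO aperture, inclusive
} ROOT_APERTURES;

STATIC
VOID
DivideApertures (
  IN  CONST UINT8     *RootBuses,    // output of ScanForRootBuses()
  IN  UINTN           Count,
  OUT ROOT_APERTURES  *Apertures     // Count elements
  )
{
  UINTN  Idx;

  for (Idx = 0; Idx < Count; Idx++) {
    Apertures[Idx].BusBase  = RootBuses[Idx];
    //
    // Each root bus gets every bus number up to (but not including) the next
    // root bus; the last root bus extends to 255.
    //
    Apertures[Idx].BusLimit = (Idx + 1 < Count) ?
                                (UINT8)(RootBuses[Idx + 1] - 1) : 255;
    //
    // IO ports 0..0xFFFF and MMIO 2GB..4GB-1 are shared by all root bridges;
    // the actual allocations are kept disjoint later by the gDS services.
    //
    Apertures[Idx].IoBase    = 0;
    Apertures[Idx].IoLimit   = 0xFFFF;
    Apertures[Idx].MmioBase  = 0x80000000;
    Apertures[Idx].MmioLimit = 0xFFFFFFFF;
  }
}

With the example scan result {0, 7, 11, 15}, this reproduces exactly the
four apertures listed under (3) above.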
When the actual resource allocation occurs (IO port, and MMIO), the gDS
memory and IO space allocation services will automatically ensure that
there is no overlap between device BARs (nor bridge reservations).
Also, these allocations will be satisfied from the MMIO and IO space
ranges that the DXE core creates, based on the resource descriptor HOBs
that OVMF installs still in PEI.
So, the intent of the above is:
- detect the PCI root buses that respond, dynamically
- divide up the 0..255 (inclusive) bus range between them, naively
- in each individual bus range determined, the lowest bus number within
that interval is the bus number of the root bus itself; while the
rest of the bus numbers in the interval can be used as secondary bus
numbers for bridges that are (recursively) behind that root bus
- practically ignore IO port and MMIO apertures: the root bridges don't
have to be separated from each other on this level, *and* the actual,
individual allocations will be mutually exclusive anyway, due to the
DXE services doing their job.
> I am wondering: if we change the PciHostBridgeDxe driver to only
> expose one root bridge (the main bus), what will it break?
- Booting off devices that are behind the "extra" root bridges
(recursively) will break.
- Devices behind the "extra" root buses will not be enumerated; the
firmware will not assign resources to them. In turn, QEMU will not
generate ACPI descriptions for them. These devices will not be usable by
the OS at runtime.
> Whether PciHostBridgeDxe installs multiple root bridges or a single
> root bridge doesn't impact OS behavior.
> The OS doesn't query the DXE core protocol database to find
> all the root bridge IO instances.
Correct, but the generic PCI bus driver consumes the PCI root bridge IO
protocol instances, for device enumeration and resource allocation.
> So why don't we just simplify the
> driver to expose one root bridge covering the main bus?
See above.
In more detail:
(a) It is valid to have a boot order that lists several devices that
exist behind different root buses.
For example, you can have a boot order where
- the first option is an assigned physical network device (e.g., a
virtual function of an SR-IOV network card),
- the second option is an emulated NIC (behind a different root bridge),
- and the third boot option is a virtio-scsi CD-ROM (where the
virtio-scsi host bus adapter is behind *yet another* root bridge).
(b) This is related to (a). Assume that an OS boot loader program --
which is a UEFI application -- is loaded from PciRoot(0xB)/...
The boot loader program can retrieve the loaded image protocol instance
from its own image handle, and create a new File device path, in order
to access *sibling* files. That is, it might want to open config files,
or other files, that reside in the same directory where it was loaded
from (regardless of PXE, disk, etc -- the point is that the device path
starts with PciRoot(0xB)).
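
Here is a minimal sketch of that pattern, for the disk case only. The
OpenSiblingFile() helper and the "sibling.cfg" file name are made up for
the example; a real loader derives the name from its own FilePath (and it
may go through PXE/TFTP instead of a filesystem):

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Protocol/LoadedImage.h>
#include <Protocol/SimpleFileSystem.h>

STATIC
EFI_STATUS
OpenSiblingFile (
  IN  EFI_HANDLE         ImageHandle,
  OUT EFI_FILE_PROTOCOL  **File
  )
{
  EFI_STATUS                       Status;
  EFI_LOADED_IMAGE_PROTOCOL        *LoadedImage;
  EFI_SIMPLE_FILE_SYSTEM_PROTOCOL  *FileSystem;
  EFI_FILE_PROTOCOL                *Root;

  //
  // The loaded image protocol tells us which device we were loaded from; its
  // device path starts with the same PciRoot(...) node -- e.g. PciRoot(0xB)
  // -- regardless of which root bridge that happens to be.
  //
  Status = gBS->HandleProtocol (
                  ImageHandle,
                  &gEfiLoadedImageProtocolGuid,
                  (VOID **)&LoadedImage
                  );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  Status = gBS->HandleProtocol (
                  LoadedImage->DeviceHandle,
                  &gEfiSimpleFileSystemProtocolGuid,
                  (VOID **)&FileSystem
                  );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  Status = FileSystem->OpenVolume (FileSystem, &Root);
  if (EFI_ERROR (Status)) {
    return Status;
  }

  //
  // A real loader would prepend the directory portion of
  // LoadedImage->FilePath here, to get a true sibling of its own binary.
  //
  Status = Root->Open (Root, File, L"sibling.cfg", EFI_FILE_MODE_READ, 0);
  Root->Close (Root);
  return Status;
}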
(c) The third reason is ACPI. After PCI enumeration and resource
assignment completes, OVMF kicks QEMU. QEMU then generates ACPI payload
dynamically; OVMF downloads it and installs the ACPI tables for the OS
to consume.
The DSDT in this payload lists all of the root bridges (the main one and
the extra ones as well), with their resource apertures. If a device
lives (recursively) behind an extra root bridge that is not advertised
in ACPI, then the OS won't see that device.
Thanks
Laszlo
>
>
> Regards,
> Ray
>
>
>> -----Original Message-----
>> From: Marcel Apfelbaum [mailto:[email protected]]
>> Sent: Monday, February 8, 2016 6:56 PM
>> To: Ni, Ruiyu <[email protected]>; Laszlo Ersek <[email protected]>
>> Cc: Justen, Jordan L <[email protected]>; [email protected];
>> Tian, Feng <[email protected]>; Fan, Jeff <[email protected]>
>> Subject: Re: [edk2] [Patch V4 4/4] MdeModulePkg: Add generic
>> PciHostBridgeDxe driver.
>>
>> Hi,
>>
>> I am sorry for the noise, I am re-sending this mail from an e-mail address
>> subscribed to the list.
>>
>> Thanks,
>> Marcel
>>
>> On 02/08/2016 12:41 PM, Marcel Apfelbaum wrote:
>>> On 02/06/2016 09:09 AM, Ni, Ruiyu wrote:
>>>> Marcel,
>>>> Please see my reply embedded below.
>>>>
>>>> On 2016-02-02 19:07, Laszlo Ersek wrote:
>>>>> On 02/01/16 16:07, Marcel Apfelbaum wrote:
>>>>>> On 01/26/2016 07:17 AM, Ni, Ruiyu wrote:
>>>>>>> Laszlo,
>>>>>>> I now understand your problem.
>>>>>>> Can you tell me why OVMF needs multiple root bridges support?
>>>>>>> My understanding of OVMF is that it's firmware which can be used in a
>>>>>>> guest VM environment to boot an OS.
>>>>>>> The multiple root bridges requirement currently mainly comes from
>>>>>>> high-end servers.
>>>>>>> Do you mean that the VM guest needs to be like a high-end server?
>>>>>>> This may help me to think about the possible solution to your problem.
>>>>>> Hi Ray,
>>>>>>
>>>>>> Laszlo's explanation is very good, this is not exactly about high-end
>>>>>> VMs,
>>>>>> we need the extra root bridges to match assigned devices to their
>>>>>> corresponding NUMA node.
>>>>>>
>>>>>> Regarding the OVMF issue, the main problem is that the extra root
>>>>>> bridges are created dynamically for the VMs (command line parameter)
>>>>>> and their resources are computed on the fly.
>>>>>>
>>>>>> Not directly related to the above, the optimal way to allocate resources
>>>>>> for PCI root bridges
>>>>>> sharing the same PCI domain is to sort devices' MEM/IO ranges from the
>>>>>> biggest to smallest
>>>>>> and use this order during allocation.
>>>>>>
>>>>>> After the resource allocation is finished, we can build the CRS for each
>>>>>> PCI root bridge
>>>>>> and pass it back to firmware/OS.
>>>>>>
>>>>>> While for "real" machines we can hard-code the root bridge resources in
>>>>>> some ROM and have it
>>>>>> extracted early in the boot process, for the VM world this would not be
>>>>>> possible. Also
>>>>>> any effort to divide the resources range before the resource allocation
>>>>>> would be odd and far from optimal.
>>
>> Hi Ray,
>> Thank you for your response,
>>
>>>> Real machines use hard-coded resources for root bridges. But when the
>>>> resources cannot meet certain root bridges' requirements, firmware can
>>>> save the real resource requirement per root bridge to NV storage and
>>>> divide the resources among the root bridges in the next boot according
>>>> to the NV settings.
>>>> The MMIO/IO routing in the real machine I mentioned above needs to be
>>>> fixed in a very early phase, before the PciHostBridgeDxe driver runs.
>>>> That's to say, if [2G, 2.8G) is configured to route to root bridge #1,
>>>> only [2G, 2.8G) is allowed to be assigned to root bridge #1. And the
>>>> routing cannot be changed unless a platform reset is performed.
>>
>> I understand.
>>
>>>>
>>>> Based on your description, it sounds like all the root bridges in OVMF
>>>> share the same range of resources, and any MMIO/IO in the range can be
>>>> routed to any root bridge. For example, every root bridge can use
>>>> [2G, 3G) MMIO.
>>>
>>> Exactly. This is true for "snooping" host bridges which do not have their
>>> own configuration registers (or MMConfig region). They sniff host bridge 0
>>> for configuration cycles, and if the cycles are meant for a device on a
>>> bus number owned by them, they forward the transaction to their primary
>>> root bus.
>>>
>>>> Until, in the allocation phase, root bridge #1 is assigned [2G, 2.8G),
>>>> #2 is assigned [2.8G, 2.9G), and #3 is assigned [2.9G, 3G).
>>
>> Correct, but the regions do not have to be disjoint in the above scenario.
>> Root bridge #1 can have [2G, 2.4G) and [2.8G, 3G) while root bridge #2 can
>> have [2.4G, 2.8G).
>>
>> This is so the firmware can distribute the resources in an optimal way. An
>> example can be:
>> - root bridge #1 has a PCI device A with a huge BAR and a PCI device B
>> with a little BAR.
>> - root bridge #2 has a PCI device C with a medium BAR.
>> The best way to distribute resources over [2G, 3G) is A BAR, C BAR, and only
>> then B BAR.
>>
>>>> So it seems that we need a way to tell the PciHostBridgeDxe driver, from
>>>> the PciHostBridgeLib, that all resources are sharable among all root
>>>> bridges.
>>
>> This is exactly what we need, indeed.
>>
>>>>
>>>> The real platform case is allocation per root bridge, and the OVMF case
>>>> is allocation per PCI domain.
>>
>> Indeed, bare metal servers use a different PCI domain per host bridge, but I've
>> actually seen
>> real servers that have multiple root bridges sharing the same PCI domain, 0.
>>
>>
>>>> Is my understanding correct?
>>
>> It is, and thank you for taking your time to understand the issue,
>> Marcel
>>
>>>>
>>> [...]
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel