On 02/23/2016 07:13 AM, Ni, Ruiyu wrote:
Marcel,
I see two requirements from your mail:
1. Non-contiguous resources: root bridge #1 uses [2G, 2.4G) and [2.8G, 3G)
   while root bridge #2 uses [2.4G, 2.8G).
2. Sharable resources among root bridges: all root bridges in the same PCI
   segment can share one common range of resources.

Requirement #1 is not supported by the MdeModulePkg/PciBus driver, and I guess
it's not an urgent requirement and doesn't block the OVMF PciHostBridge porting.

Requirement #2 can be interpreted as: it is valid for the resources claimed by
different root bridges to overlap, no matter which segment they belong to.

The overlap can be, for example, root bridge #1 claiming [2G, 2.4G) while root
bridge #2 claims [2.2G, 2.6G) -- [2.2G, 2.4G) is shared by both root bridges.
In such a case, PCI devices under root bridge #1 can only use resources from
[2G, 2.4G) and devices under root bridge #2 can only use [2.2G, 2.6G). The GCD
services guarantee there is no resource conflict -- if [2.2G, 2.3G) is used by
a device under root bridge #1, it won't be used by a device under root bridge #2.

An extreme case is when both root bridges claim [2G, 3G), which is the OVMF case.

So the change to PciHostBridgeDxe can be (see the sketch after this list):
1. Check whether the resources claimed by the root bridges have already been
   added, and call AddMemorySpace/AddIoSpace for those resource ranges which
   haven't been added yet.
2. Call AllocateMemorySpace/AllocateIoSpace to occupy these resources in GCD.
   The allocation shouldn't fail; otherwise it is a fatal error and the
   PciHostBridgeDxe driver will assert and exit.
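
For illustration only, a minimal sketch of that flow using the DXE GCD
services. The helper name AddAndAllocateMmio and the capability value are my
own illustrative choices, not existing driver code; a real implementation
would also walk the whole range rather than check only its base address:

  #include <PiDxe.h>
  #include <Library/DebugLib.h>
  #include <Library/DxeServicesTableLib.h>   // gDS

  //
  // Illustrative helper (hypothetical name): make sure an MMIO range claimed
  // by a root bridge exists in GCD, then allocate it so no other driver can
  // claim it.
  //
  EFI_STATUS
  AddAndAllocateMmio (
    IN EFI_HANDLE            ImageHandle,
    IN EFI_PHYSICAL_ADDRESS  Base,
    IN UINT64                Length
    )
  {
    EFI_STATUS                       Status;
    EFI_GCD_MEMORY_SPACE_DESCRIPTOR  Descriptor;

    Status = gDS->GetMemorySpaceDescriptor (Base, &Descriptor);
    ASSERT_EFI_ERROR (Status);

    //
    // Step 1: add the range only if it is not already present in GCD.
    //
    if (Descriptor.GcdMemoryType == EfiGcdMemoryTypeNonExistent) {
      Status = gDS->AddMemorySpace (
                      EfiGcdMemoryTypeMemoryMappedIo,
                      Base,
                      Length,
                      EFI_MEMORY_UC
                      );
      ASSERT_EFI_ERROR (Status);
    }

    //
    // Step 2: allocate the range at its fixed address; failure is fatal.
    //
    Status = gDS->AllocateMemorySpace (
                    EfiGcdAllocateAddress,
                    EfiGcdMemoryTypeMemoryMappedIo,
                    0,                  // Alignment
                    Length,
                    &Base,
                    ImageHandle,
                    NULL
                    );
    ASSERT_EFI_ERROR (Status);
    return Status;
  }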

Hi Ray,
Thank you for considering the options to support multiple root bridges
in the same PCI domain.

Since I am new to OVMF, I'll let Laszlo get into the details
and I will try to keep up with the conversation.

Thanks,
Marcel


Regards,
Ray


-----Original Message-----
From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of
Marcel Apfelbaum
Sent: Monday, February 22, 2016 7:02 PM
To: Ni, Ruiyu <ruiyu...@intel.com>; Laszlo Ersek <ler...@redhat.com>
Cc: Justen, Jordan L <jordan.l.jus...@intel.com>; edk2-de...@ml01.01.org;
Tian, Feng <feng.t...@intel.com>; Fan, Jeff <jeff....@intel.com>
Subject: Re: [edk2] [Patch V4 4/4] MdeModulePkg: Add generic
PciHostBridgeDxe driver.

Hi,
I am sorry again for the noise; I am re-sending this mail from the appropriate
mail address.


On 02/22/2016 09:58 AM, Ni, Ruiyu wrote:
Marcel, Laszlo,

Hi,

I went back to read the PciHostBridgeDxe driver in OvmfPkg and
below is my understanding to this driver's behavior:
The driver reads the QEMU config file "etc/extra-pci-roots" and promotes
buses #1 through #extra-pci-roots to root bridges. Supposing there are
10 buses and extra-pci-roots is 3, buses #1, #2 and #3 are promoted to
root bridges #1, #2 and #3 while the other buses are still behind the main
bus #0.

Laszlo implemented it and he can provide more information, but I can say
the other buses will not always be behind the main bus #0.

The way it works is:
  - it scans bus #0 and all the buses behind it (by searching for PCI bridges)
  - once the first PCI hierarchy is finished, if extra-pci-roots > 0 it
    continues to search for other PCI roots (until it finds all extra-pci-roots)
  - for every extra PCI root it scans again all the buses behind it.

So we can actually have secondary buses on the other PCI root buses as well.
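
For reference, a minimal sketch of how the extra root count could be read from
QEMU's fw_cfg, assuming OvmfPkg's QemuFwCfgLib (the function name
GetExtraPciRootCount is illustrative, not the actual OVMF code):

  #include <Uefi.h>
  #include <Library/QemuFwCfgLib.h>

  //
  // Read the number of extra PCI root buses that QEMU exposes through the
  // "etc/extra-pci-roots" fw_cfg file. If the file is absent, assume 0.
  //
  UINT64
  GetExtraPciRootCount (
    VOID
    )
  {
    RETURN_STATUS         Status;
    FIRMWARE_CONFIG_ITEM  Item;
    UINTN                 Size;
    UINT64                ExtraRoots;

    ExtraRoots = 0;
    Status = QemuFwCfgFindFile ("etc/extra-pci-roots", &Item, &Size);
    if (!RETURN_ERROR (Status) && Size == sizeof ExtraRoots) {
      QemuFwCfgSelectItem (Item);
      ExtraRoots = QemuFwCfgRead64 ();
    }
    return ExtraRoots;
  }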



I am wondering: if we change the PciHostBridgeDxe driver to only
expose one root bridge (the main bus), what will it break?

The behavior of PciHostBridgeDxe, whether it installs multiple
root bridges or a single root bridge, doesn't impact OS behavior.
The OS doesn't query the DXE core protocol database to find
all the root bridge IO instances. So why don't we just simplify the
driver to expose one root bridge covering the main bus?


I'll try to rephrase the question in order to be sure I understand it.
"Why do we need the extra PCI roots at all if they are in the same PCI domain
  and share the same resources?"

The short answer is that one PCI root can be associated by the OSes
with only one NUMA node.

Now for the long answer:
What happens if we have a VM with memory/CPUs from multiple host NUMA nodes
and we want to assign a PCI device from one of the host NUMA nodes?
The only way we can associate this device with the correct NUMA node is by
putting it behind a PCI root bridge in the proximity of that NUMA node;
otherwise the performance will greatly suffer.

The above is also true for bare metal machines; I looked again and found a
machine having this kind of configuration:

System:
     IBM System x3550 M4 Server

lspci -vt:
 -+-[0000:ff]-+-08.0  Intel Corporation Xeon E5/Core i7 QPI Link 0
  |           +-08.2  Intel Corporation Device 3c41
  |           [...]
  |           +-13.5  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
  |           \-13.6  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
  +-[0000:80]-+-00.0-[81-85]--
  |           +-02.0-[86-8a]--
  |           [...]
  |           +-05.0  Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
  |           \-05.2  Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
  +-[0000:7f]-+-08.0  Intel Corporation Xeon E5/Core i7 QPI Link 0
  |           +-08.2  Intel Corporation Device 3c41
  |           +-08.3  Intel Corporation Xeon E5/Core i7 QPI Link Reut 0
  |           [...]
  |           +-13.5  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
  |           \-13.6  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
  \-[0000:00]-+-00.0  Intel Corporation Xeon E5/Core i7 DMI2
              +-01.0-[0c-10]--
              +-02.0-[11-15]--+-00.0  Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
              |               \-00.1  Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection
              [...]


iasl DSDT:


[...]
     Name (\BBI0, 0x00000000)
     Name (\BBI1, 0x00000080)
[...]

  Scope (\_SB)
  {
  [...]
     Device (IOH0)
     {
         Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
         Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)  // _CID: Compatible ID
         Name (_UID, 0x00)  // _UID: Unique ID
         Method (_BBN, 0, NotSerialized)  // _BBN: BIOS Bus Number
         {
             Return (BBI0) /* \BBI0 */
         }
         [...]
         Name (PBR0, ResourceTemplate ()
         {
             WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
                 0x0000,             // Granularity
                 0x0000,             // Range Minimum
                 0x007F,             // Range Maximum
                 0x0000,             // Translation Offset
                 0x0080,             // Length
                 ,, )
             IO (Decode16,
                 0x0CF8,             // Range Minimum
                 0x0CF8,             // Range Maximum
                 0x01,               // Alignment
                 0x08,               // Length
                 )
             WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
                 0x0000,             // Granularity
                 0x0000,             // Range Minimum
                 0x0CF7,             // Range Maximum
                 0x0000,             // Translation Offset
                 0x0CF8,             // Length
                 ,, , TypeStatic)
             WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
                 0x0000,             // Granularity
                 0x1000,             // Range Minimum
                 0xBFFF,             // Range Maximum
                 0x0000,             // Translation Offset
                 0xB000,             // Length
                 ,, , TypeStatic)
             [...]
         })
         /* the above range will be part of CRS after some logic */
         [...]
     }
     Device (IOH1)
     {
         Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
         Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)  // _CID: Compatible ID
         Name (_UID, 0x01)  // _UID: Unique ID
         Method (_BBN, 0, NotSerialized)  // _BBN: BIOS Bus Number
         {
             Return (BBI1) /* \BBI1 */
         }
         [...]
         Name (PBR0, ResourceTemplate ()
         {
             WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
                 0x0000,             // Granularity
                 0x0080,             // Range Minimum
                 0x00FF,             // Range Maximum
                 0x0000,             // Translation Offset
                 0x0080,             // Length
                 ,, )
             WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
                 0x0000,             // Granularity
                 0xC000,             // Range Minimum
                 0xFFFF,             // Range Maximum
                 0x0000,             // Translation Offset
                 0x4000,             // Length
                 ,, , TypeStatic)
         })
[...]

As you can see, we have multiple PCI roots sharing the PCI domain 0 resources.
I found this configuration to be quite common in the machines I work with.
Those machines have a BIOS and not UEFI firmware, but I really think
edk2 would benefit from being compatible with the above.

I hope this helps in understanding the issue,
Marcel




Regards,
Ray


-----Original Message-----
From: Marcel Apfelbaum [mailto:marcel.apfelb...@gmail.com]
Sent: Monday, February 8, 2016 6:56 PM
To: Ni, Ruiyu <ruiyu...@intel.com>; Laszlo Ersek <ler...@redhat.com>
Cc: Justen, Jordan L <jordan.l.jus...@intel.com>;
edk2-de...@ml01.01.org;
Tian, Feng <feng.t...@intel.com>; Fan, Jeff <jeff....@intel.com>
Subject: Re: [edk2] [Patch V4 4/4] MdeModulePkg: Add generic
PciHostBridgeDxe driver.

Hi,

I am sorry for the noise; I am re-sending this mail from an e-mail address
subscribed to the list.

Thanks,
Marcel

On 02/08/2016 12:41 PM, Marcel Apfelbaum wrote:
On 02/06/2016 09:09 AM, Ni, Ruiyu wrote:
Marcel,
Please see my reply embedded below.

On 2016-02-02 19:07, Laszlo Ersek wrote:
On 02/01/16 16:07, Marcel Apfelbaum wrote:
On 01/26/2016 07:17 AM, Ni, Ruiyu wrote:
Laszlo,
I now understand your problem.
Can you tell me why OVMF needs multiple root bridge support?
My understanding of OVMF is that it's a firmware which can be used in a
guest VM environment to boot an OS.
The multiple root bridge requirement currently mainly comes from high-end
servers.
Do you mean that the VM guest needs to be like a high-end server?
This may help me think about a possible solution to your problem.
Hi Ray,

Laszlo's explanation is very good; this is not exactly about high-end VMs.
We need the extra root bridges to match assigned devices to their
corresponding NUMA node.

Regarding the OVMF issue, the main problem is that the extra root bridges
are created dynamically for the VMs (command line parameter) and their
resources are computed on the fly.

Not directly related to the above: the optimal way to allocate resources
for PCI root bridges sharing the same PCI domain is to sort the devices'
MEM/IO ranges from the biggest to the smallest and use this order during
allocation.

After the resource allocation is finished we can build the CRS for each
PCI root bridge and pass it back to the firmware/OS.

While for "real" machines we can hard-code the root bridge
resources in
some ROM and have it
extracted early in the boot process, for the VM world this would not
be
possible. Also
any effort to divide the resources range before the resource
allocation
would be odd and far from optimal.

Hi Ray,
Thank you for your response,

A real machine uses hard-coded resources for root bridges. But when the
resources cannot meet certain root bridges' requirements, the firmware can
save the real resource requirements per root bridge to NV storage and
divide the resources among the root bridges on the next boot according to
the NV settings.
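
As an illustration only -- the variable name, GUID and structure below are
hypothetical, not existing edk2 definitions -- saving such per-root-bridge
requirements could look like:

  #include <Uefi.h>
  #include <Library/DebugLib.h>
  #include <Library/UefiRuntimeServicesTableLib.h>   // gRT

  //
  // Hypothetical per-root-bridge resource requirement record and variable
  // GUID; both are illustrative placeholders.
  //
  typedef struct {
    UINT64  MmioLength;
    UINT64  IoLength;
  } ROOT_BRIDGE_RESOURCE_REQUIREMENT;

  EFI_GUID mRootBridgeReqGuid = {
    0x11111111, 0x2222, 0x3333,
    { 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb }
  };

  //
  // Persist the requirements so the next boot can split the apertures
  // according to what each root bridge actually asked for.
  //
  VOID
  SaveRootBridgeRequirements (
    IN ROOT_BRIDGE_RESOURCE_REQUIREMENT  *Requirements,
    IN UINTN                             Count
    )
  {
    EFI_STATUS  Status;

    Status = gRT->SetVariable (
                    L"RootBridgeResourceReq",
                    &mRootBridgeReqGuid,
                    EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_BOOTSERVICE_ACCESS,
                    Count * sizeof (ROOT_BRIDGE_RESOURCE_REQUIREMENT),
                    Requirements
                    );
    ASSERT_EFI_ERROR (Status);
  }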
The MMIO/IO routing in the real machine I mentioned above needs to be
fixed at a very early phase, before the PciHostBridgeDxe driver runs. That
is to say, if [2G, 2.8G) is configured to route to root bridge #1, only
[2G, 2.8G) is allowed to be assigned to root bridge #1. And the routing
cannot be changed unless a platform reset is performed.

I understand.


Based on your description, it sounds like all the root bridges in OVMF
share the same range of resources and any MMIO/IO in the range can be
routed to any root bridge. For example, every root bridge can use [2G, 3G)
MMIO.

Exactly. This is true for "snooping" host bridges which do not have their
own configuration registers (or MMConfig region). They sniff host bridge 0
for configuration cycles, and if they are meant for a device on a bus
number owned by them, they forward the transaction to their primary root
bus.

Until, in the allocation phase, root bridge #1 is assigned [2G, 2.8G), #2
is assigned [2.8G, 2.9G), and #3 is assigned [2.9G, 3G).

Correct, but the regions do not have to be disjoint in the above scenario:
root bridge #1 can have [2G, 2.4G) and [2.8G, 3G) while root bridge #2 can
have [2.4G, 2.8G).

This is so the firmware can distribute the resources in an optimal way. An
example can be:
    - root bridge #1 has a PCI device A with a huge BAR and a PCI device B
      with a little BAR.
    - root bridge #2 has a PCI device C with a medium BAR.
The best way to distribute resources over [2G, 3G) is A's BAR, C's BAR, and
only then B's BAR (see the sketch below).
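
A minimal sketch of that descending-size allocation over a shared window
(plain C; the BAR sizes and the window base below are my own illustrative
choices, not values from the discussion):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* One BAR that still needs an MMIO address. */
  typedef struct {
    const char *Name;
    uint64_t    Size;   /* power of two, also the required alignment */
    uint64_t    Base;   /* filled in by the allocator */
  } BAR;

  /* Sort callback: biggest BAR first. */
  static int CompareBySizeDesc(const void *A, const void *B) {
    uint64_t SizeA = ((const BAR *)A)->Size;
    uint64_t SizeB = ((const BAR *)B)->Size;
    return (SizeA < SizeB) - (SizeA > SizeB);
  }

  int main(void) {
    /* Devices A (huge), B (small), C (medium) from the example above. */
    BAR Bars[] = {
      { "A (root bridge #1)", 0x20000000, 0 },  /* 512 MB */
      { "B (root bridge #1)", 0x00001000, 0 },  /*   4 KB */
      { "C (root bridge #2)", 0x10000000, 0 },  /* 256 MB */
    };
    uint64_t Cursor = 0x80000000;               /* shared window starts at 2G */

    /* Allocating in descending size order keeps every BAR naturally aligned
       without wasting padding, regardless of which root bridge owns it. */
    qsort(Bars, sizeof Bars / sizeof Bars[0], sizeof Bars[0], CompareBySizeDesc);
    for (size_t i = 0; i < sizeof Bars / sizeof Bars[0]; i++) {
      Bars[i].Base = Cursor;                    /* Cursor is already aligned */
      Cursor += Bars[i].Size;
      printf("%s -> [0x%llx, 0x%llx)\n", Bars[i].Name,
             (unsigned long long)Bars[i].Base,
             (unsigned long long)(Bars[i].Base + Bars[i].Size));
    }
    return 0;
  }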

So it seems that we need a way to tell the PciHostBridgeDxe driver, from
the PciHostBridgeLib, that all resources are sharable among all root
bridges.

This is exactly what we need, indeed.


The real platform case is allocation per root bridge, while the OVMF case
is allocation per PCI domain.

Indeed, bare metal servers use a different PCI domain per host bridge, but
I've actually seen real servers that have multiple root bridges sharing
the same PCI domain, 0.


Is my understanding correct?

It is, and thank you for taking the time to understand the issue,
Marcel


[...]

