If I may chime in here (I worked at Intel back when this issue was first 
encountered while porting coreboot to Jacobsville)…  The way I ended up making 
it work was to introduce another object in devicetree (which I called a root 
bridge) to model the concept of the PCI stack.  In the picture below from 
Arthur these would be PCI host bridges 0 and 1.  I called them root bridges 
because the PCI spec describes a host bridge as the path the CPU takes to get 
to the PCI 'domain', in our use of the term.

I had worked on a much earlier project (no coreboot involved) called Jasper 
Forest, where we had 2 separate CPU packages with a bus between them, and each 
had a separate set of PCI busses below it.  The CPUs were in the same coherency 
domain and each could access the busses below the other.  I considered this 
situation to be 2 host bridges, because there were 2 separate ways the CPUs 
could reach the PCI domain: one via their own direct connection, and one 
crossing the CPU-CPU bus, with the sibling CPU directing the accesses and 
responses back appropriately.  The PCI domain in this case was 'pre-split', 
allocating busses 0-0x7F to the first CPU (the one with the DMI connection to 
the PCH) and busses 0x80-0xFD to the other.  (Each CPU also had a bus number 
allocated to its own C-bus, 0xFF for the first and 0xFE for the second, which 
was used early in the boot process to configure the bus split between them.)

After the first boot cycle completed and all resources were gathered for each 
host bridge, it was determined whether each had enough resources to map in 
everything required under it, and whether a rebalance needed to happen.  If a 
rebalance had to occur (one side needed more memory or I/O space), NVRAM 
variables were set and a reset occurred so the BIOS could set up the split 
according to those variables.  This way only the first boot (or a boot after 
new devices were added to open slots) would be a long one.
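
Very roughly, and purely as a sketch of the idea (struct host_bridge, 
resources_fit() and set_rebalance_nvram() are hypothetical names, not the 
actual Jasper Forest code; only board_reset() is coreboot's), the first-boot 
rebalance flow looked something like this:

    #include <reset.h>

    /* Hedged sketch of the first-boot rebalance decision. */
    static void maybe_rebalance(struct host_bridge *hb0, struct host_bridge *hb1)
    {
            /* After gathering resources for each host bridge... */
            if (resources_fit(hb0) && resources_fit(hb1))
                    return;         /* current split is fine */

            /* One side needs more memory or I/O space: record the new
             * split in NVRAM and reset so the BIOS applies it next boot. */
            set_rebalance_nvram(hb0, hb1);
            board_reset();
    }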

Maybe the above description helps with making some choices about how to do 
things, maybe not; as Mariusz said, there are multiple (and probably always 
better) ways of doing things.  Drawing on my Jasper Forest experience, 
introducing a root bridge object to devicetree gave a nice way to describe the 
system logically, and gave the implementor a mechanism to do this 
pre-allocation of resources, so we wouldn't have to go through the possible 
reboot-to-rebalance dance described above.  The stacks (root bridges) in Xeon 
may be able to handle changing the decoding at runtime (with certain 
limitations such as bus quiescence), unlike the Jasper Forest example above, 
where changing the initial decoding of resources between CPUs required a 
reset.  Using devicetree to describe the resources was my solution to making 
enumeration faster and simpler, at the expense of flexibility.  But since 
these were mostly networking-type platforms, they were fairly static in 
nature, so it wasn't really thought to be an issue at the time.  (These are 
the same stacks used in Xeon-SP today, going back at least as far as 
Skylake-D.)  I left Intel a few years ago, before that work was completed.
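
To give a feel for it, here is a purely hypothetical devicetree sketch; the 
root_bridge keyword and the register names are made up for illustration and 
are not actual sconfig syntax:

    chip soc/intel/xeon_sp
        device domain 0 on
            # Stack 0: DMI/PCH stack, pre-assigned busses 0x00-0x1f
            device root_bridge 0 on
                register "bus_base"  = "0x00"
                register "bus_limit" = "0x1f"
            end
            # Stack 1: pre-assigned busses 0x20-0x3f
            device root_bridge 1 on
                register "bus_base"  = "0x20"
                register "bus_limit" = "0x3f"
            end
        end
    end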

Fast forward: while doing some work porting coreboot to Skylake-D early last 
year, I recalled some of the difficulties in communicating the PCI domain 
enumerated in coreboot to Tianocore via ACPI.  I remembered that it may well 
have been that the stacks were treated as host bridges, because we could 
describe in ASL that each stack had a separate and invariable (as far as 
Tianocore was concerned) set of resources.  I think I had actually done it 
that way, extending the ACPI code in coreboot to generate a PCI host bridge 
device whenever the new root bridge object was encountered in the devicetree.  
During the Skylake-D work (which was eventually dropped) I ran into a problem: 
if not all of the memory space in the system was allocated or reserved 
(meaning holes were left in the memory map), and a device under a stack wasn't 
allocated resources because the stack didn't have a large enough window, Linux 
would assume those holes were subtractively decoded to the PCI domain and try 
to place the device's resources there.  Another thing that added complexity 
was that each stack had its own IOAPIC, and in non-APIC mode all virtual 
legacy wire interrupts had to be forwarded down to the stack that had the PCH 
before the interrupts got back to the CPU.
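
For reference, a minimal sketch of generating such a per-stack host bridge 
with coreboot's acpigen.  The device name, segment, and bus range are made-up 
example values; only the acpigen calls themselves are real coreboot functions:

    #include <acpi/acpigen.h>

    /* Emit one host bridge ACPI device with a fixed ("invariable")
     * bus range.  A real implementation would also emit the memory
     * and I/O windows, e.g. with acpigen_resource_dword()/_qword(). */
    static void emit_stack_host_bridge(const char *name,
                                       unsigned int base, unsigned int limit)
    {
            acpigen_write_device(name);             /* e.g. "PC01" */
            acpigen_write_name("_HID");
            acpigen_emit_eisaid("PNP0A08");         /* PCIe host bridge */
            acpigen_write_name_integer("_SEG", 0);
            acpigen_write_name_integer("_BBN", base);
            acpigen_write_name("_CRS");
            acpigen_write_resourcetemplate_header();
            /* WordBusNumber: res_type 2, MIN_FIXED | MAX_FIXED = 0xc */
            acpigen_resource_word(2, 0xc, 0, 0, base, limit, 0,
                                  limit - base + 1);
            acpigen_write_resourcetemplate_footer();
            acpigen_pop_len();                      /* close Device() */
    }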

Not sure if any of this helps or if it just sounds like rambling, but I 
thought some of these thoughts might be helpful for design decisions made in 
the future.  Personally I liked the idea of having the stacks understood in 
devicetree, but there were some drawbacks as well.  One potential drawback is 
whether the stack implementation in the hardware is flexible enough for what 
you might like to do in devicetree as far as assigning bus ranges, etc.  A 
stack's maximum bus number is determined by the starting bus number of the 
next stack in line, as the sketch below illustrates.
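
In code terms, something like this (a hypothetical sketch; stack_base[] and 
num_stacks are assumed inputs describing the per-stack starting bus numbers):

    /* A stack's bus range ends where the next stack's begins. */
    static unsigned int stack_bus_limit(const unsigned int *stack_base,
                                        int num_stacks, int i)
    {
            if (i + 1 < num_stacks)
                    return stack_base[i + 1] - 1;
            return 0xff;    /* last stack runs to the end of the segment */
    }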

Some further info regarding the stacks which may influence any future designs… 
Intel regards a stack as being 'implemented' when it decodes a PCIe root 
bridge below it.  Stack-based SoC designs may not implement all stacks, as 
they may have different PCIe requirements.  The thing to understand (at least 
on the current generation of stack-based designs) is that the 
devices/functions that used to be part of what was called the Uncore (memory 
controllers, CSI/UPI bus configuration, etc.) are now spread across devices 
8-31 of each stack's root bus.  The exception is stack 0, which only has 
uncore device 8 on it, because it is also the DMI decode to the PCH complex, 
which has (and can only have) devices 9-31 on it.  So while a stack may be 
'unimplemented', it still needs a bus number if the uncore devices on it need 
to be accessible (or at least one that doesn't collide with other bus 
assignments if they aren't needed).  For example, the uncore integrated memory 
controller device(s) are now on stack 2, devices 8-13 (SKX/CSX platforms); 
stack 2 needs a bus number assigned to it (via a register in stack 0 dev 8) in 
order to access the IMC registers.  By default this bus number is 2, and the 
stack bus decoder takes precedence, so stack bus numbers must increase from 
one stack to the next.  The kind of thing that can't happen at early boot, in 
this case, is trying to 'fake' a bus number decoding, say for some device 
under a PCIe root bridge on the PCH: you wouldn't be able to set up a PCIe 
root bridge's subordinate/secondary bus decoding to reach it until you've 
changed the stack bus numbering from the power-on default.
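
A hedged sketch of reprogramming that at early boot; the function number, 
register offset and layout below are placeholders (the real definition lives 
in the platform documentation for stack 0 device 8), and only the pci_s_* 
accessor and the PCI_DEV() macro are coreboot's:

    #include <device/pci_ops.h>

    #define UNCORE_BUSNO_REG 0xcc   /* placeholder offset, not verified */

    /* Assign bus numbers to stacks 0-3; they must increase per stack,
     * and must be written before touching devices behind stacks 1+. */
    static void set_stack_busnos(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
    {
            const uint32_t busnos = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
            pci_s_write_config32(PCI_DEV(0, 8, 2), UNCORE_BUSNO_REG, busnos);
    }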

One of the upshots of this new scheme (and probably the reason it was done 
this way in the first place) is that none of the uncore devices use any MMIO 
resources for internal registers any more.  When more register space is 
needed, it will be in MMCFG space.
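
In other words, anything beyond the legacy 256-byte config header is only 
reachable through ECAM-style MMCFG accesses.  A sketch (the Kconfig symbol 
name is from current coreboot, older trees call it MMCONF_BASE_ADDRESS; the 
bus/device/offset are example values only):

    #include <device/mmio.h>
    #include <device/pci_def.h>

    /* ECAM address = base + (bus << 20) + (devfn << 12) + offset;
     * offsets >= 0x100 have no 0xcf8/0xcfc fallback at all. */
    uint32_t uncore_reg = read32p(CONFIG_ECAM_MMCONF_BASE_ADDRESS +
                                  (2 << 20) + (PCI_DEVFN(8, 0) << 12) + 0x200);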

From: Lance Zhao <[email protected]>
Sent: Friday, March 18, 2022 12:06 AM
To: Nico Huber <[email protected]>
Cc: Arthur Heymans <[email protected]>; coreboot <[email protected]>
Subject: [coreboot] Re: Multi domain PCI resource allocation: How to deal with 
multiple root busses on one domain


The stack idea is from 
https://www.intel.com/content/www/us/en/developer/articles/technical/utilizing-the-intel-xeon-processor-scalable-family-iio-performance-monitoring-events.html.

In Linux, a domain is sometimes the same as a "segment"; I am not sure whether 
current coreboot on xeon_sp covers the case of multiple segments yet.

On Fri, Mar 18, 2022 at 02:50, Nico Huber <[email protected]> wrote:
Hi Arthur,

On 17.03.22 19:03, Arthur Heymans wrote:
> Now my question is the following:
> On some Stacks there are multiple root busses, but the resources need to be
> allocated on the same window. My initial idea was to add those root busses
> as separate struct bus in the domain->link_list. However currently the
> allocator assumes only one bus on domains (and bridges).
> In the code you'll see a lot of things like
>
> for (child = domain->link_list->children; child; child = child->sibling)
>       ....

this is correct, we often (if not always by now) ignore that `link_list`
is a list itself and only walk the children of the first entry.
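
Walking the full list would look like this (a quick sketch using the
same fields as the loop quoted above):

    /* Visit every downstream link, then each link's children,
     * instead of only link_list->children. */
    for (struct bus *link = domain->link_list; link; link = link->next) {
            for (struct device *child = link->children; child;
                 child = child->sibling) {
                    /* ... handle child ... */
            }
    }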

>
> This is fine if there is only one bus on the domain.
> Looping over link_list->next, struct bus'ses is certainly an option here,
> but I was told that having only one bus here was a design decision on the
> allocator v4 rewrite. I'm not sure how common that assumption is in the
> tree, so things could be broken in awkward ways.

I wouldn't say it was a design choice, probably rather a convenience
choice. The old concepts around multiple buses directly downstream of
a single device seemed inconsistent, AFAICT. And at the time the
allocator v4 was written it seemed unnecessary to keep compatibility
around.

That doesn't mean we can't bring it back, of course. There is at least
one alternative, though.

The currently common case looks like this:


          PCI bus 0
             |
             v

  domain 0 --.
             |-- PCI 00:00.0
             |
             |-- PCI 00:02.0
             |
             :


Now we could have multiple PCI buses directly below the domain. But
instead of modelling this with the `link_list`, we could also model
it with an abstract "host" bus below the domain device and another
layer of "host bridge" devices in between:

          host bus
             |
             v

  domain 0 --.
             |-- PCI host bridge 0 --.
             |                       |-- PCI 00:00.0
             |                       |
             |                       `-- PCI 00:02.0
             |
             |
             |-- PCI host bridge 1 --.
             |                       |-- PCI 16:00.0
             |                       |
             |                       :
             :


I guess this would reduce complexity in generic code at the expense of
more data structures (devices) to manage. OTOH, if we made a final
decision for such a model, we could also get rid of the `link_list`,
basically setting in stone that we only allow one bus downstream of
any device node.
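
For illustration, generic traversal would then be as simple as this
sketch, where downstream() stands for a hypothetical accessor
returning that single bus (NULL for leaf devices):

    static void visit(struct device *dev)
    {
            struct bus *bus = downstream(dev);
            if (!bus)
                    return;         /* leaf device, nothing below */
            for (struct device *child = bus->children; child;
                 child = child->sibling)
                    visit(child);
    }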

I'm not fully familiar with the hierarchy on Xeon-SP systems. Would
this be an adequate solution? Also, does the term `stack` map to our
`domain` 1:1 or are there differences?

Nico
_______________________________________________
coreboot mailing list -- [email protected]
To unsubscribe send an email to [email protected]
