On Thu, 2016-10-13 at 17:05 +0300, Marcel Apfelbaum wrote:
> > +PCI EXPRESS GUIDELINES
> > +======================
> > +
> > +1. Introduction
> > +================
> > +The doc proposes best practices on how to use PCI Express/PCI device
> > +in PCI Express based machines and explains the reasoning behind them.
> > +
> > +
> > +2. Device placement strategy
> > +============================
> > +QEMU does not have a clear socket-device matching mechanism
> > +and allows any PCI/PCI Express device to be plugged into any PCI/PCI 
> > Express slot.
> > +Plugging a PCI device into a PCI Express slot might not always work and
> > +is weird anyway since it cannot be done for "bare metal".
> > +Plugging a PCI Express device into a PCI slot will hide the Extended
> > +Configuration Space thus is also not recommended.
> > +
> > +The recommendation is to separate the PCI Express and PCI hierarchies.
> > +PCI Express devices should be plugged only into PCI Express Root Ports and
> > +PCI Express Downstream ports.
> > +
> > +2.1 Root Bus (pcie.0)
> > +=====================
> > +Place only the following kinds of devices directly on the Root Complex:
> > +    (1) Devices with dedicated, specific functionality (network card,
> > +        graphics card, IDE controller, etc); place only legacy PCI devices 
> > on
> > +        the Root Complex. These will be considered Integrated Endpoints.
> > +        Note: Integrated devices are not hot-pluggable.

s/Integrated devices/Integrated Endpoints/ (which I assume
is a Spec-Originated Term) in the last sentence, to be
consistent with the one right before it.

I'm also not sure what you mean by devices with "dedicated,
specific functionality", and unfortunately the examples don't
seem to be helping me.

> > +        Although the PCI Express spec does not forbid PCI Express devices 
> > as
> > +        Integrated Endpoints, existing hardware mostly integrates legacy 
> > PCI
> > +        devices with the Root Complex. Guest OSes are suspected to behave
> > +        strangely when PCI Express devices are integrated with the Root 
> > Complex.
> > +
> > +    (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI 
> > Express
> > +        hierarchies.
> > +
> > +    (3) DMI-PCI bridges (i82801b11-bridge), for starting legacy PCI 
> > hierarchies.
> > +
> > +    (4) Extra Root Complexes (pxb-pcie), if multiple PCIe Root Buses are 
> > needed.
> > +
> > +   pcie.0 bus
> > +   
> > -----------------------------------------------------------------------------
> > +        |                |                    |                  |
> > +   -----------   ------------------   ------------------   --------------
> > +   | PCI Dev |   | PCIe Root Port |   | DMI-PCI bridge |   |  pxb-pcie  |
> > +   -----------   ------------------   ------------------   --------------
> > +
> > +2.1.1 To plug a device into a pcie.0 as Root Complex Integrated Device use:

s/Root Complex Integrated Device/Integrated Endpoint/ ?

I don't know how much any of these terms can be used
interchangeably, but it would be good IMHO if we could choose
a single term and stick to it throughout the document.

> > +          -device <dev>[,bus=pcie.0]

Is the bus=pcie.0 bit really optional? Will QEMU just assign
devices to pcie.0 automatically unless you provide an explicit
bus= option telling it otherwise?

> > +2.1.2 To expose a new PCI Express Root Bus use:
> > +          -device pxb-pcie,id=pcie.1,bus_nr=x,[numa_node=y],[addr=z]
> > +      Only PCI Express Root Ports and DMI-PCI bridges can be connected to 
> > the pcie.1 bus:
> > +          -device 
> > ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z] \
> > +          -device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.1

Here I really can't see how the bus= option would be optional
for ioh3420 but mandatory for i82801b11-bridge.

Neither for <dev> above nor for i82801b11-bridge you show
that the slot= option exists. Of course these are merely
usage examples and are not intended to replace the full
documentation: since this is the case, I think we should
make that very explicit and possibly avoid listing options
we don't use at all, eg.

  -device e1000,bus=pcie.0

  This will plug a e1000 Ethernet adapter into pcie.0 as an
  Integrated Endpoint.

and so on.

> > +2.2 PCI Express only hierarchy
> > +==============================
> > +Always use PCI Express Root Ports to start PCI Express hierarchies.
> > +
> > +A PCI Express Root bus supports up to 32 devices. Since each
> > +PCI Express Root Port is a function and a multi-function
> > +device may support up to 8 functions, the maximum possible
> > +PCI Express Root Ports per PCI Express Root Bus is 256.
> > +
> > +Prefer coupling PCI Express Root Ports into multi-function devices

s/coupling/grouping/

> > +to keep a simple flat hierarchy that is enough for most scenarios.
> > +Only use PCI Express Switches (x3130-upstream, xio3130-downstream)
> > +if there is no more room for PCI Express Root Ports.
> > +Please see section 4. for further justifications.
> > +
> > +Plug only PCI Express devices into PCI Express Ports.
> > +
> > +
> > +   pcie.0 bus
> > +   
> > ----------------------------------------------------------------------------------
> > +        |                 |                                    |
> > +   -------------    -------------                        -------------
> > +   | Root Port |    | Root Port |                        | Root Port |
> > +   ------------     -------------                        -------------
> > +         |                            
> > -------------------------|------------------------
> > +    ------------                      |                 -----------------  
> >             |
> > +    | PCIe Dev |                      |    PCI Express  | Upstream Port |  
> >             |
> > +    ------------                      |      Switch     -----------------  
> >             |
> > +                                      |                  |            |    
> >             |
> > +                                      |    -------------------    
> > -------------------  |
> > +                                      |    | Downstream Port |    | 
> > Downstream Port |  |
> > +                                      |    -------------------    
> > -------------------  |
> > +                                      
> > -------------|-----------------------|------------
> > +                                             ------------
> > +                                             | PCIe Dev |
> > +                                             ------------
> > +
> > +2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
> > +          -device 
> > ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
> > +          -device <dev>,bus=root_port1
> > +      Note that chassis parameter is compulsory, and must be unique

s/compulsory/mandatory/

> > +      for each PCI Express Root Port.
> > +2.2.2 Using multi-function PCI Express Root Ports:
> > +      -device 
> > ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0]
> >  \
> > +      -device 
> > ioh3420,id=root_port2,,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
> > +      -device 
> > ioh3420,id=root_port3,,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2] \
> > +2.2.2 Plugging a PCI Express device into a Switch:
> > +      -device 
> > ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
> > +      -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]     
> >      \
> > +      -device 
> > xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1[,slot=y1][,addr=z1]]
> >  \
> > +      -device <dev>,bus=downstream_port1
> > +
> > +
> > +2.3 PCI only hierarchy
> > +======================
> > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices.

Maybe we could add something like

  but, as mentioned above, doing so means the legacy PCI
  device in question will be incapable of hot-unplugging.

just to stress the fact.

Aside: "device" or "endpoint"? My understanding is that
endpoints are things like Ethernet adapters, and devices
are endpoints *or* controller, but I might be way off base
here ;)

> > +Besides that use DMI-PCI bridges (i82801b11-bridge) to start PCI 
> > hierarchies.
> > +
> > +Prefer flat hierarchies. For most scenarios a single DMI-PCI bridge 
> > (having 32 slots)
> > +and several PCI-PCI bridges attached to it (each supporting also 32 slots) 
> > will support
> > +hundreds of legacy devices. The recommendation is to populate one PCI-PCI 
> > bridge
> > +under the DMI-PCI bridge until is full and then plug a new PCI-PCI 
> > bridge...
> > +
> > +   pcie.0 bus
> > +   ----------------------------------------------
> > +        |                            |
> > +   -----------               ------------------
> > +   | PCI Dev |               | DMI-PCI BRIDGE |
> > +   ----------                ------------------
> > +                               |            |
> > +                        -----------    ------------------
> > +                        | PCI Dev |    | PCI-PCI Bridge |
> > +                        -----------    ------------------
> > +                                         |           |
> > +                                  -----------     -----------
> > +                                  | PCI Dev |     | PCI Dev |
> > +                                  -----------     -----------
> > +
> > +2.3.1 To plug a PCI device into a pcie.0 as Integrated Device use:
> > +      -device <dev>[,bus=pcie.0]
> > +2.3.2 Plugging a PCI device into a DMI-PCI bridge:
> > +      -device i82801b11-bridge,id=dmi_pci_bridge1,[,bus=pcie.0]    \
> > +      -device <dev>,bus=dmi_pci_bridge1[,addr=x]
> > +2.3.3 Plugging a PCI device into a PCI-PCI bridge:
> > +      -device i82801b11-bridge,id=dmi_pci_bridge1,[,bus=pcie.0]            
> >             \
> > +      -device 
> > pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1[,chassis_nr=x][,addr=y]    \
> > +      -device <dev>,bus=pci_bridge1[,addr=x]
> > +
> > +
> > +3. IO space issues
> > +===================
> > +The PCI Express Root Ports and PCI Express Downstream ports are seen by
> > +Firmware/Guest OS as PCI-PCI bridges and, as required by PCI spec,
> > +should reserve a 4K IO range for each even if only one (multifunction)
> > +device can be plugged into them, resulting in poor IO space utilization.
> > +
> > +The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
> > +by not allocating IO space if possible:
> > +    (1) - For empty PCI Express Root Ports/PCI Express Downstream ports.
> > +    (2) - If the device behind the PCI Express Root Ports/PCI Express
> > +          Downstream has no IO BARs.
> > +
> > +The IO space is very limited, 65536 byte-wide IO ports, but it's fragmented
> > +resulting in ~10 PCI Express Root Ports (or PCI Express 
> > Downstream/Upstream ports)
> > +ports per system if devices with IO BARs are used in the PCI Express 
> > hierarchy.
> > +
> > +Using the proposed device placing strategy solves this issue
> > +by using only PCI Express devices within PCI Express hierarchy.
> > +
> > +The PCI Express spec requires the PCI Express devices to work without 
> > using IO.
> > +The PCI hierarchy has no such limitations.

After reading this section and piecing all bits together, my
understanding is that

  a) the spec requires firmware and OS to allocate 4K of IO
     space for each PCI Express Root Port or PCI Express
     Switch Downstream Port

  b) that's necessary because legacy PCI Devices require IO
     space to operate

  c) however, IO space is a very limited resource

  d) the spec also requires PCI Express Devices (or rather
     Endpoints?) to work without using any IO space

  e) if we stick to the plan outlined in this document, there
     will never be legacy PCI Devices plugged into PCI Express
     Root Ports or PCI Express Switch Downstream Ports (only
     PCI Express Endpoints or PCI Express Switch Upstream
     Ports)

  f) thanks to d) and e), the firmware (and OS? IIRC Linux
     currently doesn't do this) can ignore a) and allocate no
     IO space for PCI Express Root Ports and PCI Express
     Switch Downstream Ports

  g) thus effectively making c) irrelevant

If my reading is correct, I think we should put more emphasis
on the fact that our device placement strategy is effective
in avoiding IO space exaustion because we stick to PCI Express
Devices only which, unlike legacy PCI Devices, require no IO
space to operate.

> > +4. Bus numbers issues
> > +======================
> > +Each PCI domain can have up to only 256 buses and the QEMU PCI Express
> > +machines do not support multiple PCI domains even if extra Root
> > +Complexes (pxb-pcie) are used.
> > +
> > +Each element of the PCI Express hierarchy (Root Complexes,
> > +PCI Express Root Ports, PCI Express Downstream/Upstream ports)
> > +takes up bus numbers. Since only one (multifunction) device
> > +can be attached to a PCI Express Root Port or PCI Express Downstream
> > +Port it is advised to plan in advance for the expected number of
> > +devices to prevent bus numbers starvation.
> > +
> > +
> > +5. Hot Plug

Hot Plug, hot-plug or hotplug?

Pick one and stick with it :)

> > +============
> > +The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie 
> > devices)
> > +do not support hot-plug, so any devices plugged into Root Complexes
> > +cannot be hot-plugged/hot-unplugged:
> > +    (1) PCI Express Integrated Devices
> > +    (2) PCI Express Root Ports
> > +    (3) DMI-PCI bridges
> > +    (4) pxb-pcie
> > +
> > +PCI devices can be hot-plugged into PCI-PCI bridges, however cannot
> > +be hot-plugged into DMI-PCI bridges.
> > +The PCI hotplug is ACPI based and can work side by side with the
> > +PCI Express native hotplug.
> > +
> > +PCI Express devices can be natively hot-plugged/hot-unplugged into/from
> > +PCI Express Root Ports (and PCI Express Downstream Ports).
> > +
> > +5.1 Planning for hotplug:
> > +    (1) PCI hierarchy
> > +        Leave enough PCI-PCI bridge slots empty or add one
> > +        or more empty PCI-PCI bridges to the DMI-PCI bridge.
> > +
> > +        For each such bridge the Guest Firmware is expected to reserve 4K 
> > IO
> > +        space and 2M MMIO range to be used for all devices behind it.
> > +
> > +        Because of the hard IO limit of around 10 PCI bridges (~ 40K 
> > space) per system
> > +        don't use more than 9 bridges, leaving 4K for the Integrated 
> > devices
> > +        and none for the PCI Express Hierarchy.

I would rewrite the last line as

  (the PCI Express hierarchy needs no IO space).

or something like that. Unless I misunderstood :)

> > +    (2) PCI Express hierarchy:
> > +        Leave enough PCI Express Root Ports empty. Use multifunction
> > +        PCI Express Root Ports to prevent going out of PCI bus numbers.

When you say "multifunction PCI Express Root Ports", are you
suggesting the user should take advantage of all 8 functions
in a PCI Express Root Port, eg. by plugging 4 Ethernet
adapters into the same PCI Express Root Port instead of using
a single PCI Express Root Port for each one of them, or that
multiple PCI Express Root Ports should be plugged into
functions of the same pcie.0 slot?

Either way, I don't see how doing so would prevent you from
running out of PCI bus numbers - if anything, not doing the
latter will make it so you'll run out of ports way before
you come anywhere close to 256 PCI buses.

Point is, being less terse here could be very helpful.

> > +        Don't use PCI Express Switches if you don't have too, they use
> > +        an extra PCI bus that may handy to plug another device id it comes 
> > to it.

s/too/to/
s/may handy/may come handy/
s/id it comes/if it comes/

> > +5.3 Hot plug example:
> > +Using HMP: (add -monitor stdio to QEMU command line)
> > +  device_add <dev>,id=<id>,bus=<pcie.0/PCI Express Root Port Id/PCI-PCI 
> > bridge Id/pxb-pcie Id>
> > +
> > +
> > +6. Device assignment
> > +====================
> > +Host devices are mostly PCI Express and should be plugged only into
> > +PCI Express Root Ports or PCI Express Downstream Ports.
> > +PCI-PCI bridge slots can be used for legacy PCI host devices.
> > +
> > +6.1 How to detect if a device is PCI Express:
> > +  > lspci -s 03:00.0 -v (as root)
> > +
> > +    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
> > +    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
> > +    Flags: bus master, fast devsel, latency 0, IRQ 50
> > +    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
> > +    Capabilities: [c8] Power Management version 3
> > +    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > +    Capabilities: [40] Express Endpoint, MSI 00
> > +

I'd skip this empty line...

> > +    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +

... and this one too.

> > +    Capabilities: [100] Advanced Error Reporting
> > +    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
> > +    Capabilities: [14c] Latency Tolerance Reporting
> > +    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014

I'd also add a small note along the lines of

  If you can see the "Express Endpoint" capability in the
  output, then the device is indeed PCI Express.

> > +7. Virtio devices
> > +=================
> > +Virtio devices plugged into the PCI hierarchy or as Integrated Devices
> > +will remain PCI and have transitional behaviour as default.
> > +Transitional virtio devices work in both IO and MMIO modes depending on
> > +the guest support.
> > +
> > +Virtio devices plugged into PCI Express ports are PCI Express devices and
> > +have "1.0" behavior by default without IO support.
> > +In both case disable-* properties can be used to override the behaviour.
> > +
> > +Note that setting disable-legacy=off will enable legacy mode (enabling
> > +legacy behavior) for PCI Express virtio devices causing them to
> > +require IO space, which, given our PCI Express hierarchy, may quickly
> > +lead to resource exhaustion, and is therefore strongly discouraged.
> > +
> > +
> > +8. Conclusion
> > +==============
> > +The proposal offers a usage model that is easy to understand and follow
> > +and in the same time overcomes the PCI Express architecture limitations.

s/in the same/at the same/


Overall I think the contents we care about are pretty much
there and are exposed and organized in a very clear fashion;
pretty much all of my comments are meant to improve the few
areas where I believe we could make things even easier to
grasp for the reader, remove potential ambiguity, or be more
consistent.

Thanks for working on this! Looking forward to v3 ;)

-- 
Andrea Bolognani / Red Hat / Virtualization

Reply via email to