Re: [Qemu-devel] Qemu and 32 PCIe devices

Laszlo Ersek Tue, 08 Aug 2017 08:51:56 -0700

On 08/08/17 12:39, Marcin Juszkiewicz wrote:
>
> A few days ago I had an issue getting PCIe hotplug working on an
> AArch64 machine. I enabled PCI hotplug in the kernel and then ran
> into some issues.
>
> Our setup is a bunch of aarch64 servers, and we use OpenStack to
> provide access to arm64 systems. OpenStack uses libvirt to control VMs
> and allows adding network interfaces and disk volumes to running
> systems.
>
> libvirt treats AArch64 as a PCIe machine without legacy PCI slots. So
> to hotplug anything you first need enough pcie-root-port entries, as
> described in QEMU's docs/pcie.txt and in a patch to the libvirt
> documentation [1][2].
>
> 1. https://bugs.linaro.org/attachment.cgi?id=782
> 2. https://www.redhat.com/archives/libvir-list/2017-July/msg01033.html
>
>
> But things get complicated once you approach the limit of 32 PCIe
> devices (which will probably not happen in our setup). UEFI first
> takes ages to boot, only to land in the UEFI shell because it forgot
> all PCIe devices. With 31 devices it boots (also after a long time).
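
(For readers unfamiliar with the root-port prerequisite mentioned above,
here is a minimal sketch of how empty hotplug ports could be declared on
the QEMU command line. The helper name root_port_args and the rpN ids are
made up for illustration; "pcie.0" is assumed to be the root bus name,
and chassis/slot are simply numbered consecutively. libvirt generates its
own, different, ids and addresses.)

```python
from typing import List


def root_port_args(count: int, root_bus: str = "pcie.0") -> List[str]:
    """Build QEMU '-device pcie-root-port' arguments for `count` empty ports.

    Hypothetical helper: ids (rp1, rp2, ...) and the chassis/slot numbering
    are illustrative choices, not what libvirt or OpenStack would emit.
    """
    args: List[str] = []
    for i in range(1, count + 1):
        args += [
            "-device",
            f"pcie-root-port,id=rp{i},bus={root_bus},chassis={i},slot={i}",
        ]
    return args


if __name__ == "__main__":
    # Print the extra arguments for four hotplug-capable ports.
    print(" ".join(root_port_args(4)))
```

(Each such port can later accept one hot-plugged device, by naming the
port's id as the device's bus. Each port also consumes a bus number,
which matters for point (2) below.)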


OK, let me quote my earlier off-list followup (so that the discussion
can proceed publicly -- hopefully other thread participants will also
repeat their off-list messages, and/or correct my account of them):

On 08/07/17 19:32, Laszlo Ersek wrote:
>
> (1) Everything that's being worked out right now for PCI Express
> hotplug on Q35 (x86) applies equally to aarch64. Meaning, hotplug
> oriented aperture reservations for bus numbers, "IO ports", and
> various types of MMIO.
>
> This is to say that resource reservation is not a done deal (it's
> being designed) for x86 even, and it will take changes for both QEMU
> and OVMF. In OVMF, the relevant driver is "OvmfPkg/PciHotPlugInitDxe".
> Once we implement the necessary logic there (using a new
> "communication channel" with QEMU), it should be possible to include
> the driver in the ArmVirtQemu builds as well.
>
> Marcel, can you please provide pointers to the qemu-devel and seabios
> mailing list discussions?

(Marcel provided the following links, "IO/MEM/Bus reservation hints":

 SeaBIOS:
   - https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg06289.html
   - https://www.mail-archive.com/qemu-devel@nongnu.org/msg468550.html
   - https://www.mail-archive.com/qemu-devel@nongnu.org/msg470584.html
 QEMU:
   - https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07110.html
   - https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg09157.html
)

>
> The OVMF (and AAVMF) RHBZs for adhering to QEMU's IO and MMIO aperture
> hints are:
>
> - https://bugzilla.redhat.com/show_bug.cgi?id=1434740
> - https://bugzilla.redhat.com/show_bug.cgi?id=1434747
>
> The prerequisite QEMU RHBZs are:
> - https://bugzilla.redhat.com/show_bug.cgi?id=1344299
> - https://bugzilla.redhat.com/show_bug.cgi?id=1437113
>
> In addition, there are some issues with edk2's generic PciBusDxe as
> well; for example ATM it ignores bus number reservations for the kinds
> of hotplug controllers that we care about:
>
> https://bugzilla.tianocore.org/show_bug.cgi?id=656
>
> I got a promise on edk2-devel from Ruiyu Ni (PciBusDxe maintainer from
> Intel) that he'd fix the issue at some point, but it's "not high
> priority".
>
>
> (2) As discussed earlier, the aarch64 "virt" machine type has a
> serious limitation relative to Q35: the former's MMCONFIG space is so
> small that it allows for only 16 buses. Each PCI Express root port and
> downstream port uses up a separate bus number. Please re-check
> "docs/pcie.txt" in the QEMU source tree with particular attention to
> sections "4. Bus numbers issues" and "5. Hot-plug".
>
> When the edk2 PciBusDxe driver runs out of any kind of aperture (such
> as bus numbers, for example), it starts dropping devices. In general
> it prefers to drop the devices with the largest resource consumption
> (so that it has to drop the fewest devices to enable the rest to
> work). In case many "uniform" devices are used, it is unspecified
> which ones get dropped. This can easily lead to boot failures (if the
> dropped devices include the one(s) you wanted to boot off of).
>
> (3) The edk2 boot performance when using a large number of PCI
> (Express) devices is indeed really bad. I've looked into it earlier
> (on Q35), briefly, see for example:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1441550#c11

(Here Marcel referenced the following message, which also reports slow
PCI enumeration:
<https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07182.html>.
In this case the issue hit the guest kernel; OVMF was not used (and
SeaBIOS was unaffected).)

> As you can see, platform-independent, NIC enumeration-related code in
> edk2 is *really heavy* on UEFI variables. If you had a physical
> machine with lots of pflash, and tens (or hundreds) of NICs, the perf
> would suffer mostly the same.
>
> Anyway, beyond the things written in that comment, there is one very
> interesting symptom that makes me think another (milder?) bottleneck
> could be in QEMU:
>
> When having a large number of PCI(e) devices (to the tune of 100+),
> even the printing of DEBUG messages slows down extremely, to the point
> where I can basically follow, with the naked eye, the printing of
> *individual* characters, on the QEMU debug port. (Obviously such
> printing is unrelated to PCI devices; the QEMU debug port is a simple
> platform device on Q35 and i440fx). This suggests to me that the high
> number of MemoryRegion objects in QEMU, used for the BARs of PCI(e)
> devices and bridges, could slow down the dispatching of the individual
> IO or MMIO accesses. I don't know what data structure QEMU uses for
> representing the "flat view" of the address space, but I think it
> *could* be a bottleneck. At least I tried to profile a few bits in the
> firmware, and found nothing related specifically to the slowdown of
> DEBUG prints.

(Paolo says the data structure is a radix tree, so the bottleneck being
there would be surprising. Also Paolo has given me tips for profiling,
so I'm looking into it.)

(Another remark from Paolo, paraphrased: programming the BARs relies on
an O(n^3) algorithm, fixing which is on the todo list.)

>
> In summary:
> - all of the hotplug stuff is still under design / in flux even for
>   x86
> - MMCONFIG of "virt" is too small (too few bus numbers available)
> - the boot perf issue might even need KVM tracing (this is not to say
>   that the UEFI variable massaging during edk2 NIC binding isn't
>   resource hungry! See again RHBZ#1441550 comment 11.)
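
(To make the bus-number arithmetic in point (2) concrete, here is a small
sketch. The per-bus figure follows from the ECAM layout in the PCI
Express spec: 32 devices x 8 functions x 4 KiB of config space = 1 MiB
per bus. The 16 MiB window is an assumption inferred from the 16-bus
limit quoted above, and the helper names are hypothetical.)

```python
# ECAM (MMCONFIG) geometry as laid out by the PCI Express specification.
DEVICES_PER_BUS = 32
FUNCTIONS_PER_DEVICE = 8
CONFIG_SPACE_BYTES = 4096  # 4 KiB of config space per function

# 32 * 8 * 4 KiB = 1 MiB of ECAM address space per bus number.
BYTES_PER_BUS = DEVICES_PER_BUS * FUNCTIONS_PER_DEVICE * CONFIG_SPACE_BYTES

MiB = 1024 * 1024


def buses_in_ecam(window_bytes: int) -> int:
    """Number of bus numbers addressable through an ECAM window."""
    return window_bytes // BYTES_PER_BUS


def max_ports(window_bytes: int) -> int:
    """Upper bound on root/downstream ports: one bus number each,
    with bus 0 already taken by the root complex itself."""
    return max(buses_in_ecam(window_bytes) - 1, 0)


def ecam_offset(bus: int, device: int, function: int, register: int = 0) -> int:
    """Byte offset of a config register inside the ECAM window."""
    return (bus << 20) | (device << 15) | (function << 12) | register


if __name__ == "__main__":
    # Assumed 16 MiB window, matching the 16-bus limit discussed above.
    print(buses_in_ecam(16 * MiB))   # 16
    print(max_ports(16 * MiB))       # 15
    # A full 256-bus segment would need a 256 MiB window.
    print(buses_in_ecam(256 * MiB))  # 256
```

(So under these assumptions, at most 15 ports can each get a bus number
of their own, before any bus numbers are set aside for hotplug.)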

(Drew added a few more points about mach-virt here; I'll let him repeat
himself.)

Thanks,
Laszlo
