On Mon, 7 Aug 2017 21:00:04 +0800 Bob Chen <a175818...@gmail.com> wrote:
> Bad news... The performance had dropped dramatically when using emulated
> switches.
>
> I was referring to the PCIe doc at
> https://github.com/qemu/qemu/blob/master/docs/pcie.txt
>
> # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine q35,accel=kvm -nodefaults -nodefconfig \
>     -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
>     -device x3130-upstream,id=upstream_port1,bus=root_port1 \
>     -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11 \
>     -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12 \
>     -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
>     -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
>     -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
>     -device x3130-upstream,id=upstream_port2,bus=root_port2 \
>     -device xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21 \
>     -device xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22 \
>     -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
>     -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
>     ...
>
> Not 8 GPUs this time, only 4.
>
> *1. Attached to pcie bus directly (former situation):*
>
> Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 420.93  10.03  11.07  11.09
>      1  10.04 425.05  11.08  10.97
>      2  11.17  11.17 425.07  10.07
>      3  11.25  11.25  10.07 423.64
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 425.98  10.03  11.07  11.09
>      1   9.99 426.43  11.07  11.07
>      2  11.04  11.20 425.98   9.89
>      3  11.21  11.21  10.06 425.97
> Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 430.67  10.45  19.59  19.58
>      1  10.44 428.81  19.49  19.53
>      2  19.62  19.62 429.52  10.57
>      3  19.60  19.66  10.43 427.38
> Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 429.47  10.47  19.52  19.39
>      1  10.48 427.15  19.64  19.52
>      2  19.64  19.59 429.02  10.42
>      3  19.60  19.64  10.47 427.81
> P2P=Disabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.50  13.72  14.49  14.44
>      1  13.65   4.53  14.52  14.33
>      2  14.22  13.82   4.52  14.50
>      3  13.87  13.75  14.53   4.55
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.44  13.56  14.58  14.45
>      1  13.56   4.48  14.39  14.45
>      2  13.85  13.93   4.86  14.80
>      3  14.51  14.23  14.70   4.72
>
>
> *2. Attached to emulated Root Port and Switches:*
>
> Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 420.48   3.15   3.12   3.12
>      1   3.13 422.31   3.12   3.12
>      2   3.08   3.09 421.40   3.13
>      3   3.10   3.10   3.13 418.68
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 418.68   3.14   3.12   3.12
>      1   3.15 420.03   3.12   3.12
>      2   3.11   3.10 421.39   3.14
>      3   3.11   3.08   3.13 419.13
> Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 424.36   5.36   5.35   5.34
>      1   5.36 424.36   5.34   5.34
>      2   5.35   5.36 425.52   5.35
>      3   5.36   5.36   5.34 425.29
> Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 422.98   5.35   5.35   5.35
>      1   5.35 423.44   5.34   5.33
>      2   5.35   5.35 425.29   5.35
>      3   5.35   5.34   5.34 423.21
> P2P=Disabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.79  16.59  16.38  16.22
>      1  16.62   4.77  16.35  16.69
>      2  16.77  16.66   4.03  16.68
>      3  16.54  16.56  16.78   4.08
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.51  16.56  16.58  16.66
>      1  15.65   3.87  16.74  16.61
>      2  16.59  16.81   3.96  16.70
>      3  16.47  16.28  16.68   4.03
>
>
> Is it because the heavy load of CPU emulation had caused a bottleneck?
QEMU should really not be involved in the data flow; once the memory slots are configured in KVM, we really should not be exiting out to QEMU regardless of the topology.

I wonder if it has something to do with the link speed/width advertised on the switch port. I don't think the endpoint can actually downshift the physical link, so lspci on the host should probably still show the full bandwidth capability, but maybe the driver is somehow doing rate limiting. PCIe gets a little more complicated as we go to newer versions, so it's not quite as simple as exposing a different bit configuration to advertise 8GT/s, x16. Last I tried to do link matching it was deemed too complicated for something I couldn't prove at the time had measurable value. This might be a good way to prove that value, if it makes a difference here.

I can't think why else you'd see such a performance difference, but testing to see if the KVM exit rate is significantly different could still be an interesting verification. Thanks,

Alex
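
For reference, a rough sketch of both checks suggested above. The host BDF (08:00.0) is taken from the command line in the quoted message; the guest-side addresses will differ, and the 10-second sampling window is arbitrary:

    # Negotiated link speed/width of one physical GPU on the host:
    lspci -s 08:00.0 -vvv | grep -E 'LnkCap|LnkSta'

    # Same check inside the guest, to see what the emulated downstream
    # ports and the passed-through endpoints advertise there:
    lspci -vvv | grep -E 'LnkCap|LnkSta'

    # Count KVM exits on the host while the P2P bandwidth test runs in
    # the guest; repeat for each topology and compare the rates:
    perf stat -e 'kvm:kvm_exit' -a sleep 10

If the guest reports a much slower/narrower link on the emulated ports (e.g. 2.5GT/s, x1) while the host still shows the full capability and the exit counts stay comparable, the advertised link becomes the prime suspect; if the exit rate instead jumps with the switch topology, the data path is somehow bouncing through QEMU after all.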