Plus:

1 GB hugepages improved neither bandwidth nor latency; results remained the
same.

2017-08-08 9:44 GMT+08:00 Bob Chen <a175818...@gmail.com>:

> 1. How to test the KVM exit rate?
>
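(One common way to observe the KVM exit rate on the host, assuming perf was
built with KVM tracepoint support, is something like the following sketch;
run it while the guest benchmark is going:)

```shell
# Record KVM exit events system-wide for 10 seconds, then summarize
# exits broken down by exit reason (requires root on the host).
perf kvm stat record -a sleep 10
perf kvm stat report

# Or just count raw exit events over the same window:
perf stat -e 'kvm:kvm_exit' -a sleep 10
```

Comparing the counts between the direct-attach and emulated-switch
topologies would show whether the exit rate differs.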
> 2. The switches are separate devices of PLX Technology
>
> # lspci -s 07:08.0 -nn
> 07:08.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port
> PCI Express Gen 3 (8.0 GT/s) Switch [10b5:8747] (rev ca)
>
> # This is one of the Root Ports in the system.
> [0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
>           +-01.0-[01]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
>           +-02.0-[02-05]--
>           +-03.0-[06-09]----00.0-[07-09]--+-08.0-[08]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
>           |                               |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
>           |                               \-10.0-[09]--+-00.0  NVIDIA Corporation GP102 [TITAN Xp]
>           |                                            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
>
>
>
>
> 3. ACS
>
> It seems I had misunderstood your point. I finally found the ACS
> information on the switches, not on the GPUs.
>
> Capabilities: [f24 v1] Access Control Services
> ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
> ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
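
(For anyone reproducing this, the ACS state above can be read from the host
with plain lspci; a sketch, using the switch port address from the earlier
listing:)

```shell
# Dump the ACS capability and control registers on the PLX downstream
# port (run as root; -vvv is needed to decode extended capabilities).
lspci -s 07:08.0 -vvv | grep -A2 'Access Control'
```

ACSCap lists what the port supports; ACSCtl shows which of those features
are currently enabled.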
>
>
>
> 2017-08-07 23:52 GMT+08:00 Alex Williamson <alex.william...@redhat.com>:
>
>> On Mon, 7 Aug 2017 21:00:04 +0800
>> Bob Chen <a175818...@gmail.com> wrote:
>>
>> > Bad news... Performance dropped dramatically when using emulated
>> > switches.
>> >
>> > I was referring to the PCIe doc at
>> > https://github.com/qemu/qemu/blob/master/docs/pcie.txt
>> >
>> > # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine
>> > q35,accel=kvm -nodefaults -nodefconfig \
>> > -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
>> > -device x3130-upstream,id=upstream_port1,bus=root_port1 \
>> > -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11 \
>> > -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12 \
>> > -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
>> > -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
>> > -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
>> > -device x3130-upstream,id=upstream_port2,bus=root_port2 \
>> > -device xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21 \
>> > -device xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22 \
>> > -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
>> > -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
>> > ...
>> >
>> >
>> > Not 8 GPUs this time, only 4.
>> >
>> > *1. Attached to pcie bus directly (former situation):*
>> >
>> > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 420.93  10.03  11.07  11.09
>> >      1  10.04 425.05  11.08  10.97
>> >      2  11.17  11.17 425.07  10.07
>> >      3  11.25  11.25  10.07 423.64
>> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 425.98  10.03  11.07  11.09
>> >      1   9.99 426.43  11.07  11.07
>> >      2  11.04  11.20 425.98   9.89
>> >      3  11.21  11.21  10.06 425.97
>> > Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 430.67  10.45  19.59  19.58
>> >      1  10.44 428.81  19.49  19.53
>> >      2  19.62  19.62 429.52  10.57
>> >      3  19.60  19.66  10.43 427.38
>> > Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 429.47  10.47  19.52  19.39
>> >      1  10.48 427.15  19.64  19.52
>> >      2  19.64  19.59 429.02  10.42
>> >      3  19.60  19.64  10.47 427.81
>> > P2P=Disabled Latency Matrix (us)
>> >    D\D     0      1      2      3
>> >      0   4.50  13.72  14.49  14.44
>> >      1  13.65   4.53  14.52  14.33
>> >      2  14.22  13.82   4.52  14.50
>> >      3  13.87  13.75  14.53   4.55
>> > P2P=Enabled Latency Matrix (us)
>> >    D\D     0      1      2      3
>> >      0   4.44  13.56  14.58  14.45
>> >      1  13.56   4.48  14.39  14.45
>> >      2  13.85  13.93   4.86  14.80
>> >      3  14.51  14.23  14.70   4.72
>> >
>> >
>> > *2. Attached to emulated Root Port and Switches:*
>> >
>> > Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 420.48   3.15   3.12   3.12
>> >      1   3.13 422.31   3.12   3.12
>> >      2   3.08   3.09 421.40   3.13
>> >      3   3.10   3.10   3.13 418.68
>> > Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 418.68   3.14   3.12   3.12
>> >      1   3.15 420.03   3.12   3.12
>> >      2   3.11   3.10 421.39   3.14
>> >      3   3.11   3.08   3.13 419.13
>> > Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 424.36   5.36   5.35   5.34
>> >      1   5.36 424.36   5.34   5.34
>> >      2   5.35   5.36 425.52   5.35
>> >      3   5.36   5.36   5.34 425.29
>> > Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>> >    D\D     0      1      2      3
>> >      0 422.98   5.35   5.35   5.35
>> >      1   5.35 423.44   5.34   5.33
>> >      2   5.35   5.35 425.29   5.35
>> >      3   5.35   5.34   5.34 423.21
>> > P2P=Disabled Latency Matrix (us)
>> >    D\D     0      1      2      3
>> >      0   4.79  16.59  16.38  16.22
>> >      1  16.62   4.77  16.35  16.69
>> >      2  16.77  16.66   4.03  16.68
>> >      3  16.54  16.56  16.78   4.08
>> > P2P=Enabled Latency Matrix (us)
>> >    D\D     0      1      2      3
>> >      0   4.51  16.56  16.58  16.66
>> >      1  15.65   3.87  16.74  16.61
>> >      2  16.59  16.81   3.96  16.70
>> >      3  16.47  16.28  16.68   4.03
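
(For a rough sense of scale, the off-diagonal numbers in the two sets of
matrices can be compared directly; the cells below are one representative
GPU pair, GPU0 to GPU2:)

```python
# Representative off-diagonal cells from the matrices above.
direct_bw = 11.07     # GB/s, unidirectional P2P, GPUs attached to pcie.0 directly
switched_bw = 3.12    # GB/s, same transfer behind emulated switches
direct_lat = 14.49    # us, P2P=Disabled latency, direct attach
switched_lat = 16.38  # us, same pair behind emulated switches

bw_ratio = direct_bw / switched_bw
lat_ratio = switched_lat / direct_lat
print(f"bandwidth dropped {bw_ratio:.1f}x, latency rose {lat_ratio:.2f}x")
```

So peer-to-peer bandwidth falls by roughly 3.5x while latency grows only
modestly, which looks more like a throughput cap than per-transaction
overhead.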
>> >
>> >
>> > Is it because the heavy load of CPU emulation caused a bottleneck?
>>
>> QEMU should really not be involved in the data flow, once the memory
>> slots are configured in KVM, we really should not be exiting out to
>> QEMU regardless of the topology.  I wonder if it has something to do
>> with the link speed/width advertised on the switch port.  I don't think
>> the endpoint can actually downshift the physical link, so lspci on the
>> host should probably still show the full bandwidth capability, but
>> maybe the driver is somehow doing rate limiting.  PCIe gets a little
>> more complicated as we go to newer versions, so it's not quite as
>> simple as exposing a different bit configuration to advertise 8GT/s,
>> x16.  Last I tried to do link matching it was deemed too complicated
>> for something I couldn't prove at the time had measurable value.  This
>> might be a good way to prove that value if it makes a difference here.
>> I can't think why else you'd see such a performance difference, but
>> testing to see if the KVM exit rate is significantly different could
>> still be an interesting verification.  Thanks,
>>
>> Alex
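
(The link speed/width theory can be checked with plain lspci; a sketch,
using the GPU addresses from the command line above:)

```shell
# On the host: compare what the slot supports (LnkCap) with the speed
# and width the link actually trained to (LnkSta) for an assigned GPU.
lspci -s 08:00.0 -vvv | grep -E 'LnkCap:|LnkSta:'

# Inside the guest, the same check against the emulated downstream port
# and the GPU would show what link the driver believes it has; a guest
# link reported as e.g. 2.5GT/s x1 instead of 8GT/s x16 would support
# the rate-limiting idea.
```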
>>
>
>
