On Mon, 7 Aug 2017 21:00:04 +0800 Bob Chen <a175818...@gmail.com> wrote:
> Bad news... The performance had dropped dramatically when using emulated
> switches.
>
> I was referring to the PCIe doc at
> https://github.com/qemu/qemu/blob/master/docs/pcie.txt
>
> # qemu-system-x86_64_2.6.2 -enable-kvm -cpu host,kvm=off -machine q35,accel=kvm -nodefaults -nodefconfig \
>     -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
>     -device x3130-upstream,id=upstream_port1,bus=root_port1 \
>     -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=11,slot=11 \
>     -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=12,slot=12 \
>     -device vfio-pci,host=08:00.0,multifunction=on,bus=downstream_port1 \
>     -device vfio-pci,host=09:00.0,multifunction=on,bus=downstream_port2 \
>     -device ioh3420,id=root_port2,chassis=2,slot=2,bus=pcie.0 \
>     -device x3130-upstream,id=upstream_port2,bus=root_port2 \
>     -device xio3130-downstream,id=downstream_port3,bus=upstream_port2,chassis=21,slot=21 \
>     -device xio3130-downstream,id=downstream_port4,bus=upstream_port2,chassis=22,slot=22 \
>     -device vfio-pci,host=89:00.0,multifunction=on,bus=downstream_port3 \
>     -device vfio-pci,host=8a:00.0,multifunction=on,bus=downstream_port4 \
>     ...
>
> Not 8 GPUs this time, only 4.
>
> *1. Attached to pcie bus directly (former situation):*
>
> Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 420.93  10.03  11.07  11.09
>      1  10.04 425.05  11.08  10.97
>      2  11.17  11.17 425.07  10.07
>      3  11.25  11.25  10.07 423.64
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 425.98  10.03  11.07  11.09
>      1   9.99 426.43  11.07  11.07
>      2  11.04  11.20 425.98   9.89
>      3  11.21  11.21  10.06 425.97
> Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 430.67  10.45  19.59  19.58
>      1  10.44 428.81  19.49  19.53
>      2  19.62  19.62 429.52  10.57
>      3  19.60  19.66  10.43 427.38
> Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 429.47  10.47  19.52  19.39
>      1  10.48 427.15  19.64  19.52
>      2  19.64  19.59 429.02  10.42
>      3  19.60  19.64  10.47 427.81
> P2P=Disabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.50  13.72  14.49  14.44
>      1  13.65   4.53  14.52  14.33
>      2  14.22  13.82   4.52  14.50
>      3  13.87  13.75  14.53   4.55
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.44  13.56  14.58  14.45
>      1  13.56   4.48  14.39  14.45
>      2  13.85  13.93   4.86  14.80
>      3  14.51  14.23  14.70   4.72
>
>
> *2. Attached to emulated Root Port and Switches:*
>
> Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 420.48   3.15   3.12   3.12
>      1   3.13 422.31   3.12   3.12
>      2   3.08   3.09 421.40   3.13
>      3   3.10   3.10   3.13 418.68
> Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 418.68   3.14   3.12   3.12
>      1   3.15 420.03   3.12   3.12
>      2   3.11   3.10 421.39   3.14
>      3   3.11   3.08   3.13 419.13
> Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 424.36   5.36   5.35   5.34
>      1   5.36 424.36   5.34   5.34
>      2   5.35   5.36 425.52   5.35
>      3   5.36   5.36   5.34 425.29
> Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
>    D\D     0      1      2      3
>      0 422.98   5.35   5.35   5.35
>      1   5.35 423.44   5.34   5.33
>      2   5.35   5.35 425.29   5.35
>      3   5.35   5.34   5.34 423.21
> P2P=Disabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.79  16.59  16.38  16.22
>      1  16.62   4.77  16.35  16.69
>      2  16.77  16.66   4.03  16.68
>      3  16.54  16.56  16.78   4.08
> P2P=Enabled Latency Matrix (us)
>    D\D     0      1      2      3
>      0   4.51  16.56  16.58  16.66
>      1  15.65   3.87  16.74  16.61
>      2  16.59  16.81   3.96  16.70
>      3  16.47  16.28  16.68   4.03
>
>
> Is it because the heavy load of CPU emulation had caused a bottleneck?
QEMU should really not be involved in the data flow; once the memory slots are configured in KVM, we really should not be exiting out to QEMU regardless of the topology.

I wonder if it has something to do with the link speed/width advertised on the switch port. I don't think the endpoint can actually downshift the physical link, so lspci on the host should probably still show the full bandwidth capability, but maybe the driver is somehow doing rate limiting. PCIe gets a little more complicated as we go to newer versions, so it's not quite as simple as exposing a different bit configuration to advertise 8GT/s, x16. Last I tried to do link matching it was deemed too complicated for something I couldn't prove at the time had measurable value. This might be a good way to prove that value, if it makes a difference here.

I can't think why else you'd see such a performance difference, but testing to see if the KVM exit rate is significantly different could still be an interesting verification. Thanks,

Alex
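
For reference, a rough sketch of both checks suggested above. The host BDF (08:00.0) is taken from the command line in the quoted message; the guest-side addresses will differ, and the 10-second sampling window is arbitrary:

    # Negotiated link speed/width of one physical GPU on the host:
    lspci -s 08:00.0 -vvv | grep -E 'LnkCap|LnkSta'

    # Same check inside the guest, to see what the emulated downstream
    # ports and the passed-through endpoints advertise there:
    lspci -vvv | grep -E 'LnkCap|LnkSta'

    # Count KVM exits on the host while the P2P bandwidth test runs in
    # the guest; repeat for each topology and compare the rates:
    perf stat -e 'kvm:kvm_exit' -a sleep 10

If the guest reports a much slower/narrower link on the emulated ports (e.g. 2.5GT/s, x1) while the host still shows the full capability and the exit counts stay comparable, the advertised link becomes the prime suspect; if the exit rate instead jumps with the switch topology, the data path is somehow bouncing through QEMU after all.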