I'm examining the performance of two VMs connected through a vhost-user interface, using DPDK 20.08 and testpmd. Each VM (client-0, server-0) has 4 vCPUs, 4 RX/TX queues per port, and 4 GB of RAM, and runs 8 containers, each with an instance of qperf running the tcp_bw test. The configuration targets all CPU and memory activity at NUMA node 1.
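
For reference, each container pair uses the usual qperf client/server roles; the invocation looks roughly like this (the test duration and the server container's IP below are placeholders, and the message size is varied per run as in the table):

  # in a server-0 container: start the qperf listener
  qperf

  # in the matching client-0 container: run the bandwidth test
  qperf -t 30 -m 8192 <server-container-ip> tcp_bw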

When I look at the cumulative throughput as I increase the number of concurrent qperf pairs, the performance doesn't scale as well as I had hoped. Here's a table with some results:

                           concurrent qperf pairs
msg_size (B)       1           2           4           8
       8,192   12.74 Gb/s  21.68 Gb/s  27.89 Gb/s  30.94 Gb/s
      16,384   13.84 Gb/s  24.06 Gb/s  28.51 Gb/s  30.47 Gb/s
      32,768   16.13 Gb/s  24.49 Gb/s  28.89 Gb/s  30.23 Gb/s
      65,536   16.19 Gb/s  22.53 Gb/s  29.79 Gb/s  30.46 Gb/s
     131,072   15.37 Gb/s  23.89 Gb/s  29.65 Gb/s  30.88 Gb/s
     262,144   14.73 Gb/s  22.97 Gb/s  29.54 Gb/s  31.28 Gb/s
     524,288   14.62 Gb/s  23.39 Gb/s  28.70 Gb/s  30.98 Gb/s

Can anyone suggest a configuration change that might improve performance, or is this generally what should be expected? I was expecting performance to nearly double as I move from 1 to 2 to 4 queues.
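
For example, taking the msg_size 8,192 row and normalizing to the single-pair result:

  2 pairs: 21.68 / 12.74 ≈ 1.70x
  4 pairs: 27.89 / 12.74 ≈ 2.19x
  8 pairs: 30.94 / 12.74 ≈ 2.43x

so the aggregate flattens out around 30-31 Gb/s regardless of message size, well short of linear scaling.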

Even single-queue performance is below Intel's published results (see https://fast.dpdk.org/doc/perf/DPDK_20_05_Intel_virtio_performance_report.pdf), though I was unable to get the vhost-switch example application to run (it fails with an mbuf allocation error in the i40e PMD) and had to fall back to the testpmd app.

Configuration details below.

Dave

/proc/cmdline:
--------------
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-147.el8.x86_64 root=/dev/mapper/rhel-root ro intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet =1 nohz=on nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 tuned.non_isolcpus=00ff00ff intel_pstate=disable nosoftlockup
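
Per-NUMA-node availability of the 1G pages can be checked with, e.g.:

  cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
  cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages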

testpmd command-line:
---------------------
~/src/dpdk/build/app/dpdk-testpmd -l 7,24-31 -n 4 --no-pci --vdev 'net_vhost0,iface=/tmp/vhost-dpdk-server-0,dequeue-zero-copy=1,tso=1,queues=4' --vdev 'net_vhost1,iface=/tmp/vhost-dpdk-client-0,dequeue-zero-copy=1,tso=1,queues=4' -- -i --nb-cores=8 --numa --rxq=4 --txq=4
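
From the interactive prompt, the per-queue/per-core spread can be inspected while the test runs, e.g.:

  testpmd> start
  testpmd> show port stats all     # aggregate per-port counters
  testpmd> show fwd stats all      # per-stream (per-queue/per-core) RX/TX counts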

testpmd forwarding core mapping:
--------------------------------
Start automatic packet forwarding
io packet forwarding - ports=2 - cores=8 - streams=8 - NUMA support enabled, MP allocation mode: native
Logical Core 24 (socket 1) forwards packets on 1 streams:
  RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
Logical Core 25 (socket 1) forwards packets on 1 streams:
  RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Logical Core 26 (socket 1) forwards packets on 1 streams:
  RX P=0/Q=1 (socket 0) -> TX P=1/Q=1 (socket 0) peer=02:00:00:00:00:01
Logical Core 27 (socket 1) forwards packets on 1 streams:
  RX P=1/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
Logical Core 28 (socket 1) forwards packets on 1 streams:
  RX P=0/Q=2 (socket 0) -> TX P=1/Q=2 (socket 0) peer=02:00:00:00:00:01
Logical Core 29 (socket 1) forwards packets on 1 streams:
  RX P=1/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
Logical Core 30 (socket 1) forwards packets on 1 streams:
  RX P=0/Q=3 (socket 0) -> TX P=1/Q=3 (socket 0) peer=02:00:00:00:00:01
Logical Core 31 (socket 1) forwards packets on 1 streams:
  RX P=1/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00

  io packet forwarding packets/burst=32
  nb forwarding cores=8 - nb forwarding ports=2
  port 0: RX queue number: 4 Tx queue number: 4
    Rx offloads=0x0 Tx offloads=0x0
    RX queue: 0
      RX desc=0 - RX free threshold=0
      RX threshold registers: pthresh=0 hthresh=0  wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=0 - TX free threshold=0
      TX threshold registers: pthresh=0 hthresh=0  wthresh=0
      TX offloads=0x0 - TX RS bit threshold=0
  port 1: RX queue number: 4 Tx queue number: 4
    Rx offloads=0x0 Tx offloads=0x0
    RX queue: 0
      RX desc=0 - RX free threshold=0
      RX threshold registers: pthresh=0 hthresh=0  wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=0 - TX free threshold=0
      TX threshold registers: pthresh=0 hthresh=0  wthresh=0
      TX offloads=0x0 - TX RS bit threshold=0

lscpu:
------
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             2400.075
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            11264K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
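
Since both the VM vCPU cpusets and the testpmd forwarding lcores (24-31) sit on node 1, the hyper-thread sibling pairing there can be listed with, e.g. (bash):

  grep . /sys/devices/system/cpu/cpu{8..15}/topology/thread_siblings_list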

server-0 libvirt XML:
---------------------
...
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static' cpuset='8-11'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.2.0'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state='off'/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <numa>
      <cell id='0' cpus='0-3' memory='4194304' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
...
    <interface type='vhostuser'>
      <mac address='52:54:00:2a:fc:10'/>
      <source type='unix' path='/tmp/vhost-dpdk-server-0' mode='client'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4' rx_queue_size='1024'>
        <host mrg_rxbuf='on'/>
      </driver>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
...
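
Inside each guest, the number of virtio queue pairs actually in use can be confirmed (and raised, if needed) with ethtool; the interface name is a placeholder:

  ethtool -l eth0                 # shows maximum vs. currently enabled channels
  ethtool -L eth0 combined 4      # enable all 4 queue pairs if fewer are active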

