On 9/8/23 13:48, Daniel P. Berrangé wrote:
> On Fri, Sep 08, 2023 at 02:45:02PM +0800, Jason Wang wrote:
>> From: Ilya Maximets <i.maxim...@ovn.org>
>>
>> AF_XDP is a network socket family that allows communication directly
>> with the network device driver in the kernel, bypassing most or all
>> of the kernel networking stack.  In essence, the technology is
>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>> and works with any network interface without driver modifications.
>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>> require access to character devices or unix sockets.  Only access to
>> the network interface itself is necessary.
>>
>> This patch implements a network backend that communicates with the
>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>> is shared between QEMU and the host kernel.  Four ring buffers (Tx,
>> Rx, Fill and Completion) are placed in that memory along with a pool
>> of memory buffers for the packet data.  Data transmission is done by
>> allocating one of the buffers, copying packet data into it and
>> placing the pointer into the Tx ring.  After transmission, the device
>> will return the buffer via the Completion ring.  On Rx, the device
>> will take a buffer from a pre-populated Fill ring, write the packet
>> data into it and place the buffer into the Rx ring.
>>
>> The AF_XDP network backend takes care of the communication with the
>> host kernel and the network interface and forwards packets to/from
>> the peer device in QEMU.
>>
>> Usage example:
>>
>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>
>> An XDP program bridges the socket with a network interface.  It can
>> be attached to the interface in two different modes:
>>
>> 1. skb    - this mode should work for any interface and doesn't
>>             require driver support, at the cost of lower performance.
>>
>> 2. native - this mode requires support from the driver and allows
>>             bypassing skb allocation in the kernel and potentially
>>             using zero-copy while getting packets in/out of
>>             userspace.
>>
>> By default, QEMU will try to use native mode and fall back to skb.
>> The mode can be forced via the 'mode' option.  To force copying even
>> in native mode, use the 'force-copy=on' option.  This might be
>> useful if there is some issue with the driver.
>>
>> The 'queues=N' option specifies how many device queues should be
>> open.  Note that all the queues that are not open are still
>> functional and can receive traffic, but it will not be delivered to
>> QEMU.  So, the number of device queues should generally match the
>> QEMU configuration, unless the device is shared with something else
>> and the traffic redirection to the appropriate queues is correctly
>> configured at the device level (e.g. with ethtool -N).  The
>> 'start-queue=M' option can be used to specify from which queue id
>> QEMU should start configuring 'N' queues.  It might also be
>> necessary to use this option with certain NICs, e.g. MLX5 NICs.  See
>> the docs for examples.
>>
>> In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>> or CAP_BPF capabilities in order to load the default XSK/XDP
>> programs to the network interface and configure BPF maps.  It is
>> possible, however, to run with no capabilities.  For that to work,
>> an external process with enough capabilities will need to pre-load
>> the default XSK program, create AF_XDP sockets and pass their file
>> descriptors to the QEMU process on startup via the 'sock-fds'
>> option.  The network backend will then need to be configured with
>> 'inhibit=on' to avoid loading the program.  QEMU will need 32 MB of
>> locked memory (RLIMIT_MEMLOCK) per queue, or CAP_IPC_LOCK.
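
(As an aside, for anyone reading along who hasn't used AF_XDP before,
here is roughly what a single transmit looks like in terms of the
libxdp xsk_* helpers.  This is a minimal illustrative sketch, not the
code from this patch; the xsk_send_one() helper and its arguments are
made up for the example, and it assumes the socket, UMEM and a
free-frame pool were already set up with xsk_umem__create() and
xsk_socket__create():

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <xdp/xsk.h>  /* libxdp; older libbpf shipped this as <bpf/xsk.h> */

  /* Send one packet by publishing a descriptor on the Tx ring and
   * reclaiming finished buffers from the Completion ring. */
  static bool xsk_send_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                           struct xsk_ring_cons *cq, void *umem_area,
                           uint64_t free_frame_addr,
                           const void *pkt, uint32_t len)
  {
      uint32_t idx;

      /* Reserve a slot in the Tx ring; give up if the ring is full. */
      if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
          return false;
      }

      /* Copy the packet into the shared UMEM buffer and describe it. */
      memcpy(xsk_umem__get_data(umem_area, free_frame_addr), pkt, len);
      struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
      desc->addr = free_frame_addr;
      desc->len = len;

      /* Publish the descriptor and kick the kernel to process the Tx
       * ring.  In non-zero-copy modes this syscall performs the actual
       * transmission, which is what "synchronous" refers to above. */
      xsk_ring_prod__submit(tx, 1);
      sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

      /* Reclaim frames the kernel has finished transmitting. */
      uint32_t cidx;
      size_t done = xsk_ring_cons__peek(cq, 64, &cidx);
      for (size_t i = 0; i < done; i++) {
          uint64_t addr = *xsk_ring_cons__comp_addr(cq, cidx + i);
          (void)addr;   /* return this address to the free-frame pool */
      }
      xsk_ring_cons__release(cq, done);

      return true;
  }

The Rx side is symmetric: free UMEM frames are handed to the kernel
through the Fill ring, and received packets show up as descriptors on
the Rx ring via xsk_ring_cons__peek()/xsk_ring_cons__rx_desc().)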
>>
>> There are a few performance challenges with the current network
>> backends.
>>
>> First is that they do not support IO threads.  This means that the
>> data path is handled by the main thread in QEMU and may slow down
>> other work or may be slowed down by some other work.  This also
>> means that taking advantage of multi-queue is generally not possible
>> today.
>>
>> Another thing is that the data path goes through the device
>> emulation code, which is not really optimized for performance.  The
>> fastest "frontend" device is virtio-net.  But it's not optimized for
>> heavy traffic either, because it expects such use-cases to be
>> handled via some implementation of vhost (user, kernel, vdpa).  In
>> practice, we have virtio notifications and rcu lock/unlock on a
>> per-packet basis and not very efficient accesses to the guest
>> memory.  Communication channels between backend and frontend devices
>> do not allow passing more than one packet at a time either.
>>
>> Some of these challenges can be avoided in the future by adding
>> better batching into device emulation or by implementing a
>> vhost-af-xdp variant.
>>
>> There are also a few kernel limitations.  AF_XDP sockets do not
>> support any kind of checksum or segmentation offloading.  Buffers
>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>> support for AF_XDP is in progress, but not ready yet.  Also,
>> transmission in all non-zero-copy modes is synchronous, i.e. done in
>> a syscall.  That doesn't allow high packet rates on virtual
>> interfaces.
>>
>> However, keeping in mind all of these challenges, the current
>> implementation of the AF_XDP backend shows decent performance while
>> running on top of a physical NIC with zero-copy support.
>>
>> Test setup:
>>
>> 2 VMs running on 2 physical hosts connected via a ConnectX6-Dx card.
>> The network backend is configured to open the NIC directly in native
>> mode.  The driver supports zero-copy.  The NIC is configured to use
>> 1 queue.
>>
>> Inside a VM - iperf3 for basic TCP performance testing and
>> dpdk-testpmd for PPS testing.
>>
>> iperf3 result:
>>   TCP stream      : 19.1 Gbps
>>
>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>   Tx only         : 3.4 Mpps
>>   Rx only         : 2.0 Mpps
>>   L2 FWD Loopback : 1.5 Mpps
>>
>> In skb mode the same setup shows much lower performance, similar to
>> the setup where the pair of physical NICs is replaced with a veth
>> pair:
>>
>> iperf3 result:
>>   TCP stream      : 9 Gbps
>>
>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>   Tx only         : 1.2 Mpps
>>   Rx only         : 1.0 Mpps
>>   L2 FWD Loopback : 0.7 Mpps
>>
>> Results in skb mode or over the veth pair are close to the results
>> of a tap backend with vhost=on and segmentation offloading disabled,
>> bridged with a NIC.
>
>
>> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
>> index 02262bc..811a7fe 100644
>> --- a/tests/docker/dockerfiles/debian-amd64.docker
>> +++ b/tests/docker/dockerfiles/debian-amd64.docker
>> @@ -98,6 +98,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
>>       libvirglrenderer-dev \
>>       libvte-2.91-dev \
>>       libxen-dev \
>> +     libxdp-dev \
>>       libzstd-dev \
>>       llvm \
>>       locales \
>
> As the comment at the top of the file states - this is auto-generated
> by lcitool and must not be hand-edited like this.
Sorry, missed that part of the process initially.  I see how that
works now.

>
> Check out docs/devel/testing.rst which has guidance on the process
> for adding new package deps with lcitool/libvirt-ci.

Will do, thanks!

>
> With regards,
> Daniel
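
(To make sure I understand the process correctly from
docs/devel/testing.rst: the dependency goes into
tests/lcitool/projects/qemu.yml, plus a package-name mapping in the
libvirt-ci submodule if one doesn't exist yet, and the dockerfiles are
then regenerated with 'make lcitool-refresh' rather than edited by
hand.  Please correct me if that's not the intended workflow.)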