On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
>
> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >
> > AF_XDP is a network socket family that allows communication directly
> > with the network device driver in the kernel, bypassing most or all
> > of the kernel networking stack. In essence, the technology is
> > pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native
> > and works with any network interface without driver modifications.
> > Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> > require access to character devices or unix sockets. Only access to
> > the network interface itself is necessary.
> >
> > This patch implements a network backend that communicates with the
> > kernel by creating an AF_XDP socket. A chunk of userspace memory
> > is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx,
> > Fill and Completion) are placed in that memory along with a pool of
> > memory buffers for the packet data. Data transmission is done by
> > allocating one of the buffers, copying packet data into it and
> > placing the pointer into the Tx ring. After transmission, the device
> > will return the buffer via the Completion ring. On Rx, the device
> > will take a buffer from the pre-populated Fill ring, write the packet
> > data into it and place the buffer into the Rx ring.
> >
> > The AF_XDP network backend handles the communication with the host
> > kernel and the network interface and forwards packets to/from the
> > peer device in QEMU.
> >
> > Usage example:
> >
> >   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >
> > The XDP program bridges the socket with the network interface. It
> > can be attached to the interface in 2 different modes:
> >
> > 1. skb    - this mode should work for any interface and doesn't
> >             require driver support, with the caveat of lower
> >             performance.
> >
> > 2. native - this does require support from the driver and allows
> >             bypassing skb allocation in the kernel and potentially
> >             using zero-copy while getting packets in/out of userspace.
> >
> > By default, QEMU will try to use native mode and fall back to skb.
> > The mode can be forced via the 'mode' option. To force 'copy' even
> > in native mode, use the 'force-copy=on' option. This might be
> > useful if there is some issue with the driver.
> >
> > The 'queues=N' option allows specifying how many device queues should
> > be open. Note that all the queues that are not open are still
> > functional and can receive traffic, but it will not be delivered to
> > QEMU. So, the number of device queues should generally match the
> > QEMU configuration, unless the device is shared with something else
> > and the traffic re-direction to the appropriate queues is correctly
> > configured on the device level (e.g. with ethtool -N). The
> > 'start-queue=M' option can be used to specify from which queue id
> > QEMU should start configuring 'N' queues. It might also be necessary
> > to use this option with certain NICs, e.g. MLX5 NICs. See the docs
> > for examples.
> >
> > In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> > capabilities in order to load default XSK/XDP programs to the
> > network interface and configure BTF maps.
>
> I think you mean "BPF" actually?
>
> > It is possible, however,
> > to run only with CAP_NET_RAW.
>
> QEMU often runs without any privileges, so we need to fix it.
>
> I think adding support for SCM_RIGHTS via the monitor would be a way
> to go.
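Side note for anyone less familiar with AF_XDP: the umem + 4-ring data
path described in the commit message above boils down to roughly the
untested sketch below, using the xsk helpers from libxdp (or the
deprecated copy in libbpf). The xsk_state struct, xsk_send_one() and
the free-frame handling are made up here for illustration; they are not
taken from the patch.

/* Untested sketch of the Tx path described in the commit message,
 * using the xsk helpers from libxdp.  The state struct, xsk_send_one()
 * and the free-frame handling are made up for illustration. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <xdp/xsk.h>

struct xsk_state {
    void                *umem_area;  /* memory shared with the kernel  */
    struct xsk_umem     *umem;       /* filled by xsk_umem__create()   */
    struct xsk_socket   *xsk;        /* filled by xsk_socket__create() */
    struct xsk_ring_prod tx;         /* Tx ring                        */
    struct xsk_ring_cons cq;         /* Completion ring                */
    /* Fill/Rx rings and a free-frame list omitted for brevity.        */
};

/* Copy one packet into a free umem frame and hand it to the kernel. */
static int xsk_send_one(struct xsk_state *s, const void *pkt, uint32_t len,
                        uint64_t free_frame_addr)
{
    uint32_t idx, cq_idx, done, i;

    /* Reserve one descriptor slot in the Tx ring. */
    if (xsk_ring_prod__reserve(&s->tx, 1, &idx) != 1) {
        return -1;                   /* ring is full, try again later  */
    }

    /* Copy the packet into the chosen umem buffer... */
    memcpy(xsk_umem__get_data(s->umem_area, free_frame_addr), pkt, len);

    /* ...describe it in the Tx ring and make it visible to the kernel. */
    struct xdp_desc *desc = xsk_ring_prod__tx_desc(&s->tx, idx);
    desc->addr = free_frame_addr;
    desc->len  = len;
    xsk_ring_prod__submit(&s->tx, 1);

    /* In all non-zero-copy modes the actual transmission happens in
     * this syscall, as noted in the commit message. */
    sendto(xsk_socket__fd(s->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

    /* Reap buffers the kernel is done with from the Completion ring so
     * they can be reused for later packets. */
    done = xsk_ring_cons__peek(&s->cq, 64, &cq_idx);
    for (i = 0; i < done; i++) {
        uint64_t addr = *xsk_ring_cons__comp_addr(&s->cq, cq_idx + i);
        (void)addr;                  /* return 'addr' to the free list */
    }
    xsk_ring_cons__release(&s->cq, done);
    return 0;
}

The Rx side is symmetric: userspace pre-populates the Fill ring with
buffers, the kernel hands back filled buffers via the Rx ring, and
consumed buffers go back onto the Fill ring.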
> >
> > For that to work, an external process with admin capabilities will
> > need to pre-load the default XSK program and pass an open file
> > descriptor for this program's 'xsks_map' to the QEMU process on
> > startup. The network backend will need to be configured with
> > 'inhibit=on' to avoid loading the programs. The file descriptor for
> > 'xsks_map' can be passed via the 'xsks-map-fd=N' option.
> >
> > There are a few performance challenges with the current network
> > backends.
> >
> > The first is that they do not support IO threads.
>
> The current networking code needs some major refactoring to support IO
> threads, which I'm not sure is worthwhile.
>
> > This means that the data path is handled by the main thread in QEMU
> > and may slow down other work or may be slowed down by some other
> > work. This also means that taking advantage of multi-queue is
> > generally not possible today.
> >
> > Another thing is that the data path goes through the device emulation
> > code, which is not really optimized for performance. The fastest
> > "frontend" device is virtio-net. But it's not optimized for heavy
> > traffic either, because it expects such use-cases to be handled via
> > some implementation of vhost (user, kernel, vdpa). In practice, we
> > have virtio notifications and rcu lock/unlock on a per-packet basis
> > and not very efficient accesses to the guest memory. Communication
> > channels between backend and frontend devices also do not allow
> > passing more than one packet at a time.
> >
> > Some of these challenges can be avoided in the future by adding
> > better batching into device emulation or by implementing a
> > vhost-af-xdp variant.
>
> It might require you to register (pin) the whole guest memory for XSK,
> or there could be a copy. Both of them are sub-optimal.
>
> A really interesting project is to do AF_XDP passthrough; then we
> don't need to care about pinning and copying and we will get ultra
> speed in the guest. (But again, it might need BPF support in
> virtio-net.)
>
> > There are also a few kernel limitations. AF_XDP sockets do not
> > support any kind of checksum or segmentation offloading. Buffers
> > are limited to a page size (4K), i.e. the MTU is limited.
> > Multi-buffer support is not implemented for AF_XDP today. Also,
> > transmission in all non-zero-copy modes is synchronous, i.e. done
> > in a syscall. That doesn't allow high packet rates on virtual
> > interfaces.
> >
> > However, keeping in mind all of these challenges, the current
> > implementation of the AF_XDP backend shows decent performance while
> > running on top of a physical NIC with zero-copy support.
> >
> > Test setup:
> >
> > 2 VMs running on 2 physical hosts connected via a ConnectX6-Dx card.
> > The network backend is configured to open the NIC directly in native
> > mode. The driver supports zero-copy. The NIC is configured to use
> > 1 queue.
> >
> > Inside a VM - iperf3 for basic TCP performance testing and
> > dpdk-testpmd for PPS testing.
> >
> > iperf3 result:
> >  TCP stream      : 19.1 Gbps
> >
> > dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >  Tx only         : 3.4 Mpps
> >  Rx only         : 2.0 Mpps
> >  L2 FWD Loopback : 1.5 Mpps
>
> I don't object to merging this backend (considering we've already
> merged netmap) once the code is fine, but the numbers are not amazing,
> so I wonder: what is the use case for this backend?
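Coming back to the privilege discussion above: for illustration, the
"external process with admin capabilities" could look roughly like the
untested sketch below, written against plain libbpf (it assumes
libbpf >= 0.8 for bpf_xdp_attach()). The object file name, the
interface name and the exec()-based fd hand-off are assumptions on my
side, not something the patch mandates; passing the fd with SCM_RIGHTS
over a control socket would serve the same purpose.

/* Untested sketch of a privileged helper that loads an XDP program
 * carrying an 'xsks_map', attaches it to the interface and lets an
 * unprivileged QEMU inherit the map fd across exec().  Names, paths
 * and the hand-off mechanism are made up for illustration. */
#include <fcntl.h>
#include <net/if.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/if_link.h>
#include <bpf/libbpf.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("xsk_def_prog.o", NULL);
    if (!obj || bpf_object__load(obj)) {
        return 1;
    }

    /* Attach the (only) program in the object in native/driver mode. */
    int ifindex = if_nametoindex("ens6f1np1");
    struct bpf_program *prog = bpf_object__next_program(obj, NULL);
    if (bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                       XDP_FLAGS_DRV_MODE, NULL)) {
        return 1;
    }

    /* Grab the fd of the XSKMAP the program redirects packets into. */
    struct bpf_map *map = bpf_object__find_map_by_name(obj, "xsks_map");
    int map_fd = bpf_map__fd(map);

    /* Clear FD_CLOEXEC so the fd survives exec(); dropping privileges
     * before starting QEMU is omitted for brevity. */
    fcntl(map_fd, F_SETFD, 0);

    char netdev[128];
    snprintf(netdev, sizeof(netdev),
             "af-xdp,ifname=ens6f1np1,id=guest1,inhibit=on,xsks-map-fd=%d",
             map_fd);
    execlp("qemu-system-x86_64", "qemu-system-x86_64",
           "-netdev", netdev,
           "-device", "virtio-net-pci,netdev=guest1",
           (char *)NULL);
    return 1;                        /* only reached if exec() failed */
}

As I read the commit message, QEMU itself would then only need
CAP_NET_RAW to create its AF_XDP sockets and register them in the
inherited 'xsks_map'.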
A more ambitious method is to reuse DPDK via dedicated threads; then we
can reuse any of its PMDs, such as AF_XDP.

Thanks

>
> Thanks