On Thu, Jul 20, 2023 at 9:26 PM Ilya Maximets <i.maxim...@ovn.org> wrote: > > On 7/20/23 09:37, Jason Wang wrote: > > On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote: > >> > >> AF_XDP is a network socket family that allows communication directly > >> with the network device driver in the kernel, bypassing most or all > >> of the kernel networking stack. In the essence, the technology is > >> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native > >> and works with any network interfaces without driver modifications. > >> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't > >> require access to character devices or unix sockets. Only access to > >> the network interface itself is necessary. > >> > >> This patch implements a network backend that communicates with the > >> kernel by creating an AF_XDP socket. A chunk of userspace memory > >> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, > >> Fill and Completion) are placed in that memory along with a pool of > >> memory buffers for the packet data. Data transmission is done by > >> allocating one of the buffers, copying packet data into it and > >> placing the pointer into Tx ring. After transmission, device will > >> return the buffer via Completion ring. On Rx, device will take > >> a buffer form a pre-populated Fill ring, write the packet data into > >> it and place the buffer into Rx ring. > >> > >> AF_XDP network backend takes on the communication with the host > >> kernel and the network interface and forwards packets to/from the > >> peer device in QEMU. > >> > >> Usage example: > >> > >> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C > >> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1 > >> > >> XDP program bridges the socket with a network interface. It can be > >> attached to the interface in 2 different modes: > >> > >> 1. skb - this mode should work for any interface and doesn't require > >> driver support. With a caveat of lower performance. > >> > >> 2. native - this does require support from the driver and allows to > >> bypass skb allocation in the kernel and potentially use > >> zero-copy while getting packets in/out userspace. > >> > >> By default, QEMU will try to use native mode and fall back to skb. > >> Mode can be forced via 'mode' option. To force 'copy' even in native > >> mode, use 'force-copy=on' option. This might be useful if there is > >> some issue with the driver. > >> > >> Option 'queues=N' allows to specify how many device queues should > >> be open. Note that all the queues that are not open are still > >> functional and can receive traffic, but it will not be delivered to > >> QEMU. So, the number of device queues should generally match the > >> QEMU configuration, unless the device is shared with something > >> else and the traffic re-direction to appropriate queues is correctly > >> configured on a device level (e.g. with ethtool -N). > >> 'start-queue=M' option can be used to specify from which queue id > >> QEMU should start configuring 'N' queues. It might also be necessary > >> to use this option with certain NICs, e.g. MLX5 NICs. See the docs > >> for examples. > >> > >> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN > >> capabilities in order to load default XSK/XDP programs to the > >> network interface and configure BPF maps. It is possible, however, > >> to run with no capabilities. For that to work, an external process > >> with admin capabilities will need to pre-load default XSK program, > >> create AF_XDP sockets and pass their file descriptors to QEMU process > >> on startup via 'sock-fds' option. Network backend will need to be > >> configured with 'inhibit=on' to avoid loading of the program. > >> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue > >> or CAP_IPC_LOCK. > >> > >> Alternatively, the file descriptor for 'xsks_map' can be passed via > >> 'xsks-map-fd=N' option instead of passing socket file descriptors. > >> That will additionally require CAP_NET_RAW on QEMU side. This is > >> useful, because 'sock-fds' may not be available with older libxdp. > >> 'sock-fds' requires libxdp >= 1.4.0. > >> > >> There are few performance challenges with the current network backends. > >> > >> First is that they do not support IO threads. This means that data > >> path is handled by the main thread in QEMU and may slow down other > >> work or may be slowed down by some other work. This also means that > >> taking advantage of multi-queue is generally not possible today. > >> > >> Another thing is that data path is going through the device emulation > >> code, which is not really optimized for performance. The fastest > >> "frontend" device is virtio-net. But it's not optimized for heavy > >> traffic either, because it expects such use-cases to be handled via > >> some implementation of vhost (user, kernel, vdpa). In practice, we > >> have virtio notifications and rcu lock/unlock on a per-packet basis > >> and not very efficient accesses to the guest memory. Communication > >> channels between backend and frontend devices do not allow passing > >> more than one packet at a time as well. > >> > >> Some of these challenges can be avoided in the future by adding better > >> batching into device emulation or by implementing vhost-af-xdp variant. > >> > >> There are also a few kernel limitations. AF_XDP sockets do not > >> support any kinds of checksum or segmentation offloading. Buffers > >> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer > >> support implementation for AF_XDP is in progress, but not ready yet. > >> Also, transmission in all non-zero-copy modes is synchronous, i.e. > >> done in a syscall. That doesn't allow high packet rates on virtual > >> interfaces. > >> > >> However, keeping in mind all of these challenges, current implementation > >> of the AF_XDP backend shows a decent performance while running on top > >> of a physical NIC with zero-copy support. > >> > >> Test setup: > >> > >> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card. > >> Network backend is configured to open the NIC directly in native mode. > >> The driver supports zero-copy. NIC is configured to use 1 queue. > >> > >> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd > >> for PPS testing. > >> > >> iperf3 result: > >> TCP stream : 19.1 Gbps > >> > >> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: > >> Tx only : 3.4 Mpps > >> Rx only : 2.0 Mpps > >> L2 FWD Loopback : 1.5 Mpps > >> > >> In skb mode the same setup shows much lower performance, similar to > >> the setup where pair of physical NICs is replaced with veth pair: > >> > >> iperf3 result: > >> TCP stream : 9 Gbps > >> > >> dpdk-testpmd (single queue, single CPU core, 64 B packets) results: > >> Tx only : 1.2 Mpps > >> Rx only : 1.0 Mpps > >> L2 FWD Loopback : 0.7 Mpps > >> > >> Results in skb mode or over the veth are close to results of a tap > >> backend with vhost=on and disabled segmentation offloading bridged > >> with a NIC. > >> > >> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> > > > > Looks good overall, see few comments inline. > > Thanks for review! > > > > >> --- > >> > >> Version 2: > >> > >> - Added support for running with no capabilities by passing > >> pre-created AF_XDP socket file descriptors via 'sock-fds' option. > >> Conditionally complied because requires unreleased libxdp 1.4.0. > >> The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue. > >> > >> - Refined and extended documentation. > >> > >> > >> MAINTAINERS | 4 + > >> hmp-commands.hx | 2 +- > >> meson.build | 19 + > >> meson_options.txt | 2 + > >> net/af-xdp.c | 570 ++++++++++++++++++ > >> net/clients.h | 5 + > >> net/meson.build | 3 + > >> net/net.c | 6 + > >> qapi/net.json | 60 +- > >> qemu-options.hx | 83 ++- > >> .../ci/org.centos/stream/8/x86_64/configure | 1 + > >> scripts/meson-buildoptions.sh | 3 + > >> tests/docker/dockerfiles/debian-amd64.docker | 1 + > >> 13 files changed, 756 insertions(+), 3 deletions(-) > >> create mode 100644 net/af-xdp.c > >> > >> diff --git a/MAINTAINERS b/MAINTAINERS > >> index 7164cf55a1..80d4ba4004 100644 > >> --- a/MAINTAINERS > >> +++ b/MAINTAINERS > >> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/ > >> S: Maintained > >> F: net/netmap.c > >> > >> +AF_XDP network backend > >> +R: Ilya Maximets <i.maxim...@ovn.org> > >> +F: net/af-xdp.c > >> + > >> Host Memory Backends > >> M: David Hildenbrand <da...@redhat.com> > >> M: Igor Mammedov <imamm...@redhat.com> > >> diff --git a/hmp-commands.hx b/hmp-commands.hx > >> index 2cbd0f77a0..af9ffe4681 100644 > >> --- a/hmp-commands.hx > >> +++ b/hmp-commands.hx > >> @@ -1295,7 +1295,7 @@ ERST > >> { > >> .name = "netdev_add", > >> .args_type = "netdev:O", > >> - .params = > >> "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user" > >> + .params = > >> "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user" > >> #ifdef CONFIG_VMNET > >> "|vmnet-host|vmnet-shared|vmnet-bridged" > >> #endif > >> diff --git a/meson.build b/meson.build > >> index a9ba0bfab3..1f8772ea5d 100644 > >> --- a/meson.build > >> +++ b/meson.build > >> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links(''' > >> endif > >> endif > >> > >> +# libxdp > >> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: > >> 'pkg-config') > >> +if libxdp.found() and \ > >> + not (libbpf.found() and libbpf.version().version_compare('>=0.7')) > >> + libxdp = not_found > >> + if get_option('af_xdp').enabled() > >> + error('af-xdp support requires libbpf version >= 0.7') > > > > Can we simply limit this to 1.4? > > This is a check for libbpf, not libxdp. Or do you think there is no need > to check libbpf version if we request libxdp version high enough? > Users may still break the build by installing old libbpf manually even if > distributions ship more modern versions. > > Or do you mean limit the libxdp version to 1.4 in order to avoid conditional > on HAVE_XSK_UMEM__CREATE_WITH_FD ?
Yes. > The problem with that is that libxdp 1.4 is a week+ old, so not available in > any distribution, AFAIK. Not sure how big of a problem that is though. It doesn't matter as this is a brand new backend, it would simplify future maintenance if we can get rid of any HAVE_XXX macros. > > > > > > >> + else > >> + warning('af-xdp support requires libbpf version >= 0.7, disabling') > >> + endif > >> +endif > >> + > >> # libdw > >> libdw = not_found > >> if not get_option('libdw').auto() or \ > >> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', > >> get_option('hexagon_idef_pars > >> config_host_data.set('CONFIG_LIBATTR', have_old_libattr) > >> config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found()) > >> config_host_data.set('CONFIG_EBPF', libbpf.found()) > >> +config_host_data.set('CONFIG_AF_XDP', libxdp.found()) > >> +if libxdp.found() > >> + config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD', > >> + cc.has_function('xsk_umem__create_with_fd', > >> + dependencies: libxdp)) > >> +endif > >> config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found()) > >> config_host_data.set('CONFIG_LIBISCSI', libiscsi.found()) > >> config_host_data.set('CONFIG_LIBNFS', libnfs.found()) > >> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support': have_pvrdma} > >> summary_info += {'fdt support': fdt_opt == 'disabled' ? false : > >> fdt_opt} > >> summary_info += {'libcap-ng support': libcap_ng} > >> summary_info += {'bpf support': libbpf} > >> +summary_info += {'AF_XDP support': libxdp} > >> summary_info += {'rbd support': rbd} > >> summary_info += {'smartcard support': cacard} > >> summary_info += {'U2F support': u2f} > >> diff --git a/meson_options.txt b/meson_options.txt > >> index bbb5c7e886..f4e950ce6a 100644 > >> --- a/meson_options.txt > >> +++ b/meson_options.txt > >> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto', > >> option('keyring', type: 'feature', value: 'auto', > >> description: 'Linux keyring support') > >> > >> +option('af_xdp', type : 'feature', value : 'auto', > >> + description: 'AF_XDP network backend support') > >> option('attr', type : 'feature', value : 'auto', > >> description: 'attr/xattr support') > >> option('auth_pam', type : 'feature', value : 'auto', > >> diff --git a/net/af-xdp.c b/net/af-xdp.c > >> new file mode 100644 > >> index 0000000000..265ba6b12e > >> --- /dev/null > >> +++ b/net/af-xdp.c > >> @@ -0,0 +1,570 @@ > >> +/* > >> + * AF_XDP network backend. > >> + * > >> + * Copyright (c) 2023 Red Hat, Inc. > >> + * > >> + * Authors: > >> + * Ilya Maximets <i.maxim...@ovn.org> > >> + * > >> + * This work is licensed under the terms of the GNU GPL, version 2 or > >> later. > >> + * See the COPYING file in the top-level directory. > >> + */ > >> + > >> + > >> +#include "qemu/osdep.h" > >> +#include <bpf/bpf.h> > >> +#include <inttypes.h> > >> +#include <linux/if_link.h> > >> +#include <linux/if_xdp.h> > >> +#include <net/if.h> > >> +#include <xdp/xsk.h> > >> + > >> +#include "clients.h" > >> +#include "monitor/monitor.h" > >> +#include "net/net.h" > >> +#include "qapi/error.h" > >> +#include "qemu/cutils.h" > >> +#include "qemu/error-report.h" > >> +#include "qemu/iov.h" > >> +#include "qemu/main-loop.h" > >> +#include "qemu/memalign.h" > >> + > >> + > >> +typedef struct AFXDPState { > >> + NetClientState nc; > >> + > >> + struct xsk_socket *xsk; > >> + struct xsk_ring_cons rx; > >> + struct xsk_ring_prod tx; > >> + struct xsk_ring_cons cq; > >> + struct xsk_ring_prod fq; > >> + > >> + char ifname[IFNAMSIZ]; > >> + int ifindex; > >> + bool read_poll; > >> + bool write_poll; > >> + uint32_t outstanding_tx; > >> + > >> + uint64_t *pool; > >> + uint32_t n_pool; > >> + char *buffer; > >> + struct xsk_umem *umem; > >> + > >> + uint32_t n_queues; > >> + uint32_t xdp_flags; > >> + bool inhibit; > >> +} AFXDPState; > >> + > >> +#define AF_XDP_BATCH_SIZE 64 > >> + > >> +static void af_xdp_send(void *opaque); > >> +static void af_xdp_writable(void *opaque); > >> + > >> +/* Set the event-loop handlers for the af-xdp backend. */ > >> +static void af_xdp_update_fd_handler(AFXDPState *s) > >> +{ > >> + qemu_set_fd_handler(xsk_socket__fd(s->xsk), > >> + s->read_poll ? af_xdp_send : NULL, > >> + s->write_poll ? af_xdp_writable : NULL, > >> + s); > >> +} > >> + > >> +/* Update the read handler. */ > >> +static void af_xdp_read_poll(AFXDPState *s, bool enable) > >> +{ > >> + if (s->read_poll != enable) { > >> + s->read_poll = enable; > >> + af_xdp_update_fd_handler(s); > >> + } > >> +} > >> + > >> +/* Update the write handler. */ > >> +static void af_xdp_write_poll(AFXDPState *s, bool enable) > >> +{ > >> + if (s->write_poll != enable) { > >> + s->write_poll = enable; > >> + af_xdp_update_fd_handler(s); > >> + } > >> +} > >> + > >> +static void af_xdp_poll(NetClientState *nc, bool enable) > >> +{ > >> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); > >> + > >> + if (s->read_poll != enable || s->write_poll != enable) { > >> + s->write_poll = enable; > >> + s->read_poll = enable; > >> + af_xdp_update_fd_handler(s); > >> + } > >> +} > >> + > >> +static void af_xdp_complete_tx(AFXDPState *s) > >> +{ > >> + uint32_t idx = 0; > >> + uint32_t done, i; > >> + uint64_t *addr; > >> + > >> + done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, > >> &idx); > >> + > >> + for (i = 0; i < done; i++) { > >> + addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++); > >> + s->pool[s->n_pool++] = *addr; > >> + s->outstanding_tx--; > >> + } > >> + > >> + if (done) { > >> + xsk_ring_cons__release(&s->cq, done); > >> + } > >> +} > >> + > >> +/* > >> + * The fd_write() callback, invoked if the fd is marked as writable > >> + * after a poll. > >> + */ > >> +static void af_xdp_writable(void *opaque) > >> +{ > >> + AFXDPState *s = opaque; > >> + > >> + /* Try to recover buffers that are already sent. */ > >> + af_xdp_complete_tx(s); > >> + > >> + /* > >> + * Unregister the handler, unless we still have packets to transmit > >> + * and kernel needs a wake up. > >> + */ > >> + if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) { > >> + af_xdp_write_poll(s, false); > >> + } > >> + > >> + /* Flush any buffered packets. */ > >> + qemu_flush_queued_packets(&s->nc); > >> +} > >> + > >> +static ssize_t af_xdp_receive(NetClientState *nc, > >> + const uint8_t *buf, size_t size) > >> +{ > >> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); > >> + struct xdp_desc *desc; > >> + uint32_t idx; > >> + void *data; > >> + > >> + /* Try to recover buffers that are already sent. */ > >> + af_xdp_complete_tx(s); > >> + > >> + if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) { > >> + /* We can't transmit packet this size... */ > >> + return size; > >> + } > >> + > >> + if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) { > >> + /* > >> + * Out of buffers or space in tx ring. Poll until we can write. > >> + * This will also kick the Tx, if it was waiting on CQ. > >> + */ > >> + af_xdp_write_poll(s, true); > >> + return 0; > >> + } > >> + > >> + desc = xsk_ring_prod__tx_desc(&s->tx, idx); > >> + desc->addr = s->pool[--s->n_pool]; > >> + desc->len = size; > >> + > >> + data = xsk_umem__get_data(s->buffer, desc->addr); > >> + memcpy(data, buf, size); > >> + > >> + xsk_ring_prod__submit(&s->tx, 1); > >> + s->outstanding_tx++; > >> + > >> + if (xsk_ring_prod__needs_wakeup(&s->tx)) { > >> + af_xdp_write_poll(s, true); > >> + } > >> + > >> + return size; > >> +} > >> + > >> +/* > >> + * Complete a previous send (backend --> guest) and enable the > >> + * fd_read callback. > >> + */ > >> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len) > >> +{ > >> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); > >> + > >> + af_xdp_read_poll(s, true); > >> +} > >> + > >> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n) > >> +{ > >> + uint32_t i, idx = 0; > >> + > >> + /* Leave one packet for Tx, just in case. */ > >> + if (s->n_pool < n + 1) { > >> + n = s->n_pool; > >> + } > >> + > >> + if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) { > >> + return; > >> + } > >> + > >> + for (i = 0; i < n; i++) { > >> + *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool]; > >> + } > >> + xsk_ring_prod__submit(&s->fq, n); > >> + > >> + if (xsk_ring_prod__needs_wakeup(&s->fq)) { > >> + /* Receive was blocked by not having enough buffers. Wake it up. > >> */ > >> + af_xdp_read_poll(s, true); > >> + } > >> +} > >> + > >> +static void af_xdp_send(void *opaque) > >> +{ > >> + uint32_t i, n_rx, idx = 0; > >> + AFXDPState *s = opaque; > >> + > >> + n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx); > >> + if (!n_rx) { > >> + return; > >> + } > >> + > >> + for (i = 0; i < n_rx; i++) { > >> + const struct xdp_desc *desc; > >> + struct iovec iov; > >> + > >> + desc = xsk_ring_cons__rx_desc(&s->rx, idx++); > >> + > >> + iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr); > >> + iov.iov_len = desc->len; > >> + > >> + s->pool[s->n_pool++] = desc->addr; > >> + > >> + if (!qemu_sendv_packet_async(&s->nc, &iov, 1, > >> + af_xdp_send_completed)) { > >> + /* > >> + * The peer does not receive anymore. Packet is queued, stop > >> + * reading from the backend until af_xdp_send_completed(). > >> + */ > >> + af_xdp_read_poll(s, false); > >> + > >> + /* Re-peek the descriptors to not break the ring cache. */ > >> + xsk_ring_cons__cancel(&s->rx, n_rx); > >> + n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx); > > > > The code turns out to be hard to read here. > > > > 1) This seems to undo the peek (usually peek doesn't touch the > > prod/consumer but it seems not what xsk_ring_cons__peek()) did: > > Yeah, it's unfortunate that the peek() function changes the internal > state, but that is what we have... > > > > > static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons, > > __u32 nb, __u32 *idx) > > { > > __u32 entries = xsk_cons_nb_avail(cons, nb); > > > > if (entries > 0) { > > *idx = cons->cached_cons; > > cons->cached_cons += entries; > > } > > > > return entries; > > } > > > > 2) It looks to me a partial rollback is sufficient? > > > > xsk_ring_cons__cancel(n_rx - i + 1)? > > Good point. Should work. It should be n_rx - i - 1 though, if I'm not > mistaken. So: > > xsk_ring_cons__cancel(n_rx - i - 1); > n_rx = i + 1; > > I'm not sure if that is much easier to read, but that's OK. Should be > a touch faster as well. What do you think? Let's do that please. > > > > >> + g_assert(n_rx == i + 1); > >> + break; > >> + } > >> + } > >> + > >> + /* Release actually sent descriptors and try to re-fill. */ > >> + xsk_ring_cons__release(&s->rx, n_rx); > >> + af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE); > >> +} > >> + > >> +/* Flush and close. */ > >> +static void af_xdp_cleanup(NetClientState *nc) > >> +{ > >> + AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc); > >> + > >> + qemu_purge_queued_packets(nc); > >> + > >> + af_xdp_poll(nc, false); > >> + > >> + xsk_socket__delete(s->xsk); > >> + s->xsk = NULL; > >> + g_free(s->pool); > >> + s->pool = NULL; > >> + xsk_umem__delete(s->umem); > >> + s->umem = NULL; > >> + qemu_vfree(s->buffer); > >> + s->buffer = NULL; > >> + > >> + /* Remove the program if it's the last open queue. */ > >> + if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags > >> + && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) { > >> + fprintf(stderr, > >> + "af-xdp: unable to remove XDP program from '%s', ifindex: > >> %d\n", > >> + s->ifname, s->ifindex); > >> + } > >> +} > >> + > >> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp) > >> +{ > >> + struct xsk_umem_config config = { > >> + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, > >> + .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, > >> + .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE, > >> + .frame_headroom = 0, > >> + }; > >> + uint64_t n_descs; > >> + uint64_t size; > >> + int64_t i; > >> + int ret; > >> + > >> + /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */ > >> + n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS > >> + + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2; > >> + size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE; > >> + > >> + s->buffer = qemu_memalign(qemu_real_host_page_size(), size); > >> + memset(s->buffer, 0, size); > >> + > >> + if (sock_fd < 0) { > >> + ret = xsk_umem__create(&s->umem, s->buffer, size, > >> + &s->fq, &s->cq, &config); > >> + } else { > >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD > >> + ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size, > >> + &s->fq, &s->cq, &config); > >> +#else > > > > So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better > > > > 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD > > The qapi property is conditionally defined, so users will not be able to > set sock-fds if not supported. And qemu will complain about unknown > property. That should be enough? Yes. > > > > > or > > > > 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD > > If we require libxdp 1.4 that will be the case. > > > > >> + ret = -1; > >> + errno = EINVAL; > >> +#endif > >> + } > >> + > >> + if (ret) { > >> + qemu_vfree(s->buffer); > >> + error_setg_errno(errp, errno, > >> + "failed to create umem for %s queue_index: %d", > >> + s->ifname, s->nc.queue_index); > >> + return -1; > >> + } > >> + > >> + s->pool = g_new(uint64_t, n_descs); > >> + /* Fill the pool in the opposite order, because it's a LIFO queue. */ > >> + for (i = n_descs; i >= 0; i--) { > >> + s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE; > >> + } > >> + s->n_pool = n_descs; > >> + > >> + af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS); > >> + > >> + return 0; > >> +} > >> + > >> +static int af_xdp_socket_create(AFXDPState *s, > >> + const NetdevAFXDPOptions *opts, > >> + int xsks_map_fd, Error **errp) > >> +{ > >> + struct xsk_socket_config cfg = { > >> + .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, > >> + .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, > >> + .libxdp_flags = 0, > >> + .bind_flags = XDP_USE_NEED_WAKEUP, > >> + .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST, > >> + }; > >> + int queue_id, error = 0; > >> + > >> + s->inhibit = opts->has_inhibit && opts->inhibit; > >> + if (s->inhibit) { > >> + cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD; > >> + } > >> + > >> + if (opts->has_force_copy && opts->force_copy) { > >> + cfg.bind_flags |= XDP_COPY; > >> + } > >> + > >> + queue_id = s->nc.queue_index; > >> + if (opts->has_start_queue && opts->start_queue > 0) { > >> + queue_id += opts->start_queue; > >> + } > >> + > >> + if (opts->has_mode) { > >> + /* Specific mode requested. */ > >> + cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE) > >> + ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE; > >> + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, > >> + s->umem, &s->rx, &s->tx, &cfg)) { > >> + error = errno; > >> + } > >> + } else { > >> + /* No mode requested, try native first. */ > >> + cfg.xdp_flags |= XDP_FLAGS_DRV_MODE; > >> + > >> + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, > >> + s->umem, &s->rx, &s->tx, &cfg)) { > >> + /* Can't use native mode, try skb. */ > >> + cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE; > >> + cfg.xdp_flags |= XDP_FLAGS_SKB_MODE; > >> + > >> + if (xsk_socket__create(&s->xsk, s->ifname, queue_id, > >> + s->umem, &s->rx, &s->tx, &cfg)) { > >> + error = errno; > >> + } > >> + } > >> + } > >> + > >> + if (error) { > >> + error_setg_errno(errp, error, > >> + "failed to create AF_XDP socket for %s queue_id: > >> %d", > >> + s->ifname, queue_id); > >> + return -1; > >> + } > >> + > >> + if (s->inhibit && xsks_map_fd >= 0) { > >> + int xsk_fd = xsk_socket__fd(s->xsk); > >> + > >> + /* Need to update the map manually, libxdp skipped that step. */ > >> + error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0); > >> + if (error) { > >> + error_setg_errno(errp, error, > >> + "failed to update xsks map for %s queue_id: > >> %d", > >> + s->ifname, queue_id); > >> + return -1; > >> + } > >> + } > >> + > >> + s->xdp_flags = cfg.xdp_flags; > >> + > >> + return 0; > >> +} > >> + > >> +/* NetClientInfo methods. */ > >> +static NetClientInfo net_af_xdp_info = { > >> + .type = NET_CLIENT_DRIVER_AF_XDP, > >> + .size = sizeof(AFXDPState), > >> + .receive = af_xdp_receive, > >> + .poll = af_xdp_poll, > >> + .cleanup = af_xdp_cleanup, > >> +}; > >> + > >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD > >> +static int *parse_socket_fds(const char *sock_fds_str, > >> + int64_t n_expected, Error **errp) > >> +{ > >> + gchar **substrings = g_strsplit(sock_fds_str, ":", -1); > >> + int64_t i, n_sock_fds = g_strv_length(substrings); > >> + int *sock_fds = NULL; > >> + > >> + if (n_sock_fds != n_expected) { > >> + error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64, > >> + n_expected, n_sock_fds); > >> + goto exit; > >> + } > >> + > >> + sock_fds = g_new(int, n_sock_fds); > >> + > >> + for (i = 0; i < n_sock_fds; i++) { > >> + sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], > >> errp); > >> + if (sock_fds[i] < 0) { > >> + g_free(sock_fds); > >> + sock_fds = NULL; > >> + goto exit; > >> + } > >> + } > >> + > >> +exit: > >> + g_strfreev(substrings); > >> + return sock_fds; > >> +} > >> +#endif > >> + > >> +/* > >> + * The exported init function. > >> + * > >> + * ... -net af-xdp,ifname="..." > > > > This is the legacy command line, let's say -netdev af-xdp,... > > Sure. > > > > >> + */ > >> +int net_init_af_xdp(const Netdev *netdev, > >> + const char *name, NetClientState *peer, Error **errp) > >> +{ > >> + const NetdevAFXDPOptions *opts = &netdev->u.af_xdp; > >> + NetClientState *nc, *nc0 = NULL; > >> + unsigned int ifindex; > >> + uint32_t prog_id = 0; > >> + int *sock_fds = NULL; > >> + int xsks_map_fd = -1; > >> + int64_t i, queues; > >> + Error *err = NULL; > >> + AFXDPState *s; > >> + > >> + ifindex = if_nametoindex(opts->ifname); > >> + if (!ifindex) { > >> + error_setg_errno(errp, errno, "failed to get ifindex for '%s'", > >> + opts->ifname); > >> + return -1; > >> + } > >> + > >> + queues = opts->has_queues ? opts->queues : 1; > >> + if (queues < 1) { > >> + error_setg(errp, "invalid number of queues (%" PRIi64 ") for > >> '%s'", > >> + queues, opts->ifname); > >> + return -1; > >> + } > >> + > >> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD > >> + if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) { > >> + error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' > >> together"); > >> + return -1; > >> + } > >> +#else > >> + if ((opts->has_inhibit && opts->inhibit) > >> + != (opts->xsks_map_fd || opts->sock_fds)) { > >> + error_setg(errp, "'inhibit=on' should be used together with " > >> + "'sock-fds' or 'xsks-map-fd'"); > >> + return -1; > >> + } > >> + > >> + if (opts->xsks_map_fd && opts->sock_fds) { > >> + error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually > >> exclusive"); > >> + return -1; > >> + } > >> + > >> + if (opts->sock_fds) { > >> + sock_fds = parse_socket_fds(opts->sock_fds, queues, errp); > >> + if (!sock_fds) { > >> + return -1; > >> + } > >> + } > >> +#endif > >> + > >> + if (opts->xsks_map_fd) { > >> + xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, > >> errp); > >> + if (xsks_map_fd < 0) { > >> + goto err; > >> + } > >> + } > >> + > >> + for (i = 0; i < queues; i++) { > >> + nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name); > >> + qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname); > >> + nc->queue_index = i; > >> + > >> + if (!nc0) { > >> + nc0 = nc; > >> + } > >> + > >> + s = DO_UPCAST(AFXDPState, nc, nc); > >> + > >> + pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname); > >> + s->ifindex = ifindex; > >> + s->n_queues = queues; > >> + > >> + if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp) > >> + || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) { > >> + /* Make sure the XDP program will be removed. */ > >> + s->n_queues = i; > >> + error_propagate(errp, err); > >> + goto err; > >> + } > >> + } > >> + > >> + if (nc0) { > >> + s = DO_UPCAST(AFXDPState, nc, nc0); > >> + if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || > >> !prog_id) { > >> + error_setg_errno(errp, errno, > >> + "no XDP program loaded on '%s', ifindex: %d", > >> + s->ifname, s->ifindex); > >> + goto err; > >> + } > >> + } > >> + > >> + af_xdp_read_poll(s, true); /* Initially only poll for reads. */ > >> + > >> + return 0; > >> + > >> +err: > >> + g_free(sock_fds); > >> + if (nc0) { > >> + qemu_del_net_client(nc0); > >> + } > >> + > >> + return -1; > >> +} > >> diff --git a/net/clients.h b/net/clients.h > >> index ed8bdfff1e..be53794582 100644 > >> --- a/net/clients.h > >> +++ b/net/clients.h > >> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char > >> *name, > >> NetClientState *peer, Error **errp); > >> #endif > >> > >> +#ifdef CONFIG_AF_XDP > >> +int net_init_af_xdp(const Netdev *netdev, const char *name, > >> + NetClientState *peer, Error **errp); > >> +#endif > >> + > >> int net_init_vhost_user(const Netdev *netdev, const char *name, > >> NetClientState *peer, Error **errp); > >> > >> diff --git a/net/meson.build b/net/meson.build > >> index bdf564a57b..61628d4684 100644 > >> --- a/net/meson.build > >> +++ b/net/meson.build > >> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c')) > >> if have_netmap > >> system_ss.add(files('netmap.c')) > >> endif > >> + > >> +system_ss.add(when: libxdp, if_true: files('af-xdp.c')) > >> + > >> if have_vhost_net_user > >> system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: > >> files('vhost-user.c'), if_false: files('vhost-user-stub.c')) > >> system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c')) > >> diff --git a/net/net.c b/net/net.c > >> index 6492ad530e..127f70932b 100644 > >> --- a/net/net.c > >> +++ b/net/net.c > >> @@ -1082,6 +1082,9 @@ static int (* const > >> net_client_init_fun[NET_CLIENT_DRIVER__MAX])( > >> #ifdef CONFIG_NETMAP > >> [NET_CLIENT_DRIVER_NETMAP] = net_init_netmap, > >> #endif > >> +#ifdef CONFIG_AF_XDP > >> + [NET_CLIENT_DRIVER_AF_XDP] = net_init_af_xdp, > >> +#endif > >> #ifdef CONFIG_NET_BRIDGE > >> [NET_CLIENT_DRIVER_BRIDGE] = net_init_bridge, > >> #endif > >> @@ -1186,6 +1189,9 @@ void show_netdevs(void) > >> #ifdef CONFIG_NETMAP > >> "netmap", > >> #endif > >> +#ifdef CONFIG_AF_XDP > >> + "af-xdp", > >> +#endif > >> #ifdef CONFIG_POSIX > >> "vhost-user", > >> #endif > >> diff --git a/qapi/net.json b/qapi/net.json > >> index db67501308..88f2c982c2 100644 > >> --- a/qapi/net.json > >> +++ b/qapi/net.json > >> @@ -408,6 +408,62 @@ > >> 'ifname': 'str', > >> '*devname': 'str' } } > >> > >> +## > >> +# @AFXDPMode: > >> +# > >> +# Attach mode for a default XDP program > >> +# > >> +# @skb: generic mode, no driver support necessary > >> +# > >> +# @native: DRV mode, program is attached to a driver, packets are passed > >> to > >> +# the socket without allocation of skb. > >> +# > >> +# Since: 8.1 > > > > I'd make it for 8.2. > > OK. > > > > >> +## > >> +{ 'enum': 'AFXDPMode', > >> + 'data': [ 'native', 'skb' ] } > >> + > >> +## > >> +# @NetdevAFXDPOptions: > >> +# > >> +# AF_XDP network backend > >> +# > >> +# @ifname: The name of an existing network interface. > >> +# > >> +# @mode: Attach mode for a default XDP program. If not specified, then > >> +# 'native' will be tried first, then 'skb'. > >> +# > >> +# @force-copy: Force XDP copy mode even if device supports zero-copy. > >> +# (default: false) > >> +# > >> +# @queues: number of queues to be used for multiqueue interfaces > >> (default: 1). > >> +# > >> +# @start-queue: Use @queues starting from this queue number (default: 0). > >> +# > >> +# @inhibit: Don't load a default XDP program, use one already loaded to > >> +# the interface (default: false). Requires @sock-fds or @xsks-map-fd. > >> +# > >> +# @sock-fds: A colon (:) separated list of file descriptors for already > >> open > >> +# but not bound AF_XDP sockets in the queue order. One fd per queue. > >> +# These descriptors should already be added into XDP socket map for > >> +# corresponding queues. Requires @inhibit. > >> +# > >> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in > >> +# the already loaded XDP program. Requires @inhibit. > >> +# > >> +# Since: 8.1 > >> +## > >> +{ 'struct': 'NetdevAFXDPOptions', > >> + 'data': { > >> + 'ifname': 'str', > >> + '*mode': 'AFXDPMode', > >> + '*force-copy': 'bool', > >> + '*queues': 'int', > >> + '*start-queue': 'int', > >> + '*inhibit': 'bool', > >> + '*sock-fds': { 'type': 'str', 'if': > >> 'HAVE_XSK_UMEM__CREATE_WITH_FD' }, > > The paramater is defined conditionally hare and it will not be > compiled in, if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined. Right, I missed that. > > >> + '*xsks-map-fd': 'str' } } > >> + > >> ## > >> # @NetdevVhostUserOptions: > >> # > >> @@ -642,13 +698,14 @@ > >> # @vmnet-bridged: since 7.1 > >> # @stream: since 7.2 > >> # @dgram: since 7.2 > >> +# @af-xdp: since 8.1 > >> # > >> # Since: 2.7 > >> ## > >> { 'enum': 'NetClientDriver', > >> 'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream', > >> 'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user', > >> - 'vhost-vdpa', > >> + 'vhost-vdpa', 'af-xdp', > >> { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' }, > >> { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' }, > >> { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] } > >> @@ -680,6 +737,7 @@ > >> 'bridge': 'NetdevBridgeOptions', > >> 'hubport': 'NetdevHubPortOptions', > >> 'netmap': 'NetdevNetmapOptions', > >> + 'af-xdp': 'NetdevAFXDPOptions', > >> 'vhost-user': 'NetdevVhostUserOptions', > >> 'vhost-vdpa': 'NetdevVhostVDPAOptions', > >> 'vmnet-host': { 'type': 'NetdevVmnetHostOptions', > >> diff --git a/qemu-options.hx b/qemu-options.hx > >> index b57489d7ca..d91610701c 100644 > >> --- a/qemu-options.hx > >> +++ b/qemu-options.hx > >> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev, > >> " VALE port (created on the fly) called 'name' > >> ('nmname' is name of the \n" > >> " netmap device, defaults to '/dev/netmap')\n" > >> #endif > >> +#ifdef CONFIG_AF_XDP > >> + "-netdev > >> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n" > >> + " > >> [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n" > >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD > >> + " [,sock-fds=x:y:...:z]\n" > >> +#endif > >> + " attach to the existing network interface 'name' with > >> AF_XDP socket\n" > >> + " use 'mode=MODE' to specify an XDP program attach > >> mode\n" > >> + " use 'force-copy=on|off' to force XDP copy mode even > >> if device supports zero-copy (default: off)\n" > >> + " use 'inhibit=on|off' to inhibit loading of a default > >> XDP program (default: off)\n" > >> + " with inhibit=on,\n" > >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD > >> + " use 'sock-fds' to provide file descriptors for > >> already open XDP sockets\n" > >> + " added to a socket map in XDP program. One socket > >> per queue. Or\n" > >> +#endif > >> + " use 'xsks-map-fd=k' to provide a file descriptor > >> for xsks map\n" > >> + " use 'queues=n' to specify how many queues of a > >> multiqueue interface should be used\n" > >> + " use 'start-queue=m' to specify the first queue that > >> should be used\n" > >> +#endif > >> #ifdef CONFIG_POSIX > >> "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n" > >> " configure a vhost-user network, backed by a chardev > >> 'dev'\n" > >> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic, > >> #ifdef CONFIG_NETMAP > >> "netmap|" > >> #endif > >> +#ifdef CONFIG_AF_XDP > >> + "af-xdp|" > >> +#endif > >> #ifdef CONFIG_POSIX > >> "vhost-user|" > >> #endif > >> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, > >> #ifdef CONFIG_NETMAP > >> "netmap|" > >> #endif > >> +#ifdef CONFIG_AF_XDP > >> + "af-xdp|" > >> +#endif > >> #ifdef CONFIG_VMNET > >> "vmnet-host|vmnet-shared|vmnet-bridged|" > >> #endif > >> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, > >> " old way to initialize a host network interface\n" > >> " (use the -netdev option if possible instead)\n", > >> QEMU_ARCH_ALL) > >> SRST > >> -``-nic > >> [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]`` > >> +``-nic > >> [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]`` > >> This option is a shortcut for configuring both the on-board > >> (default) guest NIC hardware and the host network backend in one go. > >> The host backend options are the same as with the corresponding > >> @@ -3350,6 +3375,62 @@ SRST > >> # launch QEMU instance > >> |qemu_system| linux.img -nic vde,sock=/tmp/myswitch > >> > >> +``-netdev > >> af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]`` > >> + Configure AF_XDP backend to connect to a network interface 'name' > >> + using AF_XDP socket. A specific program attach mode for a default > >> + XDP program can be forced with 'mode', defaults to best-effort, > >> + where the likely most performant mode will be in use. Number of > >> queues > >> + 'n' should generally match the number or queues in the interface, > >> + defaults to 1. Traffic arriving on non-configured device queues will > >> + not be delivered to the network backend. > >> + > >> + .. parsed-literal:: > >> + > >> + # set number of queues to 1 > >> + ethtool -L eth0 combined 4 > >> + # launch QEMU instance > >> + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ > >> + -netdev af-xdp,id=n1,ifname=eth0,queues=4 > >> + > >> + 'start-queue' option can be specified if a particular range of queues > >> + [m, m + n] should be in use. For example, this is necessary in order > >> + to use MLX NICs in native mode. The driver will create a separate set > >> + of queues on top of regular ones, and only these queues can be used > >> + for AF_XDP sockets. MLX NICs will also require an additional traffic > >> + redirection with ethtool to these queues. > > > > Let's avoid mentioning any vendor name unless the emulation is done > > for a specific one. > > AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this > strange fashion, so these are the only NICs that will not work without > start-queue configuration and the extra traffic re-direction. Unfortunately, > this is also not documented anywhere, so you just have to know that it works > this way... > > So, one one hand, I understand that it's not the place for vendor-specific > docs. On the other, it gives qemu users a chance to not waste a lot of time > trying to figure out why everything is configured correctly, but the traffic > doesn't flow. > > Maybe something like this: > > 'start-queue' option can be specified if a particular range of queues > [m, m + n] should be in use. For example, this is may be necessary in > order to use certain NICs in native mode. Kernel allows the driver to > create a separate set of XDP queues on top of regular ones, and only > these queues can be used for AF_XDP sockets. NICs that work this way > may also require an additional traffic redirection with ethtool to these > special queues. > > .. parsed-literal:: > > # set number of queues to 1 > ethtool -L eth0 combined 1 > # redirect all the traffic to the second queue (id: 1) > # note: drivers may require non-empty key/mask pair. > ethtool -N eth0 flow-type ether \\ > dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1 > ethtool -N eth0 flow-type ether \\ > dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1 > # launch QEMU instance > |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ > -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1 > > What do you think? This is great. > > > > >> + E.g.: > >> + > >> + .. parsed-literal:: > >> + > >> + # set number of queues to 1 > >> + ethtool -L eth0 combined 1 > >> + # redirect all the traffic to the second queue (id: 1) > >> + # note: mlx5 driver requires non-empty key/mask pair. > >> + ethtool -N eth0 flow-type ether \\ > >> + dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1 > >> + ethtool -N eth0 flow-type ether \\ > >> + dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1 > >> + # launch QEMU instance > >> + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ > >> + -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1 > >> + > >> + XDP program can also be loaded externally. In this case 'inhibit' > >> option > >> + should be set to 'on' and 'xsks-map-fd' provided with a file > >> descriptor > >> + for an open XDP socket map of that program, or 'sock-fds' with file > >> + descriptors for already open but not bound XDP sockets already added > >> to a > >> + socket map for corresponding queues. One socket per queue. > >> + > >> + .. parsed-literal:: > >> + > >> + |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\ > >> + -netdev > >> af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17 > >> + > >> + With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB > >> + of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK > >> capability. > >> + With 'inhibit=on' and 'xsks-map-fd' it will additionally require > >> + CAP_NET_RAW capability. With 'inhibit=off', CAP_SYS/NET_ADMIN should > >> be > >> + added as well. > > > > This requires the synchronization with the code changes, I suggest not > > to describe: > > > > 1) actual numbers of memory requirement > > 2) capabilites required for each mode, we don't do that for other > > netdev like tap. > > Makes sense. So, should we just strike the paragraph above entirely? I think so. Thanks > > > > > Thanks > > > > > > > > > >> + > >> + > >> ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]`` > >> Establish a vhost-user netdev, backed by a chardev id. The chardev > >> should be a unix domain socket backed one. The vhost-user uses a > >> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure > >> b/scripts/ci/org.centos/stream/8/x86_64/configure > >> index d02b09a4b9..7585c4c4ed 100755 > >> --- a/scripts/ci/org.centos/stream/8/x86_64/configure > >> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure > >> @@ -35,6 +35,7 @@ > >> --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \ > >> --with-coroutine=ucontext \ > >> --tls-priority=@QEMU,SYSTEM \ > >> +--disable-af-xdp \ > >> --disable-attr \ > >> --disable-auth-pam \ > >> --disable-avx2 \ > >> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh > >> index 7dd5709ef4..162e455309 100644 > >> --- a/scripts/meson-buildoptions.sh > >> +++ b/scripts/meson-buildoptions.sh > >> @@ -74,6 +74,7 @@ meson_options_help() { > >> printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if > >> available' > >> printf "%s\n" '(unless built with --without-default-features):' > >> printf "%s\n" '' > >> + printf "%s\n" ' af-xdp AF_XDP network backend support' > >> printf "%s\n" ' alsa ALSA sound support' > >> printf "%s\n" ' attr attr/xattr support' > >> printf "%s\n" ' auth-pam PAM access control' > >> @@ -207,6 +208,8 @@ meson_options_help() { > >> } > >> _meson_option_parse() { > >> case $1 in > >> + --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;; > >> + --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;; > >> --enable-alsa) printf "%s" -Dalsa=enabled ;; > >> --disable-alsa) printf "%s" -Dalsa=disabled ;; > >> --enable-attr) printf "%s" -Dattr=enabled ;; > >> diff --git a/tests/docker/dockerfiles/debian-amd64.docker > >> b/tests/docker/dockerfiles/debian-amd64.docker > >> index e39871c7bb..207f7adfb9 100644 > >> --- a/tests/docker/dockerfiles/debian-amd64.docker > >> +++ b/tests/docker/dockerfiles/debian-amd64.docker > >> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \ > >> libvirglrenderer-dev \ > >> libvte-2.91-dev \ > >> libxen-dev \ > >> + libxdp-dev \ > >> libzstd-dev \ > >> llvm \ > >> locales \ > >> -- > >> 2.40.1 > >> > > >