On Thu, Jul 20, 2023 at 9:26 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 7/20/23 09:37, Jason Wang wrote:
> > On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>
> >> AF_XDP is a network socket family that allows communication directly
> >> with the network device driver in the kernel, bypassing most or all
> >> of the kernel networking stack.  In essence, the technology is
> >> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> >> and works with any network interfaces without driver modifications.
> >> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> >> require access to character devices or unix sockets.  Only access to
> >> the network interface itself is necessary.
> >>
> >> This patch implements a network backend that communicates with the
> >> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> >> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> >> Fill and Completion) are placed in that memory along with a pool of
> >> memory buffers for the packet data.  Data transmission is done by
> >> allocating one of the buffers, copying packet data into it and
> >> placing the pointer into the Tx ring.  After transmission, the device
> >> will return the buffer via the Completion ring.  On Rx, the device will
> >> take a buffer from the pre-populated Fill ring, write the packet data
> >> into it and place the buffer into the Rx ring.
> >>
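For illustration, here is a rough, untested sketch of the Tx half of that
flow using the libxdp ring helpers.  It mirrors the af_xdp_receive() /
af_xdp_complete_tx() code further down in the patch; the function name and
the tx/cq/umem_area/frame_addr parameters are placeholders, and buffer-pool
management plus error handling are left out:

    #include <stdint.h>
    #include <string.h>
    #include <xdp/xsk.h>

    /* Hypothetical helper: enqueue one packet and reap completions. */
    static void sketch_tx_one(struct xsk_ring_prod *tx, struct xsk_ring_cons *cq,
                              void *umem_area, uint64_t frame_addr,
                              const void *pkt, uint32_t pkt_len)
    {
        uint32_t idx, i, done;

        /* Tx: take a free umem frame, copy the packet in, hand it to the kernel. */
        if (xsk_ring_prod__reserve(tx, 1, &idx) == 1) {
            struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);

            desc->addr = frame_addr;            /* offset into the umem area */
            desc->len = pkt_len;
            memcpy(xsk_umem__get_data(umem_area, desc->addr), pkt, pkt_len);
            xsk_ring_prod__submit(tx, 1);
        }

        /* Later: reclaim transmitted frames from the Completion ring. */
        done = xsk_ring_cons__peek(cq, 64, &idx);
        for (i = 0; i < done; i++) {
            uint64_t addr = *xsk_ring_cons__comp_addr(cq, idx++);
            (void)addr;                         /* would go back into a free pool */
        }
        xsk_ring_cons__release(cq, done);
    }

The Rx side is symmetric: frames handed to the kernel through the Fill ring
come back through the Rx ring with packet data written into them.
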
> >> The AF_XDP network backend takes on the communication with the host
> >> kernel and the network interface and forwards packets to/from the
> >> peer device in QEMU.
> >>
> >> Usage example:
> >>
> >>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>
> >> XDP program bridges the socket with a network interface.  It can be
> >> attached to the interface in 2 different modes:
> >>
> >> 1. skb - this mode should work for any interface and doesn't require
> >>          driver support, at the cost of lower performance.
> >>
> >> 2. native - this does require support from the driver and allows
> >>             bypassing skb allocation in the kernel and potentially using
> >>             zero-copy while getting packets in/out of userspace.
> >>
> >> By default, QEMU will try to use native mode and fall back to skb mode.
> >> The mode can be forced via the 'mode' option.  To force 'copy' even in
> >> native mode, use the 'force-copy=on' option.  This might be useful if
> >> there is some issue with the driver.
> >>
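For example, building on the usage example above (hypothetical command line,
same interface name assumed), native mode with copying forced would be:

  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,force-copy=on
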
> >> The 'queues=N' option allows specifying how many device queues should
> >> be open.  Note that all the queues that are not open are still
> >> functional and can receive traffic, but it will not be delivered to
> >> QEMU.  So, the number of device queues should generally match the
> >> QEMU configuration, unless the device is shared with something
> >> else and the traffic re-direction to appropriate queues is correctly
> >> configured on a device level (e.g. with ethtool -N).
> >> The 'start-queue=M' option can be used to specify from which queue id
> >> QEMU should start configuring 'N' queues.  It might also be necessary
> >> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> >> for examples.
> >>
> >> In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> >> capabilities in order to load default XSK/XDP programs to the
> >> network interface and configure BPF maps.  It is possible, however,
> >> to run with no capabilities.  For that to work, an external process
> >> with admin capabilities will need to pre-load the default XSK program,
> >> create AF_XDP sockets and pass their file descriptors to the QEMU
> >> process on startup via the 'sock-fds' option.  The network backend will
> >> need to be configured with 'inhibit=on' to avoid loading the program.
> >> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
> >> or CAP_IPC_LOCK.
> >>
> >> Alternatively, the file descriptor for 'xsks_map' can be passed via the
> >> 'xsks-map-fd=N' option instead of passing socket file descriptors.
> >> That will additionally require CAP_NET_RAW on the QEMU side.  This is
> >> useful because 'sock-fds' may not be available with older libxdp.
> >> 'sock-fds' requires libxdp >= 1.4.0.
> >>
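As a hypothetical sketch of the privileged side of that workflow (the
xsks_map_fd and queue_id names are assumptions, loading the XDP program and
passing the resulting fd to QEMU are left out; the map update mirrors what
the backend itself does in the inhibit case):

    #include <sys/socket.h>   /* AF_XDP needs a reasonably recent glibc/kernel */
    #include <bpf/bpf.h>

    /* Create an unbound AF_XDP socket for one queue and publish it in the
     * xsks map of the pre-loaded XDP program.  The returned fd is what
     * would be handed to QEMU via 'sock-fds'. */
    static int sketch_prepare_queue(int xsks_map_fd, int queue_id)
    {
        int sock_fd = socket(AF_XDP, SOCK_RAW, 0);

        if (sock_fd < 0 ||
            bpf_map_update_elem(xsks_map_fd, &queue_id, &sock_fd, 0)) {
            return -1;
        }
        return sock_fd; /* pass to QEMU, e.g. via fd inheritance or SCM_RIGHTS */
    }
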
> >> There are a few performance challenges with the current network backends.
> >>
> >> The first is that they do not support IO threads.  This means that the
> >> data path is handled by the main thread in QEMU and may slow down other
> >> work or may be slowed down by some other work.  This also means that
> >> taking advantage of multi-queue is generally not possible today.
> >>
> >> Another is that the data path goes through the device emulation
> >> code, which is not really optimized for performance.  The fastest
> >> "frontend" device is virtio-net.  But it's not optimized for heavy
> >> traffic either, because it expects such use-cases to be handled via
> >> some implementation of vhost (user, kernel, vdpa).  In practice, we
> >> have virtio notifications and rcu lock/unlock on a per-packet basis
> >> and not very efficient accesses to the guest memory.  Communication
> >> channels between backend and frontend devices do not allow passing
> >> more than one packet at a time either.
> >>
> >> Some of these challenges can be avoided in the future by adding better
> >> batching into device emulation or by implementing vhost-af-xdp variant.
> >>
> >> There are also a few kernel limitations.  AF_XDP sockets do not
> >> support any kind of checksum or segmentation offloading.  Buffers
> >> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> >> support implementation for AF_XDP is in progress, but not ready yet.
> >> Also, transmission in all non-zero-copy modes is synchronous, i.e.
> >> done in a syscall.  That doesn't allow high packet rates on virtual
> >> interfaces.
> >>
> >> However, keeping in mind all of these challenges, the current
> >> implementation of the AF_XDP backend shows decent performance while
> >> running on top of a physical NIC with zero-copy support.
> >>
> >> Test setup:
> >>
> >> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> >> Network backend is configured to open the NIC directly in native mode.
> >> The driver supports zero-copy.  NIC is configured to use 1 queue.
> >>
> >> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> >> for PPS testing.
> >>
> >> iperf3 result:
> >>  TCP stream      : 19.1 Gbps
> >>
> >> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >>  Tx only         : 3.4 Mpps
> >>  Rx only         : 2.0 Mpps
> >>  L2 FWD Loopback : 1.5 Mpps
> >>
> >> In skb mode the same setup shows much lower performance, similar to
> >> the setup where the pair of physical NICs is replaced with a veth pair:
> >>
> >> iperf3 result:
> >>   TCP stream      : 9 Gbps
> >>
> >> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >>   Tx only         : 1.2 Mpps
> >>   Rx only         : 1.0 Mpps
> >>   L2 FWD Loopback : 0.7 Mpps
> >>
> >> Results in skb mode or over the veth pair are close to those of a tap
> >> backend with vhost=on and segmentation offloading disabled, bridged
> >> with a NIC.
> >>
> >> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> >
> > Looks good overall, see a few comments inline.
>
> Thanks for review!
>
> >
> >> ---
> >>
> >> Version 2:
> >>
> >>   - Added support for running with no capabilities by passing
> >>     pre-created AF_XDP socket file descriptors via 'sock-fds' option.
> >>     Conditionally compiled because it requires the unreleased libxdp 1.4.0.
> >>     The remaining restriction is having 32 MB of RLIMIT_MEMLOCK per queue.
> >>
> >>   - Refined and extended documentation.
> >>
> >>
> >>  MAINTAINERS                                   |   4 +
> >>  hmp-commands.hx                               |   2 +-
> >>  meson.build                                   |  19 +
> >>  meson_options.txt                             |   2 +
> >>  net/af-xdp.c                                  | 570 ++++++++++++++++++
> >>  net/clients.h                                 |   5 +
> >>  net/meson.build                               |   3 +
> >>  net/net.c                                     |   6 +
> >>  qapi/net.json                                 |  60 +-
> >>  qemu-options.hx                               |  83 ++-
> >>  .../ci/org.centos/stream/8/x86_64/configure   |   1 +
> >>  scripts/meson-buildoptions.sh                 |   3 +
> >>  tests/docker/dockerfiles/debian-amd64.docker  |   1 +
> >>  13 files changed, 756 insertions(+), 3 deletions(-)
> >>  create mode 100644 net/af-xdp.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 7164cf55a1..80d4ba4004 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
> >>  S: Maintained
> >>  F: net/netmap.c
> >>
> >> +AF_XDP network backend
> >> +R: Ilya Maximets <i.maxim...@ovn.org>
> >> +F: net/af-xdp.c
> >> +
> >>  Host Memory Backends
> >>  M: David Hildenbrand <da...@redhat.com>
> >>  M: Igor Mammedov <imamm...@redhat.com>
> >> diff --git a/hmp-commands.hx b/hmp-commands.hx
> >> index 2cbd0f77a0..af9ffe4681 100644
> >> --- a/hmp-commands.hx
> >> +++ b/hmp-commands.hx
> >> @@ -1295,7 +1295,7 @@ ERST
> >>      {
> >>          .name       = "netdev_add",
> >>          .args_type  = "netdev:O",
> >> -        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
> >> +        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
> >>  #ifdef CONFIG_VMNET
> >>                        "|vmnet-host|vmnet-shared|vmnet-bridged"
> >>  #endif
> >> diff --git a/meson.build b/meson.build
> >> index a9ba0bfab3..1f8772ea5d 100644
> >> --- a/meson.build
> >> +++ b/meson.build
> >> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
> >>    endif
> >>  endif
> >>
> >> +# libxdp
> >> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
> >> +if libxdp.found() and \
> >> +      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
> >> +  libxdp = not_found
> >> +  if get_option('af_xdp').enabled()
> >> +    error('af-xdp support requires libbpf version >= 0.7')
> >
> > Can we simply limit this to 1.4?
>
> This is a check for libbpf, not libxdp.  Or do you think there is no need
> to check libbpf version if we request libxdp version high enough?
> Users may still break the build by installing old libbpf manually even if
> distributions ship more modern versions.
>
> Or do you mean limit the libxdp version to 1.4 in order to avoid the
> conditional on HAVE_XSK_UMEM__CREATE_WITH_FD?

Yes.

> The problem with that is that libxdp 1.4 is a week+ old, so not available in
> any distribution, AFAIK.  Not sure how big of a problem that is though.

It doesn't matter as this is a brand new backend; it would simplify
future maintenance if we can get rid of any HAVE_XXX macros.

>
> >
> >
> >> +  else
> >> +    warning('af-xdp support requires libbpf version >= 0.7, disabling')
> >> +  endif
> >> +endif
> >> +
> >>  # libdw
> >>  libdw = not_found
> >>  if not get_option('libdw').auto() or \
> >> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
> >>  config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
> >>  config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
> >>  config_host_data.set('CONFIG_EBPF', libbpf.found())
> >> +config_host_data.set('CONFIG_AF_XDP', libxdp.found())
> >> +if libxdp.found()
> >> +  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
> >> +                       cc.has_function('xsk_umem__create_with_fd',
> >> +                                       dependencies: libxdp))
> >> +endif
> >>  config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
> >>  config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
> >>  config_host_data.set('CONFIG_LIBNFS', libnfs.found())
> >> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
> >>  summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
> >>  summary_info += {'libcap-ng support': libcap_ng}
> >>  summary_info += {'bpf support':       libbpf}
> >> +summary_info += {'AF_XDP support':    libxdp}
> >>  summary_info += {'rbd support':       rbd}
> >>  summary_info += {'smartcard support': cacard}
> >>  summary_info += {'U2F support':       u2f}
> >> diff --git a/meson_options.txt b/meson_options.txt
> >> index bbb5c7e886..f4e950ce6a 100644
> >> --- a/meson_options.txt
> >> +++ b/meson_options.txt
> >> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
> >>  option('keyring', type: 'feature', value: 'auto',
> >>         description: 'Linux keyring support')
> >>
> >> +option('af_xdp', type : 'feature', value : 'auto',
> >> +       description: 'AF_XDP network backend support')
> >>  option('attr', type : 'feature', value : 'auto',
> >>         description: 'attr/xattr support')
> >>  option('auth_pam', type : 'feature', value : 'auto',
> >> diff --git a/net/af-xdp.c b/net/af-xdp.c
> >> new file mode 100644
> >> index 0000000000..265ba6b12e
> >> --- /dev/null
> >> +++ b/net/af-xdp.c
> >> @@ -0,0 +1,570 @@
> >> +/*
> >> + * AF_XDP network backend.
> >> + *
> >> + * Copyright (c) 2023 Red Hat, Inc.
> >> + *
> >> + * Authors:
> >> + *  Ilya Maximets <i.maxim...@ovn.org>
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> + * See the COPYING file in the top-level directory.
> >> + */
> >> +
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include <bpf/bpf.h>
> >> +#include <inttypes.h>
> >> +#include <linux/if_link.h>
> >> +#include <linux/if_xdp.h>
> >> +#include <net/if.h>
> >> +#include <xdp/xsk.h>
> >> +
> >> +#include "clients.h"
> >> +#include "monitor/monitor.h"
> >> +#include "net/net.h"
> >> +#include "qapi/error.h"
> >> +#include "qemu/cutils.h"
> >> +#include "qemu/error-report.h"
> >> +#include "qemu/iov.h"
> >> +#include "qemu/main-loop.h"
> >> +#include "qemu/memalign.h"
> >> +
> >> +
> >> +typedef struct AFXDPState {
> >> +    NetClientState       nc;
> >> +
> >> +    struct xsk_socket    *xsk;
> >> +    struct xsk_ring_cons rx;
> >> +    struct xsk_ring_prod tx;
> >> +    struct xsk_ring_cons cq;
> >> +    struct xsk_ring_prod fq;
> >> +
> >> +    char                 ifname[IFNAMSIZ];
> >> +    int                  ifindex;
> >> +    bool                 read_poll;
> >> +    bool                 write_poll;
> >> +    uint32_t             outstanding_tx;
> >> +
> >> +    uint64_t             *pool;
> >> +    uint32_t             n_pool;
> >> +    char                 *buffer;
> >> +    struct xsk_umem      *umem;
> >> +
> >> +    uint32_t             n_queues;
> >> +    uint32_t             xdp_flags;
> >> +    bool                 inhibit;
> >> +} AFXDPState;
> >> +
> >> +#define AF_XDP_BATCH_SIZE 64
> >> +
> >> +static void af_xdp_send(void *opaque);
> >> +static void af_xdp_writable(void *opaque);
> >> +
> >> +/* Set the event-loop handlers for the af-xdp backend. */
> >> +static void af_xdp_update_fd_handler(AFXDPState *s)
> >> +{
> >> +    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
> >> +                        s->read_poll ? af_xdp_send : NULL,
> >> +                        s->write_poll ? af_xdp_writable : NULL,
> >> +                        s);
> >> +}
> >> +
> >> +/* Update the read handler. */
> >> +static void af_xdp_read_poll(AFXDPState *s, bool enable)
> >> +{
> >> +    if (s->read_poll != enable) {
> >> +        s->read_poll = enable;
> >> +        af_xdp_update_fd_handler(s);
> >> +    }
> >> +}
> >> +
> >> +/* Update the write handler. */
> >> +static void af_xdp_write_poll(AFXDPState *s, bool enable)
> >> +{
> >> +    if (s->write_poll != enable) {
> >> +        s->write_poll = enable;
> >> +        af_xdp_update_fd_handler(s);
> >> +    }
> >> +}
> >> +
> >> +static void af_xdp_poll(NetClientState *nc, bool enable)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +    if (s->read_poll != enable || s->write_poll != enable) {
> >> +        s->write_poll = enable;
> >> +        s->read_poll  = enable;
> >> +        af_xdp_update_fd_handler(s);
> >> +    }
> >> +}
> >> +
> >> +static void af_xdp_complete_tx(AFXDPState *s)
> >> +{
> >> +    uint32_t idx = 0;
> >> +    uint32_t done, i;
> >> +    uint64_t *addr;
> >> +
> >> +    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
> >> +
> >> +    for (i = 0; i < done; i++) {
> >> +        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
> >> +        s->pool[s->n_pool++] = *addr;
> >> +        s->outstanding_tx--;
> >> +    }
> >> +
> >> +    if (done) {
> >> +        xsk_ring_cons__release(&s->cq, done);
> >> +    }
> >> +}
> >> +
> >> +/*
> >> + * The fd_write() callback, invoked if the fd is marked as writable
> >> + * after a poll.
> >> + */
> >> +static void af_xdp_writable(void *opaque)
> >> +{
> >> +    AFXDPState *s = opaque;
> >> +
> >> +    /* Try to recover buffers that are already sent. */
> >> +    af_xdp_complete_tx(s);
> >> +
> >> +    /*
> >> +     * Unregister the handler, unless we still have packets to transmit
> >> +     * and kernel needs a wake up.
> >> +     */
> >> +    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
> >> +        af_xdp_write_poll(s, false);
> >> +    }
> >> +
> >> +    /* Flush any buffered packets. */
> >> +    qemu_flush_queued_packets(&s->nc);
> >> +}
> >> +
> >> +static ssize_t af_xdp_receive(NetClientState *nc,
> >> +                              const uint8_t *buf, size_t size)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +    struct xdp_desc *desc;
> >> +    uint32_t idx;
> >> +    void *data;
> >> +
> >> +    /* Try to recover buffers that are already sent. */
> >> +    af_xdp_complete_tx(s);
> >> +
> >> +    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
> >> +        /* We can't transmit packet this size... */
> >> +        return size;
> >> +    }
> >> +
> >> +    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
> >> +        /*
> >> +         * Out of buffers or space in tx ring.  Poll until we can write.
> >> +         * This will also kick the Tx, if it was waiting on CQ.
> >> +         */
> >> +        af_xdp_write_poll(s, true);
> >> +        return 0;
> >> +    }
> >> +
> >> +    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
> >> +    desc->addr = s->pool[--s->n_pool];
> >> +    desc->len = size;
> >> +
> >> +    data = xsk_umem__get_data(s->buffer, desc->addr);
> >> +    memcpy(data, buf, size);
> >> +
> >> +    xsk_ring_prod__submit(&s->tx, 1);
> >> +    s->outstanding_tx++;
> >> +
> >> +    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
> >> +        af_xdp_write_poll(s, true);
> >> +    }
> >> +
> >> +    return size;
> >> +}
> >> +
> >> +/*
> >> + * Complete a previous send (backend --> guest) and enable the
> >> + * fd_read callback.
> >> + */
> >> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +    af_xdp_read_poll(s, true);
> >> +}
> >> +
> >> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
> >> +{
> >> +    uint32_t i, idx = 0;
> >> +
> >> +    /* Leave one packet for Tx, just in case. */
> >> +    if (s->n_pool < n + 1) {
> >> +        n = s->n_pool;
> >> +    }
> >> +
> >> +    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; i < n; i++) {
> >> +        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
> >> +    }
> >> +    xsk_ring_prod__submit(&s->fq, n);
> >> +
> >> +    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
> >> +        /* Receive was blocked by not having enough buffers.  Wake it up. */
> >> +        af_xdp_read_poll(s, true);
> >> +    }
> >> +}
> >> +
> >> +static void af_xdp_send(void *opaque)
> >> +{
> >> +    uint32_t i, n_rx, idx = 0;
> >> +    AFXDPState *s = opaque;
> >> +
> >> +    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
> >> +    if (!n_rx) {
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; i < n_rx; i++) {
> >> +        const struct xdp_desc *desc;
> >> +        struct iovec iov;
> >> +
> >> +        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
> >> +
> >> +        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
> >> +        iov.iov_len = desc->len;
> >> +
> >> +        s->pool[s->n_pool++] = desc->addr;
> >> +
> >> +        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
> >> +                                     af_xdp_send_completed)) {
> >> +            /*
> >> +             * The peer does not receive anymore.  Packet is queued, stop
> >> +             * reading from the backend until af_xdp_send_completed().
> >> +             */
> >> +            af_xdp_read_poll(s, false);
> >> +
> >> +            /* Re-peek the descriptors to not break the ring cache. */
> >> +            xsk_ring_cons__cancel(&s->rx, n_rx);
> >> +            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
> >
> > The code turns out to be hard to read here.
> >
> > 1) This seems to undo the peek (usually a peek doesn't touch the
> > producer/consumer index, but that's not what xsk_ring_cons__peek() does):
>
> Yeah, it's unfortunate that the peek() function changes the internal
> state, but that is what we have...
>
> >
> > static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons,
> >                                         __u32 nb, __u32 *idx)
> > {
> >         __u32 entries = xsk_cons_nb_avail(cons, nb);
> >
> >         if (entries > 0) {
> >                 *idx = cons->cached_cons;
> >                 cons->cached_cons += entries;
> >         }
> >
> >         return entries;
> > }
> >
> > 2) It looks to me a partial rollback is sufficient?
> >
> > xsk_ring_cons__cancel(n_rx - i + 1)?
>
> Good point.  Should work.  It should be n_rx - i - 1 though, if I'm not
> mistaken.  So:
>
>     xsk_ring_cons__cancel(n_rx - i - 1);
>     n_rx = i + 1;
>
> I'm not sure if that is much easier to read, but that's OK.  Should be
> a touch faster as well.  What do you think?

Let's do that please.
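
For reference, the tail of that error path in af_xdp_send() would then look
roughly like this (xsk_ring_cons__cancel() only rewinds the ring's cached
consumer index, so returning the n_rx - i - 1 peeked-but-unused descriptors
keeps the cache consistent):

            af_xdp_read_poll(s, false);

            /* Return the descriptors peeked above but not used. */
            xsk_ring_cons__cancel(&s->rx, n_rx - i - 1);
            n_rx = i + 1;
            break;

The xsk_ring_cons__release(&s->rx, n_rx) that follows the loop then releases
only the descriptors that were actually handed to the peer.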

>
> >
> >> +            g_assert(n_rx == i + 1);
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    /* Release actually sent descriptors and try to re-fill. */
> >> +    xsk_ring_cons__release(&s->rx, n_rx);
> >> +    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
> >> +}
> >> +
> >> +/* Flush and close. */
> >> +static void af_xdp_cleanup(NetClientState *nc)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +    qemu_purge_queued_packets(nc);
> >> +
> >> +    af_xdp_poll(nc, false);
> >> +
> >> +    xsk_socket__delete(s->xsk);
> >> +    s->xsk = NULL;
> >> +    g_free(s->pool);
> >> +    s->pool = NULL;
> >> +    xsk_umem__delete(s->umem);
> >> +    s->umem = NULL;
> >> +    qemu_vfree(s->buffer);
> >> +    s->buffer = NULL;
> >> +
> >> +    /* Remove the program if it's the last open queue. */
> >> +    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
> >> +        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
> >> +        fprintf(stderr,
> >> +                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
> >> +                s->ifname, s->ifindex);
> >> +    }
> >> +}
> >> +
> >> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
> >> +{
> >> +    struct xsk_umem_config config = {
> >> +        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> >> +        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> >> +        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
> >> +        .frame_headroom = 0,
> >> +    };
> >> +    uint64_t n_descs;
> >> +    uint64_t size;
> >> +    int64_t i;
> >> +    int ret;
> >> +
> >> +    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
> >> +    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
> >> +               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
> >> +    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
> >> +
> >> +    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
> >> +    memset(s->buffer, 0, size);
> >> +
> >> +    if (sock_fd < 0) {
> >> +        ret = xsk_umem__create(&s->umem, s->buffer, size,
> >> +                               &s->fq, &s->cq, &config);
> >> +    } else {
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
> >> +                                       &s->fq, &s->cq, &config);
> >> +#else
> >
> > So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better
> >
> > 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD
>
> The qapi property is conditionally defined, so users will not be able to
> set sock-fds if not supported.  And qemu will complain about unknown
> property.  That should be enough?

Yes.

>
> >
> > or
> >
> > 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD
>
> If we require libxdp 1.4 that will be the case.
>
> >
> >> +        ret = -1;
> >> +        errno = EINVAL;
> >> +#endif
> >> +    }
> >> +
> >> +    if (ret) {
> >> +        qemu_vfree(s->buffer);
> >> +        error_setg_errno(errp, errno,
> >> +                         "failed to create umem for %s queue_index: %d",
> >> +                         s->ifname, s->nc.queue_index);
> >> +        return -1;
> >> +    }
> >> +
> >> +    s->pool = g_new(uint64_t, n_descs);
> >> +    /* Fill the pool in the opposite order, because it's a LIFO queue. */
> >> +    for (i = n_descs - 1; i >= 0; i--) {
> >> +        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
> >> +    }
> >> +    s->n_pool = n_descs;
> >> +
> >> +    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int af_xdp_socket_create(AFXDPState *s,
> >> +                                const NetdevAFXDPOptions *opts,
> >> +                                int xsks_map_fd, Error **errp)
> >> +{
> >> +    struct xsk_socket_config cfg = {
> >> +        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> >> +        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> >> +        .libxdp_flags = 0,
> >> +        .bind_flags = XDP_USE_NEED_WAKEUP,
> >> +        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
> >> +    };
> >> +    int queue_id, error = 0;
> >> +
> >> +    s->inhibit = opts->has_inhibit && opts->inhibit;
> >> +    if (s->inhibit) {
> >> +        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
> >> +    }
> >> +
> >> +    if (opts->has_force_copy && opts->force_copy) {
> >> +        cfg.bind_flags |= XDP_COPY;
> >> +    }
> >> +
> >> +    queue_id = s->nc.queue_index;
> >> +    if (opts->has_start_queue && opts->start_queue > 0) {
> >> +        queue_id += opts->start_queue;
> >> +    }
> >> +
> >> +    if (opts->has_mode) {
> >> +        /* Specific mode requested. */
> >> +        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
> >> +                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
> >> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> >> +                               s->umem, &s->rx, &s->tx, &cfg)) {
> >> +            error = errno;
> >> +        }
> >> +    } else {
> >> +        /* No mode requested, try native first. */
> >> +        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
> >> +
> >> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> >> +                               s->umem, &s->rx, &s->tx, &cfg)) {
> >> +            /* Can't use native mode, try skb. */
> >> +            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
> >> +            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
> >> +
> >> +            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> >> +                                   s->umem, &s->rx, &s->tx, &cfg)) {
> >> +                error = errno;
> >> +            }
> >> +        }
> >> +    }
> >> +
> >> +    if (error) {
> >> +        error_setg_errno(errp, error,
> >> +                         "failed to create AF_XDP socket for %s queue_id: %d",
> >> +                         s->ifname, queue_id);
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (s->inhibit && xsks_map_fd >= 0) {
> >> +        int xsk_fd = xsk_socket__fd(s->xsk);
> >> +
> >> +        /* Need to update the map manually, libxdp skipped that step. */
> >> +        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
> >> +        if (error) {
> >> +            error_setg_errno(errp, error,
> >> +                             "failed to update xsks map for %s queue_id: %d",
> >> +                             s->ifname, queue_id);
> >> +            return -1;
> >> +        }
> >> +    }
> >> +
> >> +    s->xdp_flags = cfg.xdp_flags;
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +/* NetClientInfo methods. */
> >> +static NetClientInfo net_af_xdp_info = {
> >> +    .type = NET_CLIENT_DRIVER_AF_XDP,
> >> +    .size = sizeof(AFXDPState),
> >> +    .receive = af_xdp_receive,
> >> +    .poll = af_xdp_poll,
> >> +    .cleanup = af_xdp_cleanup,
> >> +};
> >> +
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +static int *parse_socket_fds(const char *sock_fds_str,
> >> +                             int64_t n_expected, Error **errp)
> >> +{
> >> +    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
> >> +    int64_t i, n_sock_fds = g_strv_length(substrings);
> >> +    int *sock_fds = NULL;
> >> +
> >> +    if (n_sock_fds != n_expected) {
> >> +        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
> >> +                   n_expected, n_sock_fds);
> >> +        goto exit;
> >> +    }
> >> +
> >> +    sock_fds = g_new(int, n_sock_fds);
> >> +
> >> +    for (i = 0; i < n_sock_fds; i++) {
> >> +        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
> >> +        if (sock_fds[i] < 0) {
> >> +            g_free(sock_fds);
> >> +            sock_fds = NULL;
> >> +            goto exit;
> >> +        }
> >> +    }
> >> +
> >> +exit:
> >> +    g_strfreev(substrings);
> >> +    return sock_fds;
> >> +}
> >> +#endif
> >> +
> >> +/*
> >> + * The exported init function.
> >> + *
> >> + * ... -net af-xdp,ifname="..."
> >
> > This is the legacy command line, let's say -netdev af-xdp,...
>
> Sure.
>
> >
> >> + */
> >> +int net_init_af_xdp(const Netdev *netdev,
> >> +                    const char *name, NetClientState *peer, Error **errp)
> >> +{
> >> +    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
> >> +    NetClientState *nc, *nc0 = NULL;
> >> +    unsigned int ifindex;
> >> +    uint32_t prog_id = 0;
> >> +    int *sock_fds = NULL;
> >> +    int xsks_map_fd = -1;
> >> +    int64_t i, queues;
> >> +    Error *err = NULL;
> >> +    AFXDPState *s;
> >> +
> >> +    ifindex = if_nametoindex(opts->ifname);
> >> +    if (!ifindex) {
> >> +        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
> >> +                         opts->ifname);
> >> +        return -1;
> >> +    }
> >> +
> >> +    queues = opts->has_queues ? opts->queues : 1;
> >> +    if (queues < 1) {
> >> +        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
> >> +                   queues, opts->ifname);
> >> +        return -1;
> >> +    }
> >> +
> >> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
> >> +        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
> >> +        return -1;
> >> +    }
> >> +#else
> >> +    if ((opts->has_inhibit && opts->inhibit)
> >> +        != (opts->xsks_map_fd || opts->sock_fds)) {
> >> +        error_setg(errp, "'inhibit=on' should be used together with "
> >> +                         "'sock-fds' or 'xsks-map-fd'");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (opts->xsks_map_fd && opts->sock_fds) {
> >> +        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually exclusive");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (opts->sock_fds) {
> >> +        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
> >> +        if (!sock_fds) {
> >> +            return -1;
> >> +        }
> >> +    }
> >> +#endif
> >> +
> >> +    if (opts->xsks_map_fd) {
> >> +        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
> >> +        if (xsks_map_fd < 0) {
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    for (i = 0; i < queues; i++) {
> >> +        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
> >> +        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
> >> +        nc->queue_index = i;
> >> +
> >> +        if (!nc0) {
> >> +            nc0 = nc;
> >> +        }
> >> +
> >> +        s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
> >> +        s->ifindex = ifindex;
> >> +        s->n_queues = queues;
> >> +
> >> +        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
> >> +            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
> >> +            /* Make sure the XDP program will be removed. */
> >> +            s->n_queues = i;
> >> +            error_propagate(errp, err);
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    if (nc0) {
> >> +        s = DO_UPCAST(AFXDPState, nc, nc0);
> >> +        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
> >> +            error_setg_errno(errp, errno,
> >> +                             "no XDP program loaded on '%s', ifindex: %d",
> >> +                             s->ifname, s->ifindex);
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
> >> +
> >> +    return 0;
> >> +
> >> +err:
> >> +    g_free(sock_fds);
> >> +    if (nc0) {
> >> +        qemu_del_net_client(nc0);
> >> +    }
> >> +
> >> +    return -1;
> >> +}
> >> diff --git a/net/clients.h b/net/clients.h
> >> index ed8bdfff1e..be53794582 100644
> >> --- a/net/clients.h
> >> +++ b/net/clients.h
> >> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name,
> >>                      NetClientState *peer, Error **errp);
> >>  #endif
> >>
> >> +#ifdef CONFIG_AF_XDP
> >> +int net_init_af_xdp(const Netdev *netdev, const char *name,
> >> +                    NetClientState *peer, Error **errp);
> >> +#endif
> >> +
> >>  int net_init_vhost_user(const Netdev *netdev, const char *name,
> >>                          NetClientState *peer, Error **errp);
> >>
> >> diff --git a/net/meson.build b/net/meson.build
> >> index bdf564a57b..61628d4684 100644
> >> --- a/net/meson.build
> >> +++ b/net/meson.build
> >> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
> >>  if have_netmap
> >>    system_ss.add(files('netmap.c'))
> >>  endif
> >> +
> >> +system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
> >> +
> >>  if have_vhost_net_user
> >>    system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
> >>    system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
> >> diff --git a/net/net.c b/net/net.c
> >> index 6492ad530e..127f70932b 100644
> >> --- a/net/net.c
> >> +++ b/net/net.c
> >> @@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
> >>  #ifdef CONFIG_NETMAP
> >>          [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
> >> +#endif
> >>  #ifdef CONFIG_NET_BRIDGE
> >>          [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
> >>  #endif
> >> @@ -1186,6 +1189,9 @@ void show_netdevs(void)
> >>  #ifdef CONFIG_NETMAP
> >>          "netmap",
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +        "af-xdp",
> >> +#endif
> >>  #ifdef CONFIG_POSIX
> >>          "vhost-user",
> >>  #endif
> >> diff --git a/qapi/net.json b/qapi/net.json
> >> index db67501308..88f2c982c2 100644
> >> --- a/qapi/net.json
> >> +++ b/qapi/net.json
> >> @@ -408,6 +408,62 @@
> >>      'ifname':     'str',
> >>      '*devname':    'str' } }
> >>
> >> +##
> >> +# @AFXDPMode:
> >> +#
> >> +# Attach mode for a default XDP program
> >> +#
> >> +# @skb: generic mode, no driver support necessary
> >> +#
> >> +# @native: DRV mode, program is attached to a driver, packets are passed to
> >> +#     the socket without allocation of skb.
> >> +#
> >> +# Since: 8.1
> >
> > I'd make it for 8.2.
>
> OK.
>
> >
> >> +##
> >> +{ 'enum': 'AFXDPMode',
> >> +  'data': [ 'native', 'skb' ] }
> >> +
> >> +##
> >> +# @NetdevAFXDPOptions:
> >> +#
> >> +# AF_XDP network backend
> >> +#
> >> +# @ifname: The name of an existing network interface.
> >> +#
> >> +# @mode: Attach mode for a default XDP program.  If not specified, then
> >> +#     'native' will be tried first, then 'skb'.
> >> +#
> >> +# @force-copy: Force XDP copy mode even if device supports zero-copy.
> >> +#     (default: false)
> >> +#
> >> +# @queues: number of queues to be used for multiqueue interfaces (default: 1).
> >> +#
> >> +# @start-queue: Use @queues starting from this queue number (default: 0).
> >> +#
> >> +# @inhibit: Don't load a default XDP program, use one already loaded to
> >> +#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
> >> +#
> >> +# @sock-fds: A colon (:) separated list of file descriptors for already open
> >> +#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
> >> +#     These descriptors should already be added into XDP socket map for
> >> +#     corresponding queues.  Requires @inhibit.
> >> +#
> >> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in
> >> +#     the already loaded XDP program.  Requires @inhibit.
> >> +#
> >> +# Since: 8.1
> >> +##
> >> +{ 'struct': 'NetdevAFXDPOptions',
> >> +  'data': {
> >> +    'ifname':       'str',
> >> +    '*mode':        'AFXDPMode',
> >> +    '*force-copy':  'bool',
> >> +    '*queues':      'int',
> >> +    '*start-queue': 'int',
> >> +    '*inhibit':     'bool',
> >> +    '*sock-fds':    { 'type': 'str', 'if': 'HAVE_XSK_UMEM__CREATE_WITH_FD' },
>
> The parameter is defined conditionally here and it will not be
> compiled in if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined.

Right, I missed that.

>
> >> +    '*xsks-map-fd': 'str' } }
> >> +
> >>  ##
> >>  # @NetdevVhostUserOptions:
> >>  #
> >> @@ -642,13 +698,14 @@
> >>  # @vmnet-bridged: since 7.1
> >>  # @stream: since 7.2
> >>  # @dgram: since 7.2
> >> +# @af-xdp: since 8.1
> >>  #
> >>  # Since: 2.7
> >>  ##
> >>  { 'enum': 'NetClientDriver',
> >>    'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
> >>              'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
> >> -            'vhost-vdpa',
> >> +            'vhost-vdpa', 'af-xdp',
> >>              { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
> >>              { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
> >>              { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
> >> @@ -680,6 +737,7 @@
> >>      'bridge':   'NetdevBridgeOptions',
> >>      'hubport':  'NetdevHubPortOptions',
> >>      'netmap':   'NetdevNetmapOptions',
> >> +    'af-xdp':   'NetdevAFXDPOptions',
> >>      'vhost-user': 'NetdevVhostUserOptions',
> >>      'vhost-vdpa': 'NetdevVhostVDPAOptions',
> >>      'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
> >> diff --git a/qemu-options.hx b/qemu-options.hx
> >> index b57489d7ca..d91610701c 100644
> >> --- a/qemu-options.hx
> >> +++ b/qemu-options.hx
> >> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
> >>      "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
> >>      "                netmap device, defaults to '/dev/netmap')\n"
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
> >> +    "         [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +    "         [,sock-fds=x:y:...:z]\n"
> >> +#endif
> >> +    "                attach to the existing network interface 'name' with AF_XDP socket\n"
> >> +    "                use 'mode=MODE' to specify an XDP program attach mode\n"
> >> +    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
> >> +    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
> >> +    "                with inhibit=on,\n"
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +    "                  use 'sock-fds' to provide file descriptors for already open XDP sockets\n"
> >> +    "                  added to a socket map in XDP program.  One socket per queue.  Or\n"
> >> +#endif
> >> +    "                  use 'xsks-map-fd=k' to provide a file descriptor for xsks map\n"
> >> +    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
> >> +    "                use 'start-queue=m' to specify the first queue that should be used\n"
> >> +#endif
> >>  #ifdef CONFIG_POSIX
> >>      "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
> >>      "                configure a vhost-user network, backed by a chardev 'dev'\n"
> >> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
> >>  #ifdef CONFIG_NETMAP
> >>      "netmap|"
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +    "af-xdp|"
> >> +#endif
> >>  #ifdef CONFIG_POSIX
> >>      "vhost-user|"
> >>  #endif
> >> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
> >>  #ifdef CONFIG_NETMAP
> >>      "netmap|"
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +    "af-xdp|"
> >> +#endif
> >>  #ifdef CONFIG_VMNET
> >>      "vmnet-host|vmnet-shared|vmnet-bridged|"
> >>  #endif
> >> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
> >>      "                old way to initialize a host network interface\n"
> >>      "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
> >>  SRST
> >> -``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
> >> +``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
> >>      This option is a shortcut for configuring both the on-board
> >>      (default) guest NIC hardware and the host network backend in one go.
> >>      The host backend options are the same as with the corresponding
> >> @@ -3350,6 +3375,62 @@ SRST
> >>          # launch QEMU instance
> >>          |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
> >>
> >> +``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
> >> +    Configure AF_XDP backend to connect to a network interface 'name'
> >> +    using AF_XDP socket.  A specific program attach mode for a default
> >> +    XDP program can be forced with 'mode'; the default is best-effort,
> >> +    where the likely most performant mode will be used.  The number of
> >> +    queues 'n' should generally match the number of queues in the
> >> +    interface and defaults to 1.  Traffic arriving on non-configured
> >> +    device queues will not be delivered to the network backend.
> >> +
> >> +    .. parsed-literal::
> >> +
> >> +        # set number of queues to 4
> >> +        ethtool -L eth0 combined 4
> >> +        # launch QEMU instance
> >> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> >> +            -netdev af-xdp,id=n1,ifname=eth0,queues=4
> >> +
> >> +    'start-queue' option can be specified if a particular range of queues
> >> +    [m, m + n] should be in use.  For example, this is necessary in order
> >> +    to use MLX NICs in native mode.  The driver will create a separate set
> >> +    of queues on top of regular ones, and only these queues can be used
> >> +    for AF_XDP sockets.  MLX NICs will also require an additional traffic
> >> +    redirection with ethtool to these queues.
> >
> > Let's avoid mentioning any vendor name unless the emulation is done
> > for a specific one.
>
> AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this
> strange fashion, so these are the only NICs that will not work without
> start-queue configuration and the extra traffic re-direction.  Unfortunately,
> this is also not documented anywhere, so you just have to know that it works
> this way...
>
> So, on one hand, I understand that it's not the place for vendor-specific
> docs.  On the other, it gives qemu users a chance to not waste a lot of time
> trying to figure out why everything is configured correctly, but the traffic
> doesn't flow.
>
> Maybe something like this:
>
>     'start-queue' option can be specified if a particular range of queues
>     [m, m + n] should be in use.  For example, this may be necessary in
>     order to use certain NICs in native mode.  The kernel allows the driver to
>     create a separate set of XDP queues on top of regular ones, and only
>     these queues can be used for AF_XDP sockets.  NICs that work this way
>     may also require an additional traffic redirection with ethtool to these
>     special queues.
>
>     .. parsed-literal::
>
>         # set number of queues to 1
>         ethtool -L eth0 combined 1
>         # redirect all the traffic to the second queue (id: 1)
>         # note: drivers may require non-empty key/mask pair.
>         ethtool -N eth0 flow-type ether \\
>             dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>         ethtool -N eth0 flow-type ether \\
>             dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>         # launch QEMU instance
>         |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>             -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>
> What do you think?

This is great.

>
> >
> >> + E.g.:
> >> +
> >> +    .. parsed-literal::
> >> +
> >> +        # set number of queues to 1
> >> +        ethtool -L eth0 combined 1
> >> +        # redirect all the traffic to the second queue (id: 1)
> >> +        # note: mlx5 driver requires non-empty key/mask pair.
> >> +        ethtool -N eth0 flow-type ether \\
> >> +            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
> >> +        ethtool -N eth0 flow-type ether \\
> >> +            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
> >> +        # launch QEMU instance
> >> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> >> +            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
> >> +
> >> +    XDP program can also be loaded externally.  In this case 'inhibit' option
> >> +    should be set to 'on' and 'xsks-map-fd' provided with a file descriptor
> >> +    for an open XDP socket map of that program, or 'sock-fds' with file
> >> +    descriptors for already open but not bound XDP sockets already added to a
> >> +    socket map for corresponding queues.  One socket per queue.
> >> +
> >> +    .. parsed-literal::
> >> +
> >> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> >> +            -netdev af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
> >> +
> >> +    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
> >> +    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK capability.
> >> +    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
> >> +    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS/NET_ADMIN should be
> >> +    added as well.
> >
> > This requires synchronization with the code changes, so I suggest not
> > describing:
> >
> > 1) the actual memory requirement numbers
> > 2) the capabilities required for each mode; we don't do that for other
> > netdevs like tap.
>
> Makes sense.  So, should we just strike the paragraph above entirely?

I think so.

Thanks

>
> >
> > Thanks
> >
> >
> >
> >
> >> +
> >> +
> >>  ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
> >>      Establish a vhost-user netdev, backed by a chardev id. The chardev
> >>      should be a unix domain socket backed one. The vhost-user uses a
> >> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
> >> index d02b09a4b9..7585c4c4ed 100755
> >> --- a/scripts/ci/org.centos/stream/8/x86_64/configure
> >> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure
> >> @@ -35,6 +35,7 @@
> >>  --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
> >>  --with-coroutine=ucontext \
> >>  --tls-priority=@QEMU,SYSTEM \
> >> +--disable-af-xdp \
> >>  --disable-attr \
> >>  --disable-auth-pam \
> >>  --disable-avx2 \
> >> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> >> index 7dd5709ef4..162e455309 100644
> >> --- a/scripts/meson-buildoptions.sh
> >> +++ b/scripts/meson-buildoptions.sh
> >> @@ -74,6 +74,7 @@ meson_options_help() {
> >>    printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
> >>    printf "%s\n" '(unless built with --without-default-features):'
> >>    printf "%s\n" ''
> >> +  printf "%s\n" '  af-xdp          AF_XDP network backend support'
> >>    printf "%s\n" '  alsa            ALSA sound support'
> >>    printf "%s\n" '  attr            attr/xattr support'
> >>    printf "%s\n" '  auth-pam        PAM access control'
> >> @@ -207,6 +208,8 @@ meson_options_help() {
> >>  }
> >>  _meson_option_parse() {
> >>    case $1 in
> >> +    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
> >> +    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
> >>      --enable-alsa) printf "%s" -Dalsa=enabled ;;
> >>      --disable-alsa) printf "%s" -Dalsa=disabled ;;
> >>      --enable-attr) printf "%s" -Dattr=enabled ;;
> >> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
> >> index e39871c7bb..207f7adfb9 100644
> >> --- a/tests/docker/dockerfiles/debian-amd64.docker
> >> +++ b/tests/docker/dockerfiles/debian-amd64.docker
> >> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
> >>                        libvirglrenderer-dev \
> >>                        libvte-2.91-dev \
> >>                        libxen-dev \
> >> +                      libxdp-dev \
> >>                        libzstd-dev \
> >>                        llvm \
> >>                        locales \
> >> --
> >> 2.40.1
> >>
> >
>

