Just a few comments inline. Best regards, Ilya Maximets.
On 19.06.2019 22:51, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology. It aims to have comparable
> performance to DPDK but cooperate better with existing kernel's networking
> stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, by-passing a couple of Linux kernel's subsystems.
> As a result, AF_XDP socket shows much better performance than AF_PACKET.
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this feature is
> not compiled in.
>
> Signed-off-by: William Tu <u9012...@gmail.com>
> ---
> v1->v2:
> - add a list to maintain unused umem elements
> - remove copy from rx umem to ovs internal buffer
> - use hugetlb to reduce misses (not much difference)
> - use pmd mode netdev in OVS (huge performance improvement)
> - remove malloc dp_packet, instead put dp_packet in umem
>
> v2->v3:
> - rebase on the OVS master, 7ab4b0653784
>   ("configure: Check for more specific function to pull in pthread library.")
> - remove the dependency on libbpf and dpif-bpf.
>   instead, use the built-in XDP_ATTACH feature.
> - data structure optimizations for better performance, see[1]
> - more test cases support
> v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
>
> v3->v4:
> - Use AF_XDP API provided by libbpf
> - Remove the dependency on XDP_ATTACH kernel patch set
> - Add documentation, bpf.rst
>
> v4->v5:
> - rebase to master
> - remove rfc, squash all into a single patch
> - add --enable-afxdp, so by default, AF_XDP is not compiled
> - add options: xdpmode=drv,skb
> - add multiple queue and multiple PMD support, with options: n_rxq
> - improve documentation, rename bpf.rst to af_xdp.rst
>
> v5->v6
> - rebase to master, commit 0cdd5b13de91b98
> - address errors from sparse and clang
> - pass travis-ci test
> - address feedback from Ben
> - fix issues reported by 0-day robot
> - improved documentation
>
> v6-v7
> - rebase to master, commit abf11558c1515bf3b1
> - address feedback from Ilya, Ben, and Eelco, see:
>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> - add XDP mode change, implement get/set_config, reconfigure
> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> - perf optimization for batching umem_push/pop
> - perf optimization for batching kick_tx
> - test build with dpdk
> - fix/refactor atomic operation
> - make AF_XDP x86 specific, otherwise fail at build time
> - lots of code refactoring
> - add PVP setup in documentation
>
> v7-v8:
> - Address feedback from Ilya at:
>   https://patchwork.ozlabs.org/patch/1095019/
> - add netdev-linux-private.h
> - fix afxdp reconfigure issue
> - sort include headers
> - remove unnecessary OVS_UNUSED
> - coding style fixes
> - error case handling and memory leak
>
> v8-v9:
> - rebase to master 180bbbed3a3867d52
> - Address review feedback from Ben, Ilya and Eelco, at:
>   https://patchwork.ozlabs.org/patch/1097740/
> - == From Ilya ==
> - Optimize the reconfiguration logic
> - Implement .rxq_recv and .send for afxdp
> - 
Remove system-afxdp-traffic.at, reuse existing code
> - Use Ilya's rdtsc code
> - remove --disable-system
> - == From Eelco ==
> - Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
>   assertion !fd != !wevent failed
> - Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
> - Clear xdp program when receive signal, ctrl+c
> - Add options to vswitch.xml, set xdpmode default to skb-mode
> - No support for ARM and PPC, now x86_64 only
> - remove redundant header includes and function/macro definitions
> - remove some ifdef HAVE_AF_XDP
> - == From others/both about afxdp rx and tx ==
> - Several umem push/pop error handling improvement/fixes
> - add lock to address concurrent_txq case
> - improve error handling
> - add stats
> - Things that are not done yet
> - MTU limitation
> - n_txq_desc/n_rxq_desc option.
>
> v9-v10
> - remove x86_64 limitation, suggested by Ben and Eelco
> - add xmalloc_pagealign, free_pagealign
> - minor refactor
>
> v10-v11
> - address feedback from Ilya at
>   https://patchwork.ozlabs.org/patch/1106495/
> - fix typos, and some refactoring
> - refactor existing code and introduce xmalloc pagealign
> - fix a couple of error handling cases
> - allocate per-txq lock
> - dynamic allocate xsk array
> - fix cycles_counter_update() for non-x86/non-linux case
>
> v11-v12
> - mainly address a couple of crashes reported by Eelco
>   https://patchwork.ozlabs.org/patch/1110729/
> - fix cleanup xdp program problem when ovs-vswitchd restarts
> - following cases should remove xdp program
> - kill `pidof ovs-vswitchd`
> - ovs-appctl -t ovs-vswitchd exit --cleanup
> - note: ovs-ctl restart does not have "--cleanup" so still an issue
> - work around issues of xsk_ring_cons__peek at libbpf, reported at
>   https://marc.info/?l=xdp-newbies&m=156055471727857&w=2
> - variable name refactoring
> - there is some performance degradation, but let's make sure
>   everything works first
>
> v12-v13
> - rebase to master
> - add coverage counter 
afxdp_cq_emtpy, afxdp_fq_full > - minor refactoring > --- > Documentation/automake.mk | 1 + > Documentation/index.rst | 1 + > Documentation/intro/install/afxdp.rst | 425 ++++++++++++++++ > Documentation/intro/install/index.rst | 1 + > acinclude.m4 | 35 ++ > configure.ac | 1 + > lib/automake.mk | 14 + > lib/dp-packet.c | 28 ++ > lib/dp-packet.h | 18 +- > lib/dpif-netdev-perf.h | 26 + > lib/netdev-afxdp.c | 891 > ++++++++++++++++++++++++++++++++++ > lib/netdev-afxdp.h | 74 +++ > lib/netdev-linux-private.h | 138 ++++++ > lib/netdev-linux.c | 121 ++--- > lib/netdev-provider.h | 3 + > lib/netdev.c | 11 + > lib/spinlock.h | 70 +++ > lib/util.c | 92 +++- > lib/util.h | 5 + > lib/xdpsock.c | 170 +++++++ > lib/xdpsock.h | 101 ++++ > tests/automake.mk | 16 + > tests/system-afxdp-macros.at | 20 + > tests/system-afxdp-testsuite.at | 26 + > vswitchd/vswitch.xml | 30 ++ > 25 files changed, 2210 insertions(+), 108 deletions(-) > create mode 100644 Documentation/intro/install/afxdp.rst > create mode 100644 lib/netdev-afxdp.c > create mode 100644 lib/netdev-afxdp.h > create mode 100644 lib/netdev-linux-private.h > create mode 100644 lib/spinlock.h > create mode 100644 lib/xdpsock.c > create mode 100644 lib/xdpsock.h > create mode 100644 tests/system-afxdp-macros.at > create mode 100644 tests/system-afxdp-testsuite.at > > diff --git a/Documentation/automake.mk b/Documentation/automake.mk > index 082438e09a33..11cc59efc881 100644 > --- a/Documentation/automake.mk > +++ b/Documentation/automake.mk > @@ -10,6 +10,7 @@ DOC_SOURCE = \ > Documentation/intro/why-ovs.rst \ > Documentation/intro/install/index.rst \ > Documentation/intro/install/bash-completion.rst \ > + Documentation/intro/install/afxdp.rst \ > Documentation/intro/install/debian.rst \ > Documentation/intro/install/documentation.rst \ > Documentation/intro/install/distributions.rst \ > diff --git a/Documentation/index.rst b/Documentation/index.rst > index 46261235c732..aa9e7c49f179 100644 > --- a/Documentation/index.rst > +++ 
b/Documentation/index.rst > @@ -59,6 +59,7 @@ vSwitch? Start here. > :doc:`intro/install/windows` | > :doc:`intro/install/xenserver` | > :doc:`intro/install/dpdk` | > + :doc:`intro/install/afxdp` | > :doc:`Installation FAQs <faq/releases>` > > - **Tutorials:** :doc:`tutorials/faucet` | > diff --git a/Documentation/intro/install/afxdp.rst > b/Documentation/intro/install/afxdp.rst > new file mode 100644 > index 000000000000..291df8d45020 > --- /dev/null > +++ b/Documentation/intro/install/afxdp.rst > @@ -0,0 +1,425 @@ > +.. > + Licensed under the Apache License, Version 2.0 (the "License"); you may > + not use this file except in compliance with the License. You may obtain > + a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, software > + distributed under the License is distributed on an "AS IS" BASIS, > WITHOUT > + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See > the > + License for the specific language governing permissions and limitations > + under the License. > + > + Convention for heading levels in Open vSwitch documentation: > + > + ======= Heading 0 (reserved for the title in a document) > + ------- Heading 1 > + ~~~~~~~ Heading 2 > + +++++++ Heading 3 > + ''''''' Heading 4 > + > + Avoid deeper levels because they do not render well. > + > + > +======================== > +Open vSwitch with AF_XDP > +======================== > + > +This document describes how to build and install Open vSwitch using > +AF_XDP netdev. > + > +.. warning:: > + The AF_XDP support of Open vSwitch is considered 'experimental', > + and it is not compiled in by default. > + > + > +Introduction > +------------ > +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type > +built upon the eBPF and XDP technology. It is aims to have comparable > +performance to DPDK but cooperate better with existing kernel's networking > +stack. 
An AF_XDP socket receives and sends packets from an eBPF/XDP program > +attached to the netdev, by-passing a couple of Linux kernel's subsystems. > +As a result, AF_XDP socket shows much better performance than AF_PACKET. > +For more details about AF_XDP, please see linux kernel's > +Documentation/networking/af_xdp.rst > + > + > +AF_XDP Netdev > +------------- > +OVS has a couple of netdev types, i.e., system, tap, or > +dpdk. The AF_XDP feature adds a new netdev types called > +"afxdp", and implement its configuration, packet reception, > +and transmit functions. Since the AF_XDP socket, called xsk, > +operates in userspace, once ovs-vswitchd receives packets > +from xsk, the afxdp netdev re-uses the existing userspace > +dpif-netdev datapath. As a result, most of the packet processing > +happens at the userspace instead of linux kernel. > + > +:: > + > + | +-------------------+ > + | | ovs-vswitchd |<-->ovsdb-server > + | +-------------------+ > + | | ofproto |<-->OpenFlow controllers > + | +--------+-+--------+ > + | | netdev | |ofproto-| > + userspace | +--------+ | dpif | > + | | afxdp | +--------+ > + | | netdev | | dpif | > + | +---||---+ +--------+ > + | || | dpif- | > + | || | netdev | > + |_ || +--------+ > + || > + _ +---||-----+--------+ > + | | AF_XDP prog + | > + kernel | | xsk_map | > + |_ +--------||---------+ > + || > + physical > + NIC > + > + > +Build requirements > +------------------ > + > +In addition to the requirements described in :doc:`general`, building Open > +vSwitch with AF_XDP will require the following: > + > +- libbpf from kernel source tree (kernel 5.0.0 or later) > + > +- Linux kernel XDP support, with the following options (required) > + > + * CONFIG_BPF=y > + > + * CONFIG_BPF_SYSCALL=y > + > + * CONFIG_XDP_SOCKETS=y > + > + > +- The following optional Kconfig options are also recommended, but not > + required: > + > + * CONFIG_BPF_JIT=y (Performance) > + > + * CONFIG_HAVE_BPF_JIT=y (Performance) > + > + * 
CONFIG_XDP_SOCKETS_DIAG=y (Debugging) > + > +- Once your AF_XDP-enabled kernel is ready, if possible, run > + **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf. > + This is an OVS independent benchmark tools for AF_XDP. > + It makes sure your basic kernel requirements are met for AF_XDP. > + > + > +Installing > +---------- > +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support. > +First, clone a recent version of Linux bpf-next tree:: > + > + git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git > + > +Second, go into the Linux source directory and build libbpf in the tools > +directory:: > + > + cd bpf-next/ > + cd tools/lib/bpf/ > + make && make install > + make install_headers > + > +.. note:: > + Make sure xsk.h and bpf.h are installed in system's library path, > + e.g. /usr/local/include/bpf/ or /usr/include/bpf/ > + > +Make sure the libbpf.so is installed correctly:: > + > + ldconfig > + ldconfig -p | grep libbpf > + > +Third, ensure the standard OVS requirements are installed and > +bootstrap/configure the package:: > + > + ./boot.sh && ./configure --enable-afxdp > + > +Finally, build and install OVS:: > + > + make && make install > + > +To kick start end-to-end autotesting:: > + > + uname -a # make sure having 5.0+ kernel > + make check-afxdp TESTSUITEFLAGS='1' > + > +If a test case fails, check the log at:: > + > + cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log > + > + > +Setup AF_XDP netdev > +------------------- > +Before running OVS with AF_XDP, make sure the libbpf and libelf are > +set-up right:: > + > + ldd vswitchd/ovs-vswitchd > + > +Open vSwitch should be started using userspace datapath as described > +in :doc:`general`:: > + > + ovs-vswitchd ... 
> + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev > + > +Make sure your device driver support AF_XDP, and to use 1 PMD (on core 4) > +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask, > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb":: > + > + ethtool -L enp2s0 combined 1 > + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10 > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ > + options:n_rxq=1 options:xdpmode=drv \ > + other_config:pmd-rxq-affinity="0:4" > + > +Or, use 4 pmds/cores and 4 queues by doing:: > + > + ethtool -L enp2s0 combined 4 > + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36 > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ > + options:n_rxq=4 options:xdpmode=drv \ > + other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4" > + > +.. note:: > + pmd-rxq-affinity is optional. If not specified, system will auto-assign. > + > +To validate that the bridge has successfully instantiated, you can use the:: > + > + ovs-vsctl show > + > +Should show something like:: > + > + Port "ens802f0" > + Interface "ens802f0" > + type: afxdp > + options: {n_rxq="1", xdpmode=drv} > + > +Otherwise, enable debugging by:: > + > + ovs-appctl vlog/set netdev_afxdp::dbg > + > + > +References > +---------- > +Most of the design details are described in the paper presented at > +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1], > +section 4, and slides[2][4]. > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction > +about AF_XDP current and future work. 
> + > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf > + > +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf > + > +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf > + > +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp > + > + > +Performance Tuning > +------------------ > +The name of the game is to keep your CPU running in userspace, allowing PMD > +to keep polling the AF_XDP queues without any interferences from kernel. > + > +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd > + running cores, device plug-in slot) > + > +#. Isolate your CPU by doing isolcpu at grub configure. > + > +#. IRQ should not set to pmd running core. > + > +#. The Spectre and Meltdown fixes increase the overhead of system calls. > + > + > +Debugging performance issue > +~~~~~~~~~~~~~~~~~~~~~~~~~~~ > +While running the traffic, use linux perf tool to see where your cpu > +spends its cycle:: > + > + cd bpf-next/tools/perf > + make > + ./perf record -p `pidof ovs-vswitchd` sleep 10 > + ./perf report > + > +Measure your system call rate by doing:: > + > + pstree -p `pidof ovs-vswitchd` > + strace -c -p <your pmd's PID> > + > +Or, use OVS pmd tool:: > + > + ovs-appctl dpif-netdev/pmd-stats-show > + > + > +Example Script > +-------------- > + > +Below is a script using namespaces and veth peer:: > + > + #!/bin/bash > + ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \ > + --disable-system --detach \ > + ovs-vsctl -- add-br br0 -- set Bridge br0 \ > + protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \ > + fail-mode=secure datapath_type=netdev > + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev > + > + ip netns add at_ns0 > + ovs-appctl vlog/set netdev_afxdp::dbg > + > + ip link add p0 type veth peer name afxdp-p0 > + ip link set p0 netns at_ns0 > + ip link set dev afxdp-p0 up > + ovs-vsctl add-port br0 afxdp-p0 -- \ > + 
set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp" > + > + ip netns exec at_ns0 sh << NS_EXEC_HEREDOC > + ip addr add "10.1.1.1/24" dev p0 > + ip link set dev p0 up > + NS_EXEC_HEREDOC > + > + ip netns add at_ns1 > + ip link add p1 type veth peer name afxdp-p1 > + ip link set p1 netns at_ns1 > + ip link set dev afxdp-p1 up > + > + ovs-vsctl add-port br0 afxdp-p1 -- \ > + set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp" > + ip netns exec at_ns1 sh << NS_EXEC_HEREDOC > + ip addr add "10.1.1.2/24" dev p1 > + ip link set dev p1 up > + NS_EXEC_HEREDOC > + > + ip netns exec at_ns0 ping -i .2 10.1.1.2 > + > + > +Limitations/Known Issues > +------------------------ > +#. Device's numa ID is always 0, need a way to find numa id from a netdev. > +#. No QoS support because AF_XDP netdev by-pass the Linux TC layer. A > possible > + work-around is to use OpenFlow meter action. > +#. AF_XDP device added to bridge, remove, and added again will fail. > +#. Most of the tests are done using i40e single port. Multiple ports and > + also ixgbe driver also needs to be tested. > +#. No latency test result (TODO items) > + > + > +PVP using tap device > +-------------------- > +Assume you have enp2s0 as physical nic, and a tap device connected to VM. > +First, start OVS, then add physical port:: > + > + ethtool -L enp2s0 combined 1 > + ovs-vsctl set Open_vSwitch . 
other_config:pmd-cpu-mask=0x10 > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \ > + options:n_rxq=1 options:xdpmode=drv \ > + other_config:pmd-rxq-affinity="0:4" > + > +Start a VM with virtio and tap device:: > + > + qemu-system-x86_64 -hda ubuntu1810.qcow \ > + -m 4096 \ > + -cpu host,+x2apic -enable-kvm \ > + -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\ > + vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \ > + -netdev type=tap,id=net0,vhost=on,queues=8 \ > + -object memory-backend-file,id=mem,size=4096M,\ > + mem-path=/dev/hugepages,share=on \ > + -numa node,memdev=mem -mem-prealloc -smp 2 > + > +Create OpenFlow rules:: > + > + ovs-vsctl add-port br0 tap0 -- set interface tap0 > + ovs-ofctl del-flows br0 > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0" > + ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0" > + > +Inside the VM, use xdp_rxq_info to bounce back the traffic:: > + > + ./xdp_rxq_info --dev ens3 --action XDP_TX > + > + > +PVP using vhostuser device > +-------------------------- > +First, build OVS with DPDK and AFXDP:: > + > + ./configure --enable-afxdp --with-dpdk=<dpdk path> > + make -j4 && make install > + > +Create a vhost-user port from OVS:: > + > + ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-init=true > + ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \ > + other_config:pmd-cpu-mask=0xfff > + ovs-vsctl add-port br0 vhost-user-1 \ > + -- set Interface vhost-user-1 type=dpdkvhostuser > + > +Start VM using vhost-user mode:: > + > + qemu-system-x86_64 -hda ubuntu1810.qcow \ > + -m 4096 \ > + -cpu host,+x2apic -enable-kvm \ > + -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 > \ > + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \ > + -device virtio-net-pci,mac=00:00:00:00:00:01,\ > + netdev=mynet1,mq=on,vectors=10 \ > + -object memory-backend-file,id=mem,size=4096M,\ > + mem-path=/dev/hugepages,share=on \ > + -numa node,memdev=mem -mem-prealloc -smp 2 > + > +Setup the OpenFlow rules:: > + > + ovs-ofctl del-flows br0 > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1" > + ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0" > + > +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic:: > + > + ./xdp_rxq_info --dev ens3 --action XDP_DROP > + ./xdp_rxq_info --dev ens3 --action XDP_TX > + > + > +PCP container using veth > +------------------------ > +Create namespace and veth peer devices:: > + > + ip netns add at_ns0 > + ip link add p0 type veth peer name afxdp-p0 > + ip link set p0 netns at_ns0 > + ip link set dev afxdp-p0 up > + ip netns exec at_ns0 ip link set dev p0 up > + > +Attach the veth port to br0 (linux kernel mode):: > + > + ovs-vsctl add-port br0 afxdp-p0 -- \ > + set interface afxdp-p0 options:n_rxq=1 > + > +Or, use AF_XDP with skb mode:: > + > + ovs-vsctl add-port br0 afxdp-p0 -- \ > + set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb > + > +Setup the OpenFlow rules:: > + > + ovs-ofctl del-flows br0 > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0" > + ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0" > + > +In the namespace, drop or bounce back the packets:: 
> + > + ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP > + ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX > + > + > +Bug Reporting > +------------- > + > +Please report problems to d...@openvswitch.org. > diff --git a/Documentation/intro/install/index.rst > b/Documentation/intro/install/index.rst > index 3193c736cf17..c27a9c9d16ff 100644 > --- a/Documentation/intro/install/index.rst > +++ b/Documentation/intro/install/index.rst > @@ -45,6 +45,7 @@ Installation from Source > xenserver > userspace > dpdk > + afxdp > > Installation from Packages > -------------------------- > diff --git a/acinclude.m4 b/acinclude.m4 > index 321a741985db..bb03b504a2a8 100644 > --- a/acinclude.m4 > +++ b/acinclude.m4 > @@ -238,6 +238,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [ > ]) > ]) > > +dnl OVS_CHECK_LINUX_AF_XDP > +dnl > +dnl Check both Linux kernel AF_XDP and libbpf support > +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [ > + AC_ARG_ENABLE([afxdp], > + [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])], > + [], [enable_afxdp=no]) > + AC_MSG_CHECKING([whether AF_XDP is enabled]) > + if test "$enable_afxdp" != yes; then > + AC_MSG_RESULT([no]) > + AF_XDP_ENABLE=false > + else > + AC_MSG_RESULT([yes]) > + AF_XDP_ENABLE=true > + > + AC_CHECK_HEADER([bpf/libbpf.h], [], > + [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])]) > + > + AC_CHECK_HEADER([linux/if_xdp.h], [], > + [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])]) > + > + AC_CHECK_HEADER([bpf/xsk.h], [], > + [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])]) > + > + AC_CHECK_HEADER([bpf/libbpf_util.h], [], > + [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])]) > + > + AC_DEFINE([HAVE_AF_XDP], [1], > + [Define to 1 if AF_XDP support is available and enabled.]) > + LIBBPF_LDADD=" -lbpf -lelf" > + AC_SUBST([LIBBPF_LDADD]) > + fi > + AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true) > +]) > + > dnl OVS_CHECK_DPDK > dnl > dnl 
Configure DPDK source tree > diff --git a/configure.ac b/configure.ac > index a9f0a06dc140..36ad246203db 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -98,6 +98,7 @@ OVS_CHECK_SPHINX > OVS_CHECK_DOT > OVS_CHECK_IF_DL > OVS_CHECK_STRTOK_R > +OVS_CHECK_LINUX_AF_XDP > AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]]) > AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec], > [], [], [[#include <sys/stat.h>]]) > diff --git a/lib/automake.mk b/lib/automake.mk > index 1b89cac8c3a2..9b75e47ba396 100644 > --- a/lib/automake.mk > +++ b/lib/automake.mk > @@ -14,6 +14,10 @@ if WIN32 > lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS} > endif > > +if HAVE_AF_XDP > +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD) > +endif > + > lib_libopenvswitch_la_LDFLAGS = \ > $(OVS_LTINFO) \ > -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \ > @@ -394,6 +398,7 @@ lib_libopenvswitch_la_SOURCES += \ > lib/if-notifier.h \ > lib/netdev-linux.c \ > lib/netdev-linux.h \ > + lib/netdev-linux-private.h \ > lib/netdev-offload-tc.c \ > lib/netlink-conntrack.c \ > lib/netlink-conntrack.h \ > @@ -410,6 +415,15 @@ lib_libopenvswitch_la_SOURCES += \ > lib/tc.h > endif > > +if HAVE_AF_XDP > +lib_libopenvswitch_la_SOURCES += \ > + lib/xdpsock.c \ > + lib/xdpsock.h \ > + lib/netdev-afxdp.c \ > + lib/netdev-afxdp.h \ > + lib/spinlock.h > +endif > + > if DPDK_NETDEV > lib_libopenvswitch_la_SOURCES += \ > lib/dpdk.c \ > diff --git a/lib/dp-packet.c b/lib/dp-packet.c > index 0976a35e758b..e6a7947076b4 100644 > --- a/lib/dp-packet.c > +++ b/lib/dp-packet.c > @@ -19,6 +19,7 @@ > #include <string.h> > > #include "dp-packet.h" > +#include "netdev-afxdp.h" > #include "netdev-dpdk.h" > #include "openvswitch/dynamic-string.h" > #include "util.h" > @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t > allocated) > dp_packet_use__(b, base, allocated, DPBUF_MALLOC); > } > > +#if HAVE_AF_XDP > +/* Initialize 'b' as an empty dp_packet that contains > + 
* memory starting at AF_XDP umem base. > + */ > +void > +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated) > +{ > + dp_packet_set_base(b, base); > + dp_packet_set_data(b, base); > + dp_packet_set_size(b, 0); > + > + dp_packet_set_allocated(b, allocated); > + b->source = DPBUF_AFXDP; > + dp_packet_reset_offsets(b); > + pkt_metadata_init(&b->md, 0); > + dp_packet_reset_cutlen(b); > + dp_packet_reset_offload(b); > + b->packet_type = htonl(PT_ETH); > +} > +#endif > + > /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes > of > * memory starting at 'base'. 'base' should point to a buffer on the stack. > * (Nothing actually relies on 'base' being allocated on the stack. It could > @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b) > * created as a dp_packet */ > free_dpdk_buf((struct dp_packet*) b); > #endif > + } else if (b->source == DPBUF_AFXDP) { > + free_afxdp_buf(b); > } > } > } > @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t > new_headroom, size_t new_tailroom > case DPBUF_STACK: > OVS_NOT_REACHED(); > > + case DPBUF_AFXDP: > + OVS_NOT_REACHED(); > + > case DPBUF_STUB: > b->source = DPBUF_MALLOC; > new_base = xmalloc(new_allocated); > @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b) > { > void *p; > ovs_assert(b->source != DPBUF_DPDK); > + ovs_assert(b->source != DPBUF_AFXDP); > > if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) > { > p = dp_packet_data(b); > diff --git a/lib/dp-packet.h b/lib/dp-packet.h > index a5e9ade1244a..e3438226e360 100644 > --- a/lib/dp-packet.h > +++ b/lib/dp-packet.h > @@ -25,6 +25,7 @@ > #include <rte_mbuf.h> > #endif > > +#include "netdev-afxdp.h" > #include "netdev-dpdk.h" > #include "openvswitch/list.h" > #include "packets.h" > @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source { > DPBUF_DPDK, /* buffer data is from DPDK allocated memory. > * ref to dp_packet_init_dpdk() in > dp-packet.c. 
> */ > + DPBUF_AFXDP, /* buffer data from XDP frame */ > }; > > #define DP_PACKET_CONTEXT_SIZE 64 > @@ -89,6 +91,13 @@ struct dp_packet { > }; > }; > > +#if HAVE_AF_XDP > +struct dp_packet_afxdp { > + struct umem_pool *mpool; > + struct dp_packet packet; > +}; > +#endif > + > static inline void *dp_packet_data(const struct dp_packet *); > static inline void dp_packet_set_data(struct dp_packet *, void *); > static inline void *dp_packet_base(const struct dp_packet *); > @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const > struct dp_packet *); > void dp_packet_use(struct dp_packet *, void *, size_t); > void dp_packet_use_stub(struct dp_packet *, void *, size_t); > void dp_packet_use_const(struct dp_packet *, const void *, size_t); > - > +#if HAVE_AF_XDP > +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t); > +#endif > void dp_packet_init_dpdk(struct dp_packet *); > > void dp_packet_init(struct dp_packet *, size_t); > @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b) > return; > } > > + if (b->source == DPBUF_AFXDP) { > + free_afxdp_buf(b); > + return; > + } > + > dp_packet_uninit(b); > free(b); > } > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h > index 859c05613ddf..6b6dfda7db1c 100644 > --- a/lib/dpif-netdev-perf.h > +++ b/lib/dpif-netdev-perf.h > @@ -21,6 +21,7 @@ > #include <stddef.h> > #include <stdint.h> > #include <string.h> > +#include <time.h> > #include <math.h> > > #ifdef DPDK_NETDEV > @@ -186,6 +187,24 @@ struct pmd_perf_stats { > char *log_reason; > }; > > +#ifdef __linux__ > +static inline uint64_t > +rdtsc_syscall(struct pmd_perf_stats *s) > +{ > + struct timespec val; > + uint64_t v; > + > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) { > + return s->last_tsc; > + } > + > + v = (uint64_t) val.tv_sec * 1000000000LL; > + v += (uint64_t) val.tv_nsec; > + > + return s->last_tsc = v; > +} > +#endif > + > /* Support for accurate timing of PMD execution on TSC clock cycle level. 
> * These functions are intended to be invoked in the context of pmd threads. > */ > > @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s) > { > #ifdef DPDK_NETDEV > return s->last_tsc = rte_get_tsc_cycles(); > +#elif !defined(_MSC_VER) && defined(__x86_64__) > + uint32_t h, l; > + asm volatile("rdtsc" : "=a" (l), "=d" (h)); > + > + return s->last_tsc = ((uint64_t) h << 32) | l; > +#elif defined(__linux__) > + return rdtsc_syscall(s); > #else > return s->last_tsc = 0; > #endif > diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c > new file mode 100644 > index 000000000000..33d8612153d5 > --- /dev/null > +++ b/lib/netdev-afxdp.c > @@ -0,0 +1,891 @@ > +/* > + * Copyright (c) 2018, 2019 Nicira, Inc. > + * > + * Licensed under the Apache License, Version 2.0 (the "License"); > + * you may not use this file except in compliance with the License. > + * You may obtain a copy of the License at: > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. 
> + */
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"
> +#include "netdev-afxdp.h"
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <stdlib.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#include "coverage.h"
> +#include "dp-packet.h"
> +#include "dpif-netdev.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "packets.h"
> +#include "socket-util.h"
> +#include "spinlock.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +COVERAGE_DEFINE(afxdp_cq_empty);
> +COVERAGE_DEFINE(afxdp_fq_full);
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +    ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +                 i * sizeof(struct dp_packet_afxdp))
> +
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +static int xsk_configure_all(struct netdev *netdev);
> +static void xsk_destroy_all(struct netdev *netdev);
> +
> +static struct xsk_umem_info *
> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> +{
> +    struct xsk_umem_config uconfig OVS_UNUSED;
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof *umem);
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +    if (ret) {
> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("umem_pool_init failed");
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("xpacket_pool_init failed");
> +        umem_pool_cleanup(&umem->mpool);
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +             umem->xpool.array,
> +             (char *)umem->xpool.array +
> +             NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0, prog_id;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> +                                   PROD_NUM_DESCS, &idx)) {
> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> +    }
> +
> +    for (i = 0;
> +         i < PROD_NUM_DESCS * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +
> +    /* umem memory region */
> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free_pagealign(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free_pagealign(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +static int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info;
> +    int i, ifindex, n_rxq;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    n_rxq = netdev_n_rxq(netdev);
> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
> +
> +    /* configure each queue */
> +    for (i = 0; i < n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk_info = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk_info) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> +            dev->xsks[i] = NULL;
> +            goto err;
> +        }
> +        dev->xsks[i] = xsk_info;
> +        xsk_info->rx_dropped = 0;
> +        xsk_info->tx_dropped = 0;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk_info)
> +{
> +    struct xsk_umem *umem;
> +
> +    xsk_socket__delete(xsk_info->xsk);
> +    xsk_info->xsk = NULL;
> +
> +    umem = xsk_info->umem->umem;
> +    if (xsk_umem__delete(umem)) {
> +        VLOG_ERR("xsk_umem__delete failed");
> +    }
> +
> +    /* free the packet buffer */
> +    free_pagealign(xsk_info->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk_info->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk_info->umem->xpool);
> +
> +    free(xsk_info->umem);
> +    free(xsk_info);
> +}
> +
> +static void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> +        if (dev->xsks && dev->xsks[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsks[i]);
> +            dev->xsks[i] = NULL;
> +        }
> +    }
> +
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +
> +    free(dev->xsks);
> +}
> +
> +static inline void OVS_UNUSED
> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *str_xdpmode;
> +    int xdpmode, new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> +        return EINVAL;
> +    }
> +
> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> +    if (!strcasecmp(str_xdpmode, "drv")) {
> +        xdpmode = XDP_ZEROCOPY;
> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> +        xdpmode = XDP_COPY;
> +    } else {
> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> +                 netdev_get_name(netdev), str_xdpmode);
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (dev->requested_n_rxq != new_n_rxq
> +        || dev->requested_xdpmode != xdpmode) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        dev->requested_xdpmode = xdpmode;
> +        netdev_request_reconfigure(netdev);
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +static void
> +netdev_afxdp_alloc_txq(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int n_txqs = netdev_n_rxq(netdev);
> +    int i;
> +
> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> +
> +    for (i = 0; i < n_txqs; i++) {
> +        ovs_spinlock_init(&dev->tx_locks[i]);
> +    }
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    free(dev->tx_locks);
> +
> +    netdev->n_rxq = dev->requested_n_rxq;
> +    netdev_afxdp_alloc_txq(netdev);
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +static void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        VLOG_INFO("%s copy mode", __func__);
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        VLOG_INFO("%s drv mode", __func__);
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &prog_id, flags)) {
> +        VLOG_WARN("get xdp program id fails");
> +    }
> +    bpf_set_link_xdp_fd(ifindex, -1, XDP_FLAGS_UPDATE_IF_NOEXIST);
> +}
> +
> +void
> +signal_remove_xdp(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    VLOG_WARN("force remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +static inline void
> +prepare_fill_queue(struct xsk_socket_info *xsk_info)
> +{
> +    struct umem_elem *elems[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    unsigned int idx_fq;
> +    int nb_free;
> +    int i, ret;
> +
> +    umem = xsk_info->umem;
> +
> +    nb_free = PROD_NUM_DESCS / 2;
> +    if (xsk_prod_nb_free(&umem->fq, nb_free) < nb_free) {
> +        return;
> +    }

Why are you using 'PROD_NUM_DESCS / 2' here? IIUC, we're keeping the fill queue
half-loaded. Wouldn't it be better to use BATCH_SIZE instead?

> +
> +    ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        return;
> +    }
> +
> +    if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) {
> +        umem_elem_push_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> +        COVERAGE_INC(afxdp_fq_full);
> +        return;
> +    }
> +
> +    for (i = 0; i < BATCH_SIZE; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&umem->fq, BATCH_SIZE);
> +}
> +
> +int
> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +                      int *qfill)
> +{
> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    struct netdev *netdev = rx->up.netdev;
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info;
> +    struct xsk_umem_info *umem;
> +    uint32_t idx_rx = 0;
> +    int qid = rxq_->queue_id;
> +    unsigned int rcvd, i;
> +
> +    xsk_info = dev->xsks[qid];
> +    if (!xsk_info || !xsk_info->xsk) {
> +        return 0;

This needs to return EAGAIN.

> +    }
> +
> +    prepare_fill_queue(xsk_info);
> +
> +    umem = xsk_info->umem;
> +    rx->fd = xsk_socket__fd(xsk_info->xsk);
> +
> +    rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;

This needs to return EAGAIN.
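
To illustrate the EAGAIN point for both places above, here is a minimal, self-contained sketch. It is not OVS code: 'mock_rx_ring', 'mock_rxq_recv', MOCK_RING_SIZE and MOCK_BATCH are hypothetical names made up for the example. It only shows the calling convention, where the receive function returns 0 when at least one packet was dequeued and EAGAIN when the queue is not set up or has nothing ready.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_RING_SIZE 8    /* Hypothetical ring size, a power of two. */
#define MOCK_BATCH 4        /* Hypothetical maximum batch size. */

/* Hypothetical stand-in for an AF_XDP RX ring: 'prod' and 'cons' are
 * free-running indexes, 'addr' holds umem offsets of received frames. */
struct mock_rx_ring {
    uint32_t prod;
    uint32_t cons;
    uint64_t addr[MOCK_RING_SIZE];
};

/* Dequeues up to MOCK_BATCH frame offsets from 'ring' into 'out'.
 * Returns 0 if at least one frame was dequeued, EAGAIN if the ring is
 * missing or empty, so the caller can tell "no packets" from success. */
static int
mock_rxq_recv(struct mock_rx_ring *ring, uint64_t *out, size_t *n_out)
{
    size_t n = 0;

    *n_out = 0;
    if (!ring) {
        return EAGAIN;              /* Queue is not configured (yet). */
    }

    while (n < MOCK_BATCH && ring->cons != ring->prod) {
        out[n++] = ring->addr[ring->cons++ & (MOCK_RING_SIZE - 1)];
    }
    if (!n) {
        return EAGAIN;              /* Nothing ready to be received. */
    }

    *n_out = n;
    return 0;
}
```

With this convention a caller polling the queue can skip batch processing whenever it sees EAGAIN, instead of handling an empty batch that was reported as success.
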
> +    }
> +
> +    /* Setup a dp_packet batch from descriptors in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(umem->xpool.array, index);
> +        packet = &xpacket->packet;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk_info->rx, rcvd);
> +
> +    if (qfill) {
> +        /* TODO: return the number of remaining packets in the queue. */
> +        *qfill = 0;
> +    }
> +
> +#ifdef AFXDP_DEBUG
> +    log_xsk_stat(xsk_info);
> +#endif
> +    return 0;
> +}
> +
> +static inline int
> +kick_tx(struct xsk_socket_info *xsk_info)
> +{
> +    int ret;
> +
> +    if (!xsk_info->outstanding_tx) {
> +        return 0;
> +    }
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
> +                 NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    uintptr_t addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +static void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +    struct dp_packet_afxdp *xpacket = NULL;
> +    struct dp_packet *packet;
> +    void *elems[BATCH_SIZE];
> +    uintptr_t addr;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (xpacket->mpool) {

The check above seems useless. Also, if any packet is skipped, we'll push
a garbage pointer to the mpool, because 'elems[i]' stays uninitialized while
'batch->count' elements are still pushed. If you're worried about the value,
you can just assert:

    ovs_assert(xpacket->mpool);

> +            void *base = dp_packet_base(packet);
> +
> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +            elems[i] = (void *)addr;
> +        }
> +    }
> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +    dp_packet_batch_init(batch);
> +}
> +
> +static inline bool
> +check_free_batch(struct dp_packet_batch *batch)
> +{
> +    struct umem_pool *first_mpool = NULL;
> +    struct dp_packet_afxdp *xpacket;
> +    struct dp_packet *packet;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (packet->source != DPBUF_AFXDP) {
> +            return false;
> +        }
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (i == 0) {
> +            first_mpool = xpacket->mpool;
> +            continue;
> +        }
> +        if (xpacket->mpool != first_mpool) {
> +            return false;
> +        }
> +    }
> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> +    return true;
> +}
> +
> +static inline void
> +afxdp_complete_tx(struct xsk_socket_info *xsk_info)
> +{
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    uint32_t idx_cq = 0;
> +    int tx_to_free = 0;
> +    int tx_done, j;
> +
> +    umem = xsk_info->umem;
> +    tx_done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx_cq);
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t *addr;
> +
> +        addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
> +        if (*addr == 0) {

'addr' is an offset from 'umem->buffer', so zero seems to be a valid value
(the first frame). Maybe it's better to use UINT64_MAX instead?
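
A small sketch of the sentinel idea, under stated assumptions: the names ('sentinel_collect', SENTINEL_ADDR_PUSHED) are hypothetical, and the completion queue is modelled as a plain array of offsets rather than the real libbpf ring. It shows that with UINT64_MAX as the "already pushed" marker, the frame at offset 0 is still recycled, which it would not be if 0 were the marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "already recycled" marker. UINT64_MAX can never be a valid
 * frame offset inside the umem buffer, while 0 is the first frame. */
#define SENTINEL_ADDR_PUSHED UINT64_MAX

/* Scans 'n' completion entries in 'cq'. Every entry that is not yet marked
 * is copied to 'out' and then marked as pushed. Returns the number of
 * offsets stored in 'out', i.e. the number of frames to recycle. */
static size_t
sentinel_collect(uint64_t *cq, size_t n, uint64_t *out)
{
    size_t n_free = 0;

    for (size_t i = 0; i < n; i++) {
        if (cq[i] == SENTINEL_ADDR_PUSHED) {
            continue;               /* This frame was recycled already. */
        }
        out[n_free++] = cq[i];      /* Offset 0 is handled like any other. */
        cq[i] = SENTINEL_ADDR_PUSHED;
    }
    return n_free;
}
```

For a queue holding offsets 0 and 2048 plus one already marked entry, the first pass collects both real offsets and a second pass over the same entries collects nothing.
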
> +            /* The elem has been pushed already */
> +            continue;
> +        }
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + *addr);
> +        elems_push[tx_to_free] = elem;
> +        *addr = 0; /* Mark as pushed */
> +        tx_to_free++;
> +    }
> +
> +    umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
> +
> +    if (tx_done > 0) {
> +        xsk_ring_cons__release(&umem->cq, tx_done);
> +        xsk_info->outstanding_tx -= tx_done;

We should probably subtract 'tx_to_free' instead, and do this outside of
the 'if'.

> +    } else {
> +        COVERAGE_INC(afxdp_cq_empty);
> +    }
> +}
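
A rough sketch of the accounting suggested above, not the actual patch: 'acct_xsk', 'acct_complete_tx' and ACCT_ADDR_PUSHED are hypothetical names, the completion queue is a plain array, and the UINT64_MAX marker is an assumption carried over from the earlier comment. The point is only that 'outstanding_tx' goes down by the number of frames actually recycled ('tx_to_free'), and that this happens outside the 'tx_done > 0' branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACCT_ADDR_PUSHED UINT64_MAX   /* Hypothetical "already pushed" mark. */

/* Hypothetical, reduced stand-in for struct xsk_socket_info. */
struct acct_xsk {
    uint32_t outstanding_tx;    /* Frames handed to the kernel for TX. */
    unsigned int cq_empty;      /* Stands in for the coverage counter. */
};

/* Processes 'tx_done' completion entries from 'cq'. Entries that are
 * already marked are skipped, so they are not counted a second time. */
static void
acct_complete_tx(struct acct_xsk *xsk, uint64_t *cq, size_t tx_done)
{
    size_t tx_to_free = 0;

    for (size_t j = 0; j < tx_done; j++) {
        if (cq[j] == ACCT_ADDR_PUSHED) {
            continue;
        }
        cq[j] = ACCT_ADDR_PUSHED;   /* Mark as pushed. */
        tx_to_free++;
    }

    if (!tx_done) {
        xsk->cq_empty++;
    }
    /* A real implementation would release the ring entries here. */

    /* Subtract only what was really recycled, regardless of the branch. */
    xsk->outstanding_tx -= tx_to_free;
}
```

With three outstanding frames and a completion queue where one of three entries is already marked, 'outstanding_tx' drops to 1 rather than 0, so the frame that has not completed yet is still accounted for.
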