On 6/30/23 09:44, Jason Wang wrote:
> On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 6/28/23 05:27, Jason Wang wrote:
>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>
>>>> On 6/27/23 04:54, Jason Wang wrote:
>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>>
>>>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>>>>>
>>>>>>>>> AF_XDP is a network socket family that allows communication directly
>>>>>>>>> with the network device driver in the kernel, bypassing most or all
>>>>>>>>> of the kernel networking stack. In essence, the technology is pretty
>>>>>>>>> similar to netmap. But, unlike netmap, AF_XDP is Linux-native and
>>>>>>>>> works with any network interface without driver modifications.
>>>>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>>>>>> require access to character devices or unix sockets. Only access to
>>>>>>>>> the network interface itself is necessary.
>>>>>>>>>
>>>>>>>>> This patch implements a network backend that communicates with the
>>>>>>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory
>>>>>>>>> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx,
>>>>>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>>>>>> memory buffers for the packet data. Data transmission is done by
>>>>>>>>> allocating one of the buffers, copying packet data into it and
>>>>>>>>> placing the pointer into the Tx ring. After transmission, the device
>>>>>>>>> will return the buffer via the Completion ring. On Rx, the device
>>>>>>>>> will take a buffer from a pre-populated Fill ring, write the packet
>>>>>>>>> data into it and place the buffer into the Rx ring.
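
(To make the Tx/Completion flow above concrete: with libxdp's ring
helpers, sending a single packet maps to roughly the following. This is
an illustrative sketch, not code from the patch; the 'tx', 'cq' and
'umem_area' objects are assumed to be already set up via
xsk_socket__create(), and free-frame management is left out.)

  #include <string.h>
  #include <sys/socket.h>
  #include <xdp/xsk.h>

  static void xsk_send_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                           struct xsk_ring_cons *cq, void *umem_area,
                           __u64 frame_addr, const void *pkt, __u32 len)
  {
      __u32 idx;

      /* Reserve a Tx descriptor, copy the packet into a umem frame and
       * hand the descriptor to the kernel. */
      if (xsk_ring_prod__reserve(tx, 1, &idx) == 1) {
          struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);

          desc->addr = frame_addr;
          desc->len = len;
          memcpy(xsk_umem__get_data(umem_area, frame_addr), pkt, len);
          xsk_ring_prod__submit(tx, 1);

          /* Kick the kernel to start transmission. */
          sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
      }

      /* After transmission the kernel returns the frame via the
       * Completion ring; reclaim it so it can be reused. */
      if (xsk_ring_cons__peek(cq, 1, &idx) == 1) {
          __u64 done = *xsk_ring_cons__comp_addr(cq, idx);

          /* 'done' is the umem offset of a now-free frame; a real
           * application would return it to its free-frame pool. */
          (void)done;
          xsk_ring_cons__release(cq, 1);
      }
  }
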
>>>>>>>>> The AF_XDP network backend takes on the communication with the host
>>>>>>>>> kernel and the network interface and forwards packets to/from the
>>>>>>>>> peer device in QEMU.
>>>>>>>>>
>>>>>>>>> Usage example:
>>>>>>>>>
>>>>>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>>>>>
>>>>>>>>> The XDP program bridges the socket with a network interface. It can
>>>>>>>>> be attached to the interface in 2 different modes:
>>>>>>>>>
>>>>>>>>> 1. skb - this mode should work for any interface and doesn't
>>>>>>>>>    require driver support, with the caveat of lower performance.
>>>>>>>>>
>>>>>>>>> 2. native - this does require support from the driver and allows
>>>>>>>>>    bypassing skb allocation in the kernel and potentially using
>>>>>>>>>    zero-copy while getting packets in/out of userspace.
>>>>>>>>>
>>>>>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>>>>>> The mode can be forced via the 'mode' option. To force 'copy' even
>>>>>>>>> in native mode, use the 'force-copy=on' option. This might be
>>>>>>>>> useful if there is some issue with the driver.
>>>>>>>>>
>>>>>>>>> The 'queues=N' option allows specifying how many device queues
>>>>>>>>> should be open. Note that all the queues that are not open are
>>>>>>>>> still functional and can receive traffic, but it will not be
>>>>>>>>> delivered to QEMU. So, the number of device queues should
>>>>>>>>> generally match the QEMU configuration, unless the device is
>>>>>>>>> shared with something else and the traffic re-direction to the
>>>>>>>>> appropriate queues is correctly configured on the device level
>>>>>>>>> (e.g. with ethtool -N). The 'start-queue=M' option can be used
>>>>>>>>> to specify from which queue id QEMU should start configuring 'N'
>>>>>>>>> queues. It might also be necessary to use this option with
>>>>>>>>> certain NICs, e.g. MLX5 NICs. See the docs for examples.
>>>>>>>>>
>>>>>>>>> In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>>>>>> network interface and configure BTF maps.
>>>>>>>>
>>>>>>>> I think you mean "BPF" actually?
>>>>>>
>>>>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>>>>>
>>>>>>>>
>>>>>>>>> It is possible, however, to run only with CAP_NET_RAW.
>>>>>>>>
>>>>>>>> QEMU often runs without any privileges, so we need to fix it.
>>>>>>>>
>>>>>>>> I think adding support for SCM_RIGHTS via the monitor would be a
>>>>>>>> way to go.
>>>>>>
>>>>>> I looked through the code and it seems like we can run completely
>>>>>> non-privileged as far as the kernel is concerned. We'll need an API
>>>>>> modification in libxdp though.
>>>>>>
>>>>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>>>>>> the base socket creation. Binding and other configuration don't
>>>>>> require any privileges. So, we could create a socket externally
>>>>>> and pass it to QEMU.
>>>>>
>>>>> That's the way TAP works, for example.
>>>>>
>>>>>> Should work, unless it's an oversight from the kernel side that
>>>>>> needs to be patched. :) libxdp doesn't have a way to specify an
>>>>>> externally created socket today, so we'll need to change that.
>>>>>> Should be easy to do though. I can explore.
>>>>>
>>>>> Please do that.
>>>>
>>>> I have a prototype:
>>>>
>>>> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
>>>>
>>>> Need to test it out and then submit a PR to the xdp-tools project.
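
(The API addition in the prototype looks roughly like this: a variant
of xsk_umem__create() that registers the umem on an already created,
e.g. externally received, socket fd instead of creating the socket
itself. Sketch only; see the commit for the exact signature.)

  /* Same semantics as xsk_umem__create(), except that 'fd' is an
   * AF_XDP socket created elsewhere, e.g. by a privileged process
   * that passed it over SCM_RIGHTS. */
  int xsk_umem__create_with_fd(struct xsk_umem **umem,
                               int fd, void *umem_area, __u64 size,
                               struct xsk_ring_prod *fill,
                               struct xsk_ring_cons *comp,
                               const struct xsk_umem_config *config);
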
The change is now accepted:
  https://github.com/xdp-project/xdp-tools/commit/740c839806a02517da5bce7bd0ccaba908b3f675

I can update the QEMU patch with support for passing socket fds. It may
look like this:

  -netdev af-xdp,eth0,queues=2,inhibit=on,sock-fds=fd1,fd2

We'll need an fd per queue. And we may require these fds to be already
added to the xsks map, so QEMU doesn't need xsks-map-fd.

I'd say we'll need to compile support for that conditionally, based on
the availability of xsk_umem__create_with_fd(), as it may not be
available in distributions for some time. The alternative is to require
libxdp >= 1.4.0, which is not released yet.

The last restriction will be that QEMU will need 32 MB of
RLIMIT_MEMLOCK per queue for umem registration, but that should not be
a huge deal, right? The alternative is to have CAP_IPC_LOCK.

And I'd keep the xsks-map-fd parameter for setups that do not have the
latest libxdp and can allow CAP_NET_RAW, so they could still do:

  -netdev af-xdp,eth0,queues=2,inhibit=on,xsks-map-fd=fd

What do you think?

>>>>
>>>>>
>>>>>>
>>>>>> In case the bind syscall actually needs CAP_NET_RAW for some
>>>>>> reason, we could change the kernel and allow non-privileged bind
>>>>>> by utilizing, e.g., SO_BINDTODEVICE, i.e. let the privileged
>>>>>> process bind the socket to a particular device, so QEMU can't
>>>>>> bind it to a random one. Might be a good use case to allow even
>>>>>> if not strictly necessary.
>>>>>
>>>>> Yes.
>>>>
>>>> Will propose something for the kernel as well. We might want
>>>> something more granular though, e.g. binding to a queue instead of
>>>> a device, in case we want better control in the device-sharing
>>>> scenario.
>>>
>>> I may be missing something, but the bind is already done at dev plus
>>> queue right now, isn't it?
>>
>> Yes, the bind() syscall will bind the socket to the dev+queue. I was
>> talking about SO_BINDTODEVICE that only ties the socket to a
>> particular device, but not a queue.
>>
>> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
>> assuming a privileged process does:
>>
>>   fd = socket(AF_XDP, ...);
>>   setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);
>>
>> and sends the fd to a non-privileged process, that non-privileged
>> process will be able to call:
>>
>>   bind(fd, <device>, <random queue>);
>>
>> It will have to use the same device, but can choose any queue, if
>> that queue is not already busy with another socket.
>>
>> So, I was thinking maybe implementing something like an XDP_BINDTOQID
>> option. This way the privileged process may call:
>>
>>   setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);
>>
>> and later the kernel will be able to refuse bind() for any other
>> queue for this particular socket.
>
> Not sure; if file descriptor passing works, we probably don't need
> another way.
>
>>
>> Not sure if that is necessary though. Since we're allocating the
>> socket in the privileged process, that process may add the socket to
>> the BPF map on the correct queue id. This way the non-privileged
>> process will not be able to receive any packets from any other queue
>> on this socket, even if bound to it. And no other AF_XDP socket will
>> be able to be bound to that other queue as well.
>
> I think that's by design, or is anything wrong with this model?

No, it should be fine. I posted a simple SO_BINDTODEVICE change to
bpf-next as an RFC for now, since the tree is closed:
  https://lore.kernel.org/netdev/20230630145831.2988845-1-i.maxim...@ovn.org/

Will re-send a non-RFC once it is open (after the 10th of July, IIRC).
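
(Putting the pieces together, the privileged side of this model could
look roughly like the sketch below. Note the assumptions: the
SO_BINDTODEVICE call is the kernel change proposed in the RFC above
and doesn't work for AF_XDP sockets today, and prepare_xsk_fd() is a
made-up helper name, not anything from the patch.)

  #include <string.h>
  #include <sys/socket.h>
  #include <bpf/bpf.h>

  /* Create an AF_XDP socket, pin it to one device, insert it into the
   * XSK map at the queue id it is supposed to serve, and return the fd
   * to be passed to QEMU over a unix socket with SCM_RIGHTS. */
  static int prepare_xsk_fd(int xsks_map_fd, const char *ifname, __u32 queue_id)
  {
      int fd = socket(AF_XDP, SOCK_RAW, 0);    /* needs CAP_NET_RAW */

      if (fd < 0)
          return -1;

      /* Proposed kernel change: restrict the socket to one device so
       * the unprivileged receiver can't bind it elsewhere. */
      setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname));

      /* Populate the XSK map slot for this queue; no other socket can
       * then take over that slot, and QEMU doesn't need xsks-map-fd. */
      bpf_map_update_elem(xsks_map_fd, &queue_id, &fd, BPF_ANY);

      return fd;
  }
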
>
>> So, the rogue QEMU will be able to hog one extra queue, but it will
>> not be able to intercept any traffic from it, AFAICT. May not be a
>> huge problem after all.
>>
>> SO_BINDTODEVICE would still be nice to have, especially for cases
>> where we give the whole device to one VM.
>
> Then we need to use AF_XDP in the guest, which seems to be a
> different topic. Alibaba is working on AF_XDP support for virtio-net.
>
> Thanks
>
>>
>> Best regards, Ilya Maximets.
>>
>