On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
>
> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasow...@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >
> > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasow...@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > > > >
> > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasow...@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > > > > > > > >
> > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS, so that might be one case. Taking into account that just an RCU lock and unlock in the virtio-net code takes more time than a packet copy, some batching on the QEMU side should improve performance significantly. And it shouldn't be too hard to implement.
> > > > > > > > > >>
> > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating a kernel thread for async Tx, similarly to what io_uring allows. Currently Tx on non-zero-copy interfaces is synchronous, and that doesn't allow it to scale well.
> > > > > > > > > >
> > > > > > > > > > Interestingly, there is actually a lot of "duplication" between io_uring and AF_XDP:
> > > > > > > > > >
> > > > > > > > > > 1) both have a similar memory model (user-registered memory)
> > > > > > > > > > 2) both use rings for communication
> > > > > > > > > >
> > > > > > > > > > I wonder if we can let io_uring talk directly to AF_XDP.
> > > > > > > > >
> > > > > > > > > Well, if we submit poll() in the QEMU main loop via io_uring, then we can avoid the cost of synchronous Tx for non-zero-copy modes, i.e. for virtual interfaces. The io_uring thread in the kernel will be able to perform the transmission for us.
> > > > > > > >
> > > > > > > > It would be nice if we could use an iothread/vhost rather than the main loop, even if io_uring can use kthreads. We could avoid the memory translation cost.
> > > > > > >
> > > > > > > The QEMU event loop (AioContext) has io_uring code (util/fdmon-io_uring.c) but it's disabled at the moment. I'm working on patches to re-enable it and will probably send them in July. The patches also add an API to submit arbitrary io_uring operations so that you can do stuff besides file descriptor monitoring. Both the main loop and IOThreads will be able to use io_uring on Linux hosts.
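(As an illustrative aside, not part of the quoted thread: a minimal standalone liburing sketch of the "submit poll() via io_uring" idea mentioned above could look like the following. xsk_fd is assumed to be an already-bound AF_XDP socket, and real QEMU code would go through the AioContext fdmon/io_uring API rather than a private ring.)

/* Standalone liburing sketch: wait for an AF_XDP socket to become
 * readable/writable through io_uring instead of a blocking poll() call. */
#include <liburing.h>
#include <poll.h>
#include <stdio.h>

static int poll_xsk_via_io_uring(int xsk_fd)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int ret;

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        return ret;
    }

    /* Queue a one-shot poll on the AF_XDP socket. */
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        io_uring_queue_exit(&ring);
        return -1;
    }
    io_uring_prep_poll_add(sqe, xsk_fd, POLLIN | POLLOUT);
    io_uring_submit(&ring);

    /* The completion carries the returned poll mask in cqe->res. */
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        printf("poll revents: 0x%x\n", (unsigned)cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return ret;
}

Whether this actually beats a plain poll()/sendto() kick for the non-zero-copy Tx path would of course need to be measured.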
> > > > > >
> > > > > > Just to make sure I understand: if we still need a copy from the guest to an io_uring buffer, we still need to go via the memory API for the GPA, which seems expensive.
> > > > > >
> > > > > > Vhost seems to be a shortcut for this.
> > > > >
> > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > >
> > > > > Simply using io_uring for the event loop (file descriptor monitoring) doesn't involve an extra buffer, but the packet payload still needs to reside in the AF_XDP umem, so there is a copy between guest memory and umem.
> > > >
> > > > So there would be a translation from GPA to HVA (unless io_uring supports 2 stages), which needs to go via the QEMU memory core. And this part seems to be very expensive according to my tests in the past.
> > >
> > > Yes, but in the current approach where AF_XDP is implemented as a QEMU netdev, there is already QEMU device emulation (e.g. virtio-net) happening. So the GPA to HVA translation will happen anyway in device emulation.
> >
> > Just to make sure we're on the same page.
> >
> > I meant that AF_XDP can do more than e.g. 10 Mpps. So if we still use the QEMU netdev, it would be very hard to achieve that if we stick to the QEMU memory core translations, which need to take care of too much extra stuff. That's why I suggest using vhost in io threads, which only cares about RAM, so the translation could be very fast.
>
> What does using "vhost in io threads" mean?

It means a vhost userspace dataplane that is implemented via io threads.
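To illustrate the "only cares about RAM" point: a vhost-style dataplane can translate GPA to HVA with a flat table of RAM regions. The sketch below is only an illustration of that idea, not vhost or QEMU code; it borrows struct vhost_memory / struct vhost_memory_region from <linux/vhost.h>, the same layout userspace hands to vhost via VHOST_SET_MEM_TABLE.

#include <stdint.h>
#include <stddef.h>
#include <linux/vhost.h>    /* struct vhost_memory, struct vhost_memory_region */

/* Translate a guest physical address to a host virtual address using a flat
 * vhost-style memory table: just a linear scan over a handful of RAM regions,
 * with none of the MemoryRegion lookup, RCU or dispatch work that the QEMU
 * memory core performs. */
static void *gpa_to_hva(const struct vhost_memory *mem, uint64_t gpa,
                        uint64_t len)
{
    uint32_t i;

    for (i = 0; i < mem->nregions; i++) {
        const struct vhost_memory_region *reg = &mem->regions[i];

        if (gpa >= reg->guest_phys_addr &&
            gpa - reg->guest_phys_addr + len <= reg->memory_size) {
            return (void *)(uintptr_t)(reg->userspace_addr +
                                       (gpa - reg->guest_phys_addr));
        }
    }
    return NULL;    /* not backed by guest RAM */
}

A real implementation would also need to handle regions changing at runtime (e.g. memory hotplug), which is part of where the extra cost of the full QEMU memory core comes from.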
> Is that a vhost kernel approach where userspace dedicates threads (the stuff that Mike Christie has been working on)? I haven't looked at how Mike's recent patches work, but I wouldn't call that approach QEMU IOThreads, because the threads probably don't run the AioContext event loop and instead execute vhost kernel code the entire time.
>
> But despite these questions, I think I'm beginning to understand that you're proposing a vhost_net.ko AF_XDP implementation instead of a userspace QEMU AF_XDP netdev implementation.

Sorry for being unclear, but I'm not proposing that.

> I wonder if any optimizations can be made when the AF_XDP user is kernel code instead of userspace code.

The only possible way to go would be to adapt the AF_XDP umem memory model to vhost, and I'm not sure there is anything we can gain from that.

> > > > >
> > > > > Are you thinking about AF_XDP passthrough where the guest directly interacts with AF_XDP?
> > > >
> > > > This could be another way to solve it, since it won't use QEMU's memory core to do the translation.
> > > >
> > > > > If umem encompasses guest memory,
> > > >
> > > > It requires you to pin the whole guest memory, and a GPA to HVA translation is still required.
> > >
> > > Ilya mentioned that umem uses relative offsets instead of absolute memory addresses. In the AF_XDP passthrough case this means no address translation needs to be added to AF_XDP.
> >
> > I don't see how it can avoid the translation, as it works at the level of HVAs, while what guests submit are PAs or even IOVAs.
>
> In a passthrough scenario the guest is doing AF_XDP, so it writes relative umem offsets, thereby eliminating address translation concerns (the addresses are not PAs or IOVAs). However, this approach probably won't work well with memory hotplug - or at least it will end up becoming a memory translation mechanism in order to support memory hotplug.

Ok.

> > What's more, guest memory could be backed by different memory backends, which means a single umem may not even work.
>
> Maybe. I don't know the nature of umem. If there can be multiple VMAs in the umem range, then there should be no issue mixing different memory backends.

If I understand correctly, a single umem requires contiguous VA at least.

> > > Regarding pinning - I wonder if that's something that can be refined in the kernel by adding an AF_XDP flag that enables on-demand pinning of umem. That way only the rx and tx buffers that are currently in use will be pinned. The disadvantage is the runtime overhead of pinning/unpinning pages. I'm not sure whether it's possible to implement this, I haven't checked the kernel code.
> >
> > It requires the device to do page faults, which is not commonly supported nowadays.
>
> I don't understand this comment. AF_XDP processes each rx/tx descriptor. At that point it can call get_user_pages() or similar in order to pin the page. When the memory is no longer needed, it can put those pages. No fault mechanism is needed. What am I missing?

Ok, I think I get you now: you mean doing the pinning while processing rx/tx buffers?

It's not easy, since GUP itself is not very fast; it would hurt PPS for sure.

Thanks

>
> Stefan
>
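(A closing reference for the "relative offsets" point discussed above: an AF_XDP descriptor carries an offset into the registered umem area rather than a pointer or a physical address, which is why a passthrough design would not need PA/IOVA translation inside AF_XDP itself. The sketch below is illustration only; struct xdp_desc comes from <linux/if_xdp.h>, while the umem_area struct and fill_tx_desc() helper are invented for the example.)

#include <stdint.h>
#include <string.h>
#include <linux/if_xdp.h>   /* struct xdp_desc */

/* Hypothetical description of a user-registered umem area. */
struct umem_area {
    void *base;          /* start of the umem mapping registered with the kernel */
    uint64_t size;
    uint32_t chunk_size;
};

/* Fill a tx descriptor for a frame placed in umem chunk 'chunk_idx'
 * (aligned chunk mode assumed). */
static void fill_tx_desc(struct xdp_desc *desc, const struct umem_area *umem,
                         uint32_t chunk_idx, const void *payload, uint32_t len)
{
    uint64_t offset = (uint64_t)chunk_idx * umem->chunk_size;

    /* Copy the packet into the umem chunk: the guest-memory-to-umem copy
     * discussed in the thread. */
    memcpy((uint8_t *)umem->base + offset, payload, len);

    /* The kernel sees only this relative offset and resolves it against the
     * umem area it already knows about. */
    desc->addr = offset;
    desc->len = len;
    desc->options = 0;
}

The kernel resolves these offsets against the umem pinned when the area was registered with the XDP_UMEM_REG setsockopt, which is also where the whole-guest-memory pinning concern above comes from.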