On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
>
> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasow...@redhat.com> wrote:
> >
> > On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > >
> > > On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasow...@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > > > >
> > > > > On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasow...@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > > > > > > > >
> > > > > > > > > On 6/27/23 04:54, Jason Wang wrote:
> > > > > > > > > > On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 6/26/23 08:32, Jason Wang wrote:
> > > > > > > > > >>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > > > > > > > > >> It is noticeably more performant than a tap with vhost=on in terms of PPS, so that might be one case. Taking into account that just an RCU lock and unlock in the virtio-net code takes more time than a packet copy, some batching on the QEMU side should improve performance significantly. And it shouldn't be too hard to implement.
> > > > > > > > > >>
> > > > > > > > > >> Performance over virtual interfaces may potentially be improved by creating a kernel thread for async Tx, similarly to what io_uring allows. Currently Tx on non-zero-copy interfaces is synchronous, and that doesn't allow it to scale well.
> > > > > > > > > >
> > > > > > > > > > Interestingly, there is actually a lot of "duplication" between io_uring and AF_XDP:
> > > > > > > > > >
> > > > > > > > > > 1) both have a similar memory model (user-registered memory)
> > > > > > > > > > 2) both use rings for communication
> > > > > > > > > >
> > > > > > > > > > I wonder if we can let io_uring talk directly to AF_XDP.
> > > > > > > > >
> > > > > > > > > Well, if we submit poll() in the QEMU main loop via io_uring, then we can avoid the cost of synchronous Tx for non-zero-copy modes, i.e. for virtual interfaces. The io_uring thread in the kernel will be able to perform the transmission for us.
> > > > > > > >
> > > > > > > > It would be nice if we could use an iothread/vhost rather than the main loop, even if io_uring can use kthreads. We could avoid the memory translation cost.
> > > > > > >
> > > > > > > The QEMU event loop (AioContext) has io_uring code (util/fdmon-io_uring.c) but it's disabled at the moment. I'm working on patches to re-enable it and will probably send them in July. The patches also add an API to submit arbitrary io_uring operations so that you can do stuff besides file descriptor monitoring. Both the main loop and IOThreads will be able to use io_uring on Linux hosts.
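(As an illustrative aside, not part of the quoted thread: a minimal standalone liburing sketch of the "submit poll() via io_uring" idea mentioned above could look like the following. xsk_fd is assumed to be an already-bound AF_XDP socket, and real QEMU code would go through the AioContext fdmon/io_uring API rather than a private ring.)

/* Standalone liburing sketch: wait for an AF_XDP socket to become
 * readable/writable through io_uring instead of a blocking poll() call. */
#include <liburing.h>
#include <poll.h>
#include <stdio.h>

static int poll_xsk_via_io_uring(int xsk_fd)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int ret;

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        return ret;
    }

    /* Queue a one-shot poll on the AF_XDP socket. */
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        io_uring_queue_exit(&ring);
        return -1;
    }
    io_uring_prep_poll_add(sqe, xsk_fd, POLLIN | POLLOUT);
    io_uring_submit(&ring);

    /* The completion carries the returned poll mask in cqe->res. */
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        printf("poll revents: 0x%x\n", (unsigned)cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return ret;
}

Whether this actually beats a plain poll()/sendto() kick for the non-zero-copy Tx path would of course need to be measured.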
> > > > > >
> > > > > > Just to make sure I understand: if we still need a copy from the guest to an io_uring buffer, we still need to go via the memory API for the GPA, which seems expensive.
> > > > > >
> > > > > > Vhost seems to be a shortcut for this.
> > > > >
> > > > > I'm not sure how exactly you're thinking of using io_uring.
> > > > >
> > > > > Simply using io_uring for the event loop (file descriptor monitoring) doesn't involve an extra buffer, but the packet payload still needs to reside in the AF_XDP umem, so there is a copy between guest memory and umem.
> > > >
> > > > So there would be a translation from GPA to HVA (unless io_uring supports 2 stages), which needs to go via the QEMU memory core. And this part seems to be very expensive according to my tests in the past.
> > >
> > > Yes, but in the current approach where AF_XDP is implemented as a QEMU netdev, there is already QEMU device emulation (e.g. virtio-net) happening. So the GPA to HVA translation will happen anyway in device emulation.
> >
> > Just to make sure we're on the same page.
> >
> > I meant that AF_XDP can do more than e.g. 10 Mpps. So if we still use the QEMU netdev, it would be very hard to achieve that if we stick to the QEMU memory core translations, which need to take care of too much extra stuff. That's why I suggest using vhost in io threads, which only cares about RAM, so the translation could be very fast.
>
> What does using "vhost in io threads" mean?

It means a vhost userspace dataplane that is implemented via io threads.
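To illustrate the "only cares about RAM" point: a vhost-style dataplane can translate GPA to HVA with a flat table of RAM regions. The sketch below is only an illustration of that idea, not vhost or QEMU code; it borrows struct vhost_memory / struct vhost_memory_region from <linux/vhost.h>, the same layout userspace hands to vhost via VHOST_SET_MEM_TABLE.

#include <stdint.h>
#include <stddef.h>
#include <linux/vhost.h>    /* struct vhost_memory, struct vhost_memory_region */

/* Translate a guest physical address to a host virtual address using a flat
 * vhost-style memory table: just a linear scan over a handful of RAM regions,
 * with none of the MemoryRegion lookup, RCU or dispatch work that the QEMU
 * memory core performs. */
static void *gpa_to_hva(const struct vhost_memory *mem, uint64_t gpa,
                        uint64_t len)
{
    uint32_t i;

    for (i = 0; i < mem->nregions; i++) {
        const struct vhost_memory_region *reg = &mem->regions[i];

        if (gpa >= reg->guest_phys_addr &&
            gpa - reg->guest_phys_addr + len <= reg->memory_size) {
            return (void *)(uintptr_t)(reg->userspace_addr +
                                       (gpa - reg->guest_phys_addr));
        }
    }
    return NULL;    /* not backed by guest RAM */
}

A real implementation would also need to handle regions changing at runtime (e.g. memory hotplug), which is part of where the extra cost of the full QEMU memory core comes from.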
> Is that a vhost kernel approach where userspace dedicates threads (the stuff that Mike Christie has been working on)? I haven't looked at how Mike's recent patches work, but I wouldn't call that approach QEMU IOThreads, because the threads probably don't run the AioContext event loop and instead execute vhost kernel code the entire time.
>
> But despite these questions, I think I'm beginning to understand that you're proposing a vhost_net.ko AF_XDP implementation instead of a userspace QEMU AF_XDP netdev implementation.

Sorry for being unclear, but I'm not proposing that.

> I wonder if any optimizations can be made when the AF_XDP user is kernel code instead of userspace code.

The only possible way to go would be to adapt the AF_XDP umem memory model to vhost, and I'm not sure there is anything we can gain from that.

> > > > >
> > > > > Are you thinking about AF_XDP passthrough where the guest directly interacts with AF_XDP?
> > > >
> > > > This could be another way to solve it, since it won't use QEMU's memory core to do the translation.
> > > >
> > > > > If umem encompasses guest memory,
> > > >
> > > > It requires you to pin the whole guest memory, and a GPA to HVA translation is still required.
> > >
> > > Ilya mentioned that umem uses relative offsets instead of absolute memory addresses. In the AF_XDP passthrough case this means no address translation needs to be added to AF_XDP.
> >
> > I don't see how it can avoid the translation, as it works at the level of HVAs, while what guests submit are PAs or even IOVAs.
>
> In a passthrough scenario the guest is doing AF_XDP, so it writes relative umem offsets, thereby eliminating address translation concerns (the addresses are not PAs or IOVAs). However, this approach probably won't work well with memory hotplug - or at least it will end up becoming a memory translation mechanism in order to support memory hotplug.

Ok.

> > What's more, guest memory could be backed by different memory backends, which means a single umem may not even work.
>
> Maybe. I don't know the nature of umem. If there can be multiple VMAs in the umem range, then there should be no issue mixing different memory backends.

If I understand correctly, a single umem requires contiguous VA at least.

> > > Regarding pinning - I wonder if that's something that can be refined in the kernel by adding an AF_XDP flag that enables on-demand pinning of umem. That way only the rx and tx buffers that are currently in use will be pinned. The disadvantage is the runtime overhead of pinning/unpinning pages. I'm not sure whether it's possible to implement this, I haven't checked the kernel code.
> >
> > It requires the device to do page faults, which is not commonly supported nowadays.
>
> I don't understand this comment. AF_XDP processes each rx/tx descriptor. At that point it can call get_user_pages() or similar in order to pin the page. When the memory is no longer needed, it can put those pages. No fault mechanism is needed. What am I missing?

Ok, I think I get you now: you mean doing the pinning while processing rx/tx buffers?

It's not easy, since GUP itself is not very fast; it would hurt PPS for sure.

Thanks

>
> Stefan
>
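(A closing reference for the "relative offsets" point discussed above: an AF_XDP descriptor carries an offset into the registered umem area rather than a pointer or a physical address, which is why a passthrough design would not need PA/IOVA translation inside AF_XDP itself. The sketch below is illustration only; struct xdp_desc comes from <linux/if_xdp.h>, while the umem_area struct and fill_tx_desc() helper are invented for the example.)

#include <stdint.h>
#include <string.h>
#include <linux/if_xdp.h>   /* struct xdp_desc */

/* Hypothetical description of a user-registered umem area. */
struct umem_area {
    void *base;          /* start of the umem mapping registered with the kernel */
    uint64_t size;
    uint32_t chunk_size;
};

/* Fill a tx descriptor for a frame placed in umem chunk 'chunk_idx'
 * (aligned chunk mode assumed). */
static void fill_tx_desc(struct xdp_desc *desc, const struct umem_area *umem,
                         uint32_t chunk_idx, const void *payload, uint32_t len)
{
    uint64_t offset = (uint64_t)chunk_idx * umem->chunk_size;

    /* Copy the packet into the umem chunk: the guest-memory-to-umem copy
     * discussed in the thread. */
    memcpy((uint8_t *)umem->base + offset, payload, len);

    /* The kernel sees only this relative offset and resolves it against the
     * umem area it already knows about. */
    desc->addr = offset;
    desc->len = len;
    desc->options = 0;
}

The kernel resolves these offsets against the umem pinned when the area was registered with the XDP_UMEM_REG setsockopt, which is also where the whole-guest-memory pinning concern above comes from.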