On 02/16/2018 23:41, Jesper Dangaard Brouer wrote:
On Fri, 16 Feb 2018 13:31:37 +0800
Jason Wang <jasow...@redhat.com> wrote:

On 02/16/2018 06:43, Jesper Dangaard Brouer wrote:
The virtio_net code has three different RX code-paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.

Function(1): receive_big() does not support XDP.
Function(2): receive_small() support XDP fully and uses build_skb().
Function(3): receive_mergeable() supports XDP, but XDP_REDIRECT is broken; uses napi_alloc_skb().

The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP's assumptions: XDP expects
packet header+data in a single page, with enough tail room for skb_shared_info.

The longer explanation is that receive_mergeable() tries to
work around and satisfy these XDP requirements, e.g. by having a
function xdp_linearize_page() that allocates a page and memcpys the RX
buffers into it (in case the packet is scattered across multiple RX
buffers).  This currently satisfies XDP_PASS, XDP_DROP and XDP_TX (but
only because we have not implemented bpf_xdp_adjust_tail yet).

The XDP_REDIRECT action combined with cpumap is broken, and causes
hard-to-debug crashes.  The main issue is that the RX packet does not
have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packet's head-room (in which cpumap
stores info).

Reproducing depends on the packet payload length and whether the
RX-buffer size happens to have tail-room for skb_shared_info or not.
But to make this even harder to troubleshoot, the RX-buffer size
changes dynamically at runtime, based on an Exponentially Weighted
Moving Average (EWMA) over the packet length, when refilling RX rings.

This patch only disables XDP_REDIRECT support in the receive_mergeable()
case, because it can cause a real crash.

But IMHO we should NOT support XDP in receive_mergeable() at all,
because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly violated
in receive_mergeable(), which e.g. tracks the average buffer size at
runtime to save memory.
I agree to disable it for -net now.
Okay... I'll send an official patch later.

For net-next, we can probably do:

- drop xdp_linearize_page() and do XDP through the generic XDP helper
   after the skb has been built
I disagree strongly here - it makes no sense.

Why do you want to explicit fallback to Generic-XDP?
(... then all the performance gain is gone!)

Note this only happens when:

1) The Rx buffer size is under-estimated; we could disable the estimation and then this won't happen.
2) Headroom is not sufficient; we try hard not to stop the device during XDP set, so this can happen, but only for the first several packets.

So this looks pretty fine and removes a lot of complex code.

And besides, a couple of function calls later, the generic XDP code
will/can get invoked anyhow...

What if we choose to use the native mode of XDP?



Take a step back:
  What is the reason/use-case for implementing XDP inside virtio_net?

 From a DDoS/performance perspective, XDP in virtio_net happens on the
"wrong side", as it is activated _inside_ the guest OS, which is too
late for a DDoS filter: the guest kick/switch overhead has already
occurred.

I don't see any difference for virtio-net here. Consider a real hardware NIC: XDP (except in the offload case) also starts dropping packets only after they reach the hardware.


I do use XDP_DROP inside the guest (driver virtio_net), but just to
perform zoom-in benchmarking with perf-record, isolating the early RX
code path in the guest.  (Using an iptables "raw" table drop is almost
as useful for that purpose).


We cannot assume the type of virtio-net backend; it could be DPDK or another high-speed implementation. And I'm pretty sure there are more use cases; here are two:

- Use XDP to accelerate nested VMs
- XDP offload to the host


The XDP ndo_xdp_xmit in tuntap/tun.c (which you also implemented) is
significantly more interesting, as it allows us to skip large parts of
the network stack and redirect from a physical device (ixgbe) into a
guest device.  I ran a benchmark:
  - 0.5 Mpps with normal code path into device with driver tun
  - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device

Plus, there are indications that 3.7 Mpps is not the real limit, as the
guest CPU doing XDP_DROP is 75% idle... thus this is likely a
scheduling + queue size issue.

Yes, XDP_DROP can do more (but I forget the exact number). Btw, testpmd (in the guest) can give me about 3 Mpps when doing forwarding (io mode). The main bottleneck in this case is vhost: XDP_REDIRECT can provide about 8 Mpps to tun, but vhost can only receive about 4 Mpps, and vhost TX can only do about 3 Mpps.

Thanks



- disable EWMA when XDP is set and reserve enough tailroom.

Besides the described bug:

Update(1): There is also an OOM/memory leak in the XDP_REDIRECT code,
which receive_small() is likely also affected by.

Update(2): Also observed a guest crash when redirecting out another
virtio_net device, when the device is down.
I will have a look at these issues. (It is a holiday in China now, so I
will do it afterwards).


