On 27 Feb 2019, at 11:17, Maxim Mikityanskiy wrote:

-----Original Message-----
From: Björn Töpel <bjorn.to...@gmail.com>
Sent: 27 February, 2019 10:09
To: John Fastabend <john.fastab...@gmail.com>
Cc: Maxim Mikityanskiy <maxi...@mellanox.com>; netdev@vger.kernel.org; Björn Töpel <bjorn.to...@intel.com>; Magnus Karlsson <magnus.karls...@intel.com>; David S. Miller <da...@davemloft.net>; Tariq Toukan <tar...@mellanox.com>; Saeed Mahameed <sae...@mellanox.com>; Eran Ben Elisha <era...@mellanox.com>
Subject: Re: AF_XDP design flaws

On 2019-02-26 17:41, John Fastabend wrote:
On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
Hi everyone,


Hi! It's exciting to see more vendors looking into AF_XDP. ARM was
involved early on, and now Mellanox. :-) It's really important that
the AF_XDP model is a good fit for all drivers/vendors! Thanks for
looking into this.

I would like to discuss some design flaws of the AF_XDP socket (XSK) implementation in the kernel. At the moment I don't see a way to work around them without changing the API, so I would like to make sure that I'm not missing anything, and to suggest and discuss some possible improvements that can be made.

The issues I describe below are caused by the fact that the driver depends on the application doing certain things, and if the application is slow/buggy/malicious, the driver is forced to busy poll because there is no notification mechanism from the application side. I will refer to the i40e driver implementation a lot, as it is the first implementation of AF_XDP, but the issues are general and affect any driver. I already considered trying to fix it at the driver level, but it doesn't seem possible, so it looks like the behavior and implementation of AF_XDP in the kernel have to be changed.

RX side busy polling
====================

On the RX side, the driver expects the application to put some descriptors in the Fill Ring. There is no way for the application to notify the driver that there are more Fill Ring descriptors to take, so the driver is forced to busy poll the Fill Ring if it gets empty. E.g., the i40e driver does it in NAPI poll:

int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
{
        ...
        failure = failure ||
                  !i40e_alloc_rx_buffers_fast_zc(rx_ring, cleaned_count);
        ...
        return failure ? budget : (int)total_rx_packets;
}

Basically, it means that if there are no descriptors in the Fill Ring, NAPI will never stop, draining the CPU.

Possible cases when it happens
------------------------------

1. The application is slow: it received some frames in the RX Ring, and it is still handling the data, so it has no free frames to put into the Fill Ring.

2. The application is malicious: it opens an XSK and puts no frames into the Fill Ring. This can be used as a local DoS attack.

3. The application is buggy and stops filling the Fill Ring for whatever reason (deadlock, waiting for another blocking operation, other bugs).

Although loading an XDP program requires root access, the DoS attack can be targeted at setups that already use XDP, i.e. where an XDP program is already loaded. Even under root, userspace applications should not be able to disrupt system stability just by calling normal APIs without any intention to harm the system, and that is what happens in case 1.

I believe this is by design. If packets per second is your top priority (at the expense of power, CPU, etc.), this is the behavior you might want. To address your points: if you don't trust the application, it should be isolated to a queue pair and cores so the impact is known and managed.


Correct, and I agree with John here. AF_XDP is a raw interface, and
needs to be managed.

OK, right: if you want high pps at the cost of system stability and kernel-userspace isolation, XSK is at your service. That may be a valid point. However, there is still the issue of unintended use.

I believe this is a double-edged sword: on one hand, it appears AF_XDP is aimed as a competitor to DPDK, and is focused only on raw packet speed, at the expense of usability. This isn't necessarily bad, but in my experience, if things aren't simple to use, they tend not to get widely used.

If you use XSK, you can take measures to prevent the disruption, e.g., run only trusted and properly tested applications, or isolate them. However, there is a case where XSKs are not really used on a given machine but are still available (CONFIG_XDP_SOCKETS=y), which opens a huge hole: even if you don't know that XSK exists, you are still vulnerable.

That said, having a flag to back off seems like a good idea. But it should be doable as a feature without breaking the current API.

A flag can be added, e.g., to the sxdp_flags field of struct
sockaddr_xdp, to switch to the behavior without busy polling on empty
Fill Ring.

However, I don't think the vulnerability can be eliminated while keeping
the API intact. To protect the systems that don't use XSK, we need one
of these things:

- Disable XSK support by default, until explicitly enabled by root.

- Disable the old behavior by default, until explicitly enabled by root.

Both of those things will require additional setup on machines that use
XSK, after upgrading the kernel.

Possible way to solve the issue
-------------------------------

When the driver can't take new Fill Ring frames, it shouldn't busy poll. Instead, it signals the failure to the application (e.g., with POLLERR), and after that it's up to the application to restart polling (e.g., by calling sendto()) after refilling the Fill Ring. The issue with this approach is that it changes the API, so we either have to accept that or introduce some API version field.

See above. I like the idea here though.


The uapi doesn't mandate *how* the driver should behave. The rationale for the aggressive i40e fill ring polling (when the ring is empty) is to minimize latency. The "fill ring is empty" behavior might be different on different drivers.

If the behavior is different with different drivers, it will be
difficult to create applications that are portable across NICs. The
point of creating APIs is to provide a single interface to different
implementations, thus hiding the differences from a higher abstraction
level. What the driver does internally is up to the driver, but the
result visible in the userspace should be the same, so we can't just
make some drivers stop and wait when the Fill Ring is empty, while the
others busy poll. It should be controlled by some flags if we want to
preserve both options.

Or if the driver can't obtain new frames from the fill ring, it just
drops the incoming packet, and recycles the old frame in order to keep
the ring full.  (This also assumes that the driver was able to initially
populate the ring in the first place, which eliminates one DoS attack of
never putting anything in the fill ring initially).

I do agree that issues cited in this thread can be resolved with the
use of flags to control the different behaviors - there just needs to
be some agreement on what those behaviors are.
--
Jonathan
