On 27 Feb 2019, at 11:17, Maxim Mikityanskiy wrote:
-----Original Message-----
From: Björn Töpel <bjorn.to...@gmail.com>
Sent: 27 February, 2019 10:09
To: John Fastabend <john.fastab...@gmail.com>
Cc: Maxim Mikityanskiy <maxi...@mellanox.com>; netdev@vger.kernel.org; Björn Töpel <bjorn.to...@intel.com>; Magnus Karlsson <magnus.karls...@intel.com>; David S. Miller <da...@davemloft.net>; Tariq Toukan <tar...@mellanox.com>; Saeed Mahameed <sae...@mellanox.com>; Eran Ben Elisha <era...@mellanox.com>
Subject: Re: AF_XDP design flaws
On 2019-02-26 17:41, John Fastabend wrote:
On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
Hi everyone,
Hi! It's exciting to see more vendors looking into AF_XDP. ARM was involved early on, and now Mellanox. :-) It's really important that the AF_XDP model is a good fit for all drivers/vendors! Thanks for looking into this.
I would like to discuss some design flaws of the AF_XDP socket (XSK) implementation in the kernel. At the moment I don't see a way to work around them without changing the API, so I would like to make sure that I'm not missing anything, and to suggest and discuss some possible improvements that can be made.
The issues I describe below are caused by the fact that the driver depends on the application doing some things, and if the application is slow, buggy or malicious, the driver is forced to busy poll because of the lack of a notification mechanism from the application side. I will refer to the i40e driver implementation a lot, as it is the first implementation of AF_XDP, but the issues are general and affect any driver. I already considered trying to fix it at the driver level, but it doesn't seem possible, so it looks like the behavior and implementation of AF_XDP in the kernel has to be changed.
RX side busy polling
====================
On the RX side, the driver expects the application to put some descriptors in the Fill Ring. There is no way for the application to notify the driver that there are more Fill Ring descriptors to take, so the driver is forced to busy poll the Fill Ring if it gets empty. E.g., the i40e driver does it in NAPI poll:
int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
{
        ...
        failure = failure ||
                  !i40e_alloc_rx_buffers_fast_zc(rx_ring, cleaned_count);
        ...
        return failure ? budget : (int)total_rx_packets;
}
Basically, it means that if there are no descriptors in the Fill Ring, NAPI will never stop, draining the CPU.
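To make the consequence concrete, here is a toy userspace model (not kernel code; all names below are invented for illustration) of the NAPI contract: a poll handler that reports its full budget as consumed gets rescheduled immediately, so returning budget on an empty Fill Ring spins forever:

```c
#include <stdbool.h>

/* Toy model, not kernel code: names are made up for illustration. */
struct toy_ring {
        int fill_entries;       /* descriptors the app put in the Fill Ring */
        int pending_packets;    /* packets waiting in the HW RX queue */
};

/* Mimics i40e_clean_rx_irq_zc(): report "budget used" on alloc failure. */
static int toy_clean_rx_irq_zc(struct toy_ring *r, int budget)
{
        int rx = 0;
        bool failure = false;

        while (rx < budget && r->pending_packets > 0) {
                if (r->fill_entries == 0) {
                        failure = true; /* can't refill from an empty Fill Ring */
                        break;
                }
                r->fill_entries--;
                r->pending_packets--;
                rx++;
        }
        return failure ? budget : rx;
}

/*
 * NAPI contract: re-poll as long as the handler consumed the whole
 * budget.  Returns how many extra polls happened (capped, since the
 * real loop would never stop).
 */
static int toy_napi_iterations(struct toy_ring *r, int budget, int cap)
{
        int iters = 0;

        while (iters < cap && toy_clean_rx_irq_zc(r, budget) == budget)
                iters++;
        return iters;
}
```

With one pending packet and an empty Fill Ring, the model re-polls until the cap; with frames available, it finishes after the first pass and NAPI can sleep.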
Possible cases when it happens
------------------------------
1. The application is slow: it received some frames in the RX Ring, and it is still handling the data, so it has no free frames to put to the Fill Ring.

2. The application is malicious: it opens an XSK and puts no frames to the Fill Ring. It can be used as a local DoS attack.

3. The application is buggy and stops filling the Fill Ring for whatever reason (deadlock, waiting for another blocking operation, other bugs).
Although loading an XDP program requires root access, the DoS attack can be targeted at setups that already use XDP, i.e. where an XDP program is already loaded. Even under root, userspace applications should not be able to disrupt system stability just by calling normal APIs without any intention to destroy the system, and here it happens in case 1.
I believe this is by design. If packets per second is your top priority (at the expense of power, CPU, etc.), this is the behavior you might want. To resolve your points: if you don't trust the application, it should be isolated to a queue pair and cores so the impact is known and managed.
Correct, and I agree with John here. AF_XDP is a raw interface, and needs to be managed.
OK, right, if you want high pps sacrificing system stability and kernel-userspace isolation, XSK is at your service. It may be a valid point. However, there is still the issue of unintended use.
I believe this is a double-edged sword. On one hand, it appears AF_XDP is aimed as a competitor to DPDK and is focused only on raw packet speed, at the expense of usability. This isn't necessarily bad, but in my experience, if things aren't simple to use, they tend not to get widely used.
If you use XSK, you can take measures to prevent the disruption, e.g. run only trusted and properly tested applications, or isolate them. However, there is a case where XSKs are not really used on a given machine but are still available (CONFIG_XDP_SOCKETS=y), opening a huge hole. If you don't even know that XSK exists, you are still vulnerable.
That said, having a flag to back off seems like a good idea. But it should be doable as a feature without breaking the current API.
A flag can be added, e.g., to the sxdp_flags field of struct sockaddr_xdp, to switch to the behavior without busy polling on an empty Fill Ring.
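As a sketch, the new behavior could be just another bit in sxdp_flags. The XDP_* values below mirror the existing bind flags from <linux/if_xdp.h>; XDP_NO_BUSY_POLL is a hypothetical name and value for the proposed opt-out, not part of the uapi:

```c
#include <stdint.h>

/* Existing bind flag values, as in <linux/if_xdp.h> at the time of this thread. */
#define XDP_SHARED_UMEM  (1 << 0)
#define XDP_COPY         (1 << 1)
#define XDP_ZEROCOPY     (1 << 2)
/* Hypothetical: don't busy poll when the Fill Ring is empty. */
#define XDP_NO_BUSY_POLL (1 << 3)

/* Compose sxdp_flags for bind(), optionally opting out of busy polling. */
static uint16_t make_sxdp_flags(int zerocopy, int no_busy_poll)
{
        uint16_t flags = zerocopy ? XDP_ZEROCOPY : XDP_COPY;

        if (no_busy_poll)
                flags |= XDP_NO_BUSY_POLL;
        return flags;
}
```

The new bit doesn't collide with the existing flags, so old binaries that never set it keep the current behavior.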
However, I don't think the vulnerability can be eliminated while keeping the API intact. To protect the systems that don't use XSK, we need one of these things:

- Disable XSK support by default, until explicitly enabled by root.
- Disable the old behavior by default, until explicitly enabled by root.

Both of those things will require additional setup on machines that use XSK, after upgrading the kernel.
Possible way to solve the issue
-------------------------------
When the driver can't take new Fill Ring frames, it shouldn't busy poll. Instead, it signals the failure to the application (e.g., with POLLERR), and after that it's up to the application to restart polling (e.g., by calling sendto()) after refilling the Fill Ring. The issue with this approach is that it changes the API, so we either have to deal with it or introduce some API version field.
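A minimal model of that proposed contract (a hypothetical API, not the current uapi; all names are invented): the driver parks instead of busy polling when the Fill Ring runs empty and raises POLLERR; the application refills the ring and kicks the driver to restart RX:

```c
#include <stdbool.h>

/* Model of the proposed contract; not real driver or uapi code. */
struct xsk_model {
        int fill_entries;       /* descriptors available in the Fill Ring */
        bool rx_stopped;        /* driver parked instead of busy polling */
        bool pollerr;           /* what poll() would report to the app */
};

/* Driver side: try to receive one packet. */
static bool model_rx_one(struct xsk_model *x)
{
        if (x->rx_stopped)
                return false;           /* parked: no CPU burned */
        if (x->fill_entries == 0) {
                x->rx_stopped = true;   /* park NAPI instead of spinning */
                x->pollerr = true;      /* signal the application */
                return false;
        }
        x->fill_entries--;
        return true;
}

/* Application side: on POLLERR, refill the Fill Ring and kick the driver. */
static void model_app_handle_pollerr(struct xsk_model *x, int frames)
{
        if (!x->pollerr)
                return;
        x->fill_entries += frames;      /* produce Fill Ring descriptors */
        x->pollerr = false;
        x->rx_stopped = false;          /* the sendto() "kick" restarts RX */
}
```

The key property is that between the POLLERR and the kick, the driver does no work at all, so a slow or stuck application costs nothing beyond its own packets.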
See above. I like the idea here though.
The uapi doesn't mandate *how* the driver should behave. The rationale for the aggressive i40e fill ring polling (when the ring is empty) is to minimize the latency. The "fill ring is empty" behavior might be different on different drivers.
If the behavior is different with different drivers, it will be difficult to create applications that are portable across NICs. The point of creating APIs is to provide a single interface to different implementations, thus hiding the differences from a higher abstraction level. What the driver does internally is up to the driver, but the result visible in the userspace should be the same, so we can't just make some drivers stop and wait when the Fill Ring is empty, while the others busy poll. It should be controlled by some flags if we want to preserve both options.
Or, if the driver can't obtain new frames from the fill ring, it just drops the incoming packet and recycles the old frame in order to keep the ring full. (This also assumes that the driver was able to populate the ring in the first place, which eliminates one DoS attack of never putting anything in the fill ring initially.)
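A sketch of that drop-and-recycle fallback (assumed behavior, invented names): deliver when the Fill Ring has a frame to swap in, otherwise drop the packet and reuse the same frame, so the hardware ring never drains and NAPI never needs to busy poll:

```c
/* Model of the drop-and-recycle fallback; not real driver code. */
struct recycle_ring {
        int fill_entries;       /* frames the app has provided */
        int hw_frames;          /* frames currently posted to hardware */
        int delivered;          /* packets handed to the application */
        int dropped;            /* packets dropped for lack of frames */
};

static void recycle_rx_one(struct recycle_ring *r)
{
        if (r->fill_entries > 0) {
                /* Normal path: deliver the packet, repost a fresh frame. */
                r->fill_entries--;
                r->delivered++;
        } else {
                /* Fallback: drop the packet, repost the same frame. */
                r->dropped++;
        }
        /* Either way, hw_frames is unchanged: the HW ring stays full. */
}
```

Unlike the busy-poll behavior, the cost of an empty Fill Ring here is borne only by the misbehaving socket's own traffic, not by the CPU.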
I do agree that the issues cited in this thread can be resolved with the use of flags to control the different behaviors - there just needs to be some agreement on what those behaviors are.
--
Jonathan