On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> On Tue, 13 Dec 2016 12:08:21 -0800
> John Fastabend <john.fastab...@gmail.com> wrote:
> 
>> On 16-12-13 11:53 AM, David Miller wrote:
>>> From: John Fastabend <john.fastab...@gmail.com>
>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>   
>>>> What does "zero-copy send packet-pages to the application/socket that
>>>> requested this" mean? At the moment on x86 page-flipping appears to be
>>>> more expensive than memcpy (I can post some data shortly) and shared
>>>> memory was proposed and rejected for security reasons when we were
>>>> working on the bifurcated driver.
>>>
>>> The whole idea is that we map all the active RX ring pages into
>>> userspace from the start.
>>>
>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>> it will also avoid changing the userspace mapping of the pages
>>> as well.
>>>
>>> Thus avoiding the TLB/VM overhead altogether.
>>>   
> 
> Exactly.  It is worth mentioning that pages entering the page pool need
> to be cleared (measured cost 143 cycles), in order to not leak any
> kernel info.  The primary focus of this design is to make sure not to
> leak kernel info to userspace, but with an "exclusive" mode to also
> support isolation between applications.
> 
> 
>> I get this, but it requires applications to be isolated. The pages from
>> a queue cannot be shared between multiple applications in different
>> trust domains. And the application has to be cooperative, meaning it
>> can't "look" at data that has not been marked by the stack as OK. In
>> these schemes we tend to end up with something like virtio/vhost or
>> af_packet.
> 
> I expect 3 modes when enabling RX zero-copy on a page_pool.  The first
> two would require CAP_NET_ADMIN privileges.  All modes have a trust
> domain id that needs to match, e.g. when a page reaches the socket.

Even mode 3 should require CAP_NET_ADMIN; we don't want userspace to
grab queues off the NIC without it, IMO.
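
To make that concrete, here is a minimal sketch of the check I would
expect on the bind path (the function name and signature are invented
for illustration):

/* Hypothetical bind path: gate all three modes, including mode-3
 * "exclusive", on CAP_NET_ADMIN.  All names below are made up. */
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/netdevice.h>

static int pp_bind_rx_queue(struct net_device *dev, u16 queue, int mode)
{
        if (!capable(CAP_NET_ADMIN))
                return -EPERM;

        /* ... steer flows, set up the VMA, record trust domain id ... */
        return 0;
}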

> 
> Mode-1 "Shared": Application chooses the lowest isolation level,
>  allowing multiple applications to mmap the VMA area.

My only point here is that applications can read each other's data, and
all applications need to cooperate; for example, one app could write
continuously to read-only pages, causing faults and so on. This is all
non-standard and doesn't play well with cgroups and "normal"
applications. It requires a new orchestration model.

I'm a bit skeptical of the use case, but I know of a handful of reasons
to use this model. Maybe take a look at the ivshmem implementation in
DPDK.

Also, this still requires a hardware filter to push "application" traffic
onto reserved queues/pages, as far as I can tell.
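
For the record, here is roughly what I picture mode-1 looking like from
the application side; the device node, ring size, and offset convention
are all hypothetical:

/* Hypothetical userspace view of mode-1 "shared": several cooperating
 * apps map the same RX queue pages read-only.  Nothing below prevents
 * them from reading each other's packet data. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/pp_rxq0", O_RDONLY);   /* invented node */
        size_t len = 4096UL * 512;                 /* assumed ring size */
        void *ring;

        if (fd < 0)
                return 1;
        ring = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED)
                return 1;
        /* ... poll descriptors, parse packets in place ... */
        munmap(ring, len);
        close(fd);
        return 0;
}

Even read-only, every application that maps this sees every other
application's traffic on the queue, which is exactly my trust-domain
concern.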

> 
> Mode-2 "Single-user": Application requests to be the only user of the
>  RX queue.  This blocks other applications from mmap'ing the VMA area.
> 

Assuming the data is read-only, sharing with the stack is possibly OK :/.
I guess you would need two pools of memory, one for data and one for
skbs, so you don't leak skb state into user space.

The devil's in the details here. There are lots of hooks in the kernel
that can, for example, push the packet elsewhere with a 'redirect' tc
action. And letting an app "read" data or impact the performance of an
unrelated application is wrong, IMO. Stacked devices also bring another
set of details that are difficult to track down; see all the hardware
offload efforts.

I assume all these concerns are shared between mode-1 and mode-2.
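
To illustrate the two-pools point, here is a rough sketch against a
hypothetical page_pool-style API (none of these names are final, and
the alloc call is assumed, not existing kernel API):

/* Hypothetical: keep skb/metadata allocations out of the pool whose
 * pages get mmap'ed to userspace, so only packet payload is exposed. */
#include <linux/gfp.h>
#include <linux/mm_types.h>

struct page_pool;       /* Jesper's proposed allocator, details TBD */

/* Assumed to exist in the proposed page pool API: */
struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);

struct rx_pools {
        struct page_pool *data_pool;   /* user-mappable packet pages */
        struct page_pool *meta_pool;   /* kernel-only: skbs, descriptors */
};

static struct page *rx_alloc_data_page(struct rx_pools *p)
{
        /* Only data_pool pages may ever enter the shared VMA. */
        return page_pool_alloc_pages(p->data_pool, GFP_ATOMIC);
}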

> Mode-3 "Exclusive": Application requests to own the RX queue.  Packets
>  are no longer allowed for normal netstack delivery.
> 

I have patches for this mode already but haven't pushed them due to
an alternative solution using VFIO.

> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> still allowed to travel the netstack and thus can contain packet data from
> other normal applications.  This is part of the design, to share the
> NIC between netstack and an accelerated userspace application using RX
> zero-copy delivery.
> 

I don't think this is acceptable, to be honest. Letting an application
potentially read from or impact arbitrary other applications on the
system seems like a non-starter, even with CAP_NET_ADMIN. At least this
was the conclusion from the bifurcated driver work some time ago.

> 
>> Any ACLs/filtering/switching/headers need to be done in hardware, or
>> the application trust boundaries are broken.
> 
> The software solution outlined allows the application to make the choice
> of what trust boundary it wants.
> 
> The "exclusive" mode-3 makes the most sense together with HW filters.
> Already today, we support creating a new RX queue based on an ethtool
> ntuple HW filter, and then you simply attach your application to that
> queue in mode-3 and have full isolation.
> 

I'm still pretty fuzzy on why mode-1 and mode-2 do not need HW filters.
Without hardware filters, we have no way of knowing whose data ends up
in the page.
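
Just so we are talking about the same mechanism: the HW-filter step is
an ordinary ntuple rule, i.e. the ioctl behind
"ethtool -N eth0 flow-type udp4 dst-port 4242 action 2".  A sketch with
arbitrary port/queue values and error handling trimmed (fd is any
AF_INET socket):

/* Steer UDP dst-port traffic to a given RX queue via the standard
 * ethtool ntuple interface; this is the isolation prerequisite. */
#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int steer_port_to_queue(int fd, const char *ifname,
                               int port, int queue)
{
        struct ethtool_rxnfc nfc;
        struct ifreq ifr;

        memset(&nfc, 0, sizeof(nfc));
        nfc.cmd = ETHTOOL_SRXCLSRLINS;
        nfc.fs.flow_type = UDP_V4_FLOW;
        nfc.fs.h_u.udp_ip4_spec.pdst = htons(port);
        nfc.fs.m_u.udp_ip4_spec.pdst = 0xffff;  /* compare all port bits */
        nfc.fs.ring_cookie = queue;             /* deliver to this RX queue */
        nfc.fs.location = RX_CLS_LOC_ANY;       /* driver picks a rule slot */

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&nfc;
        return ioctl(fd, SIOCETHTOOL, &ifr);
}

My question stands for mode-1/mode-2: without a rule like this, the
queue carries whatever traffic RSS happens to spread onto it.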

>  
>> If the above cannot be met, then a copy is needed. What I am trying
>> to tease out is the above comment, along with other statements like
>> this: "can be done without HW filter features".
> 
> Does this address your concerns?
> 

I think we need to enforce strong isolation. An application should not
be able to read data from or impact other applications. I gather this is
the case, per the comment about normal applications in mode-2. A slightly
weaker statement would be to say applications can only impact/read data
of other applications in their own trust domain. This might be OK as well.
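
If we do end up with that weaker model, I would expect the trust-domain
check Jesper mentioned to sit on the delivery path, something like this
(every name here is invented):

/* Hypothetical delivery-path check for the weaker model: a page may
 * only reach a socket in the same trust domain it was filled under. */
#include <linux/types.h>

struct pp_page_meta {
        u32 trust_domain_id;    /* stamped when the page enters the pool */
};

struct pp_sock_state {
        u32 trust_domain_id;    /* fixed when the socket binds the queue */
};

static bool pp_page_allowed(const struct pp_page_meta *meta,
                            const struct pp_sock_state *sk)
{
        return meta->trust_domain_id == sk->trust_domain_id;
}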

.John
