Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-15 Thread Christoph Lameter
On Thu, 15 Dec 2016, Jesper Dangaard Brouer wrote:

> > It sounds like Christoph's RDMA approach might be the way to go.
>
> I'm getting more and more fond of Christoph's RDMA approach.  I do
> think we will end up with something close to that approach.  I just
> wanted to get review on my idea first.
>
> IMHO the major blocker for the RDMA approach is not HW filters
> themselves, but a common API that applications can call to register
> what goes into the HW queues in the driver.  I suspect it will be a
> long project agreeing between vendors.  And agreeing on semantics.

Some of the methods from the RDMA subsystem (like queue pairs, the various
queues etc) could be extracted and used here. Multiple vendors already
support these features and some devices operate both in an RDMA and a
network stack mode. Having that all supported by the network stack would
reduce overhead for those vendors.

Multiple new vendors are coming up in the RDMA subsystem because the
regular network stack does not have the right performance for high speed
networking. I would rather see them have a way to get that functionality
from the regular network stack. Please add some extensions so that the
RDMA style I/O can be made to work. Even the hardware of the new NICs is
already prepared to work with the data structures of the RDMA subsystem.
That provides an area of standardization we could hook into, but it
should be done properly and cleanly in the context of mainstream
network stack support.
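
For concreteness, a minimal user-space sketch of the queue-pair primitives
being referenced (protection domain, registered memory region, completion
queue, QP).  The sizes and the QP type are illustrative, and error handling
is omitted:

/* Illustrative verbs setup only -- not code from this thread. */
#include <stdlib.h>
#include <infiniband/verbs.h>

int example_qp_setup(struct ibv_context *ctx)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    void *buf = malloc(4096);

    /* Pin and register the buffer; the NIC may now DMA into it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

    /* Completion queue shared by send and receive work requests. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RAW_PACKET,  /* or IBV_QPT_RC / IBV_QPT_UD */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    return (pd && mr && cq && qp) ? 0 : -1;
}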


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-15 Thread Alexander Duyck
On Thu, Dec 15, 2016 at 12:28 AM, Jesper Dangaard Brouer
 wrote:
> On Wed, 14 Dec 2016 14:45:00 -0800
> Alexander Duyck  wrote:
>
>> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
>>  wrote:
>> > On Wed, 14 Dec 2016 08:45:08 -0800
>> > Alexander Duyck  wrote:
>> >
>> >> I agree.  This is a no-go from the performance perspective as well.
>> >> At a minimum you would have to be zeroing out the page between uses to
>> >> avoid leaking data, and that assumes that the program we are sending
>> >> the pages to is slightly well behaved.  If we think zeroing out an
>> >> sk_buff is expensive wait until we are trying to do an entire 4K page.
>> >
>> > Again, yes the page will be zero'ed out, but only when entering the
>> > page_pool. Because they are recycled they are not cleared on every use.
>> > Thus, performance does not suffer.
>>
>> So you are talking about recycling, but not clearing the page when it
>> is recycled.  That right there is my problem with this.  It is fine if
>> you assume the pages are used by the application only, but you are
>> talking about using them for both the application and for the regular
>> network path.  You can't do that.  If you are recycling you will have
>> to clear the page every time you put it back onto the Rx ring,
>> otherwise you can leak the recycled memory into user space and end up
>> with a user space program being able to snoop data out of the skb.
>>
>> > Besides clearing large mem area is not as bad as clearing small.
>> > Clearing an entire page does cost something, as mentioned before 143
>> > cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
>> > cost 36 cycles which is only 7 bytes-per-cycle (256/36).
>>
>> What I am saying is that you are going to be clearing the 4K blocks
>> each time they are recycled.  You can't have the pages shared between
>> user-space and the network stack unless you have true isolation.  If
>> you are allowing network stack pages to be recycled back into the
>> user-space application you open up all sorts of leaks where the
>> application can snoop into data it shouldn't have access to.
>
> See later, the "Read-only packet page" mode should provide a mode where
> the netstack doesn't write into the page, and thus cannot leak kernel
> data. (CAP_NET_ADMIN already give it access to other applications data.)

I think you are kind of missing the point.  The device is writing to
the page on the kernel's behalf.  Therefore the page isn't "Read-only"
and you have an issue since you are talking about sharing a ring
between kernel and userspace.

>> >> I think we are stuck with having to use a HW filter to split off
>> >> application traffic to a specific ring, and then having to share the
>> >> memory between the application and the kernel on that ring only.  Any
>> >> other approach just opens us up to all sorts of security concerns
>> >> since it would be possible for the application to try to read and
>> >> possibly write any data it wants into the buffers.
>> >
>> > This is why I wrote a document[1], trying to outline how this is possible,
>> > going through all the combinations, and asking the community to find
>> > faults in my idea.  Inlining it again, as nobody really replied on the
>> > content of the doc.
>> >
>> > -
>> > Best regards,
>> >   Jesper Dangaard Brouer
>> >   MSc.CS, Principal Kernel Engineer at Red Hat
>> >   LinkedIn: http://www.linkedin.com/in/brouer
>> >
>> > [1] 
>> > https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>> >
>> > ===
>> > Memory Model for Networking
>> > ===
>> >
>> > This design describes how the page_pool changes the memory model for
>> > networking in the NIC (Network Interface Card) drivers.
>> >
>> > .. Note:: The catch for driver developers is that, once an application
>> >   requests zero-copy RX, then the driver must use a specific
>> >   SKB allocation mode and might have to reconfigure the
>> >   RX-ring.
>> >
>> >
>> > Design target
>> > =
>> >
>> > Allow the NIC to function as a normal Linux NIC and be shared in a
>> > safe manner between the kernel network stack and an accelerated
>> > userspace application using RX zero-copy delivery.
>> >
>> > Target is to provide the basis for building RX zero-copy solutions in
>> > a memory-safe manner.  An efficient communication channel for userspace
>> > delivery is out of scope for this document, but OOM considerations are
>> > discussed below (`Userspace delivery and OOM`_).
>> >
>> > Background
>> > ==
>> >
>> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure
>> > for network packets in the Linux Kernel network stack.  It is a fairly
>> > complex object and can be constructed in several ways.
>> >
>> > From a memory perspective there are two ways depending on
>> > 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-15 Thread Jesper Dangaard Brouer
On Wed, 14 Dec 2016 14:45:00 -0800
Alexander Duyck  wrote:

> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
>  wrote:
> > On Wed, 14 Dec 2016 08:45:08 -0800
> > Alexander Duyck  wrote:
> >  
> >> I agree.  This is a no-go from the performance perspective as well.
> >> At a minimum you would have to be zeroing out the page between uses to
> >> avoid leaking data, and that assumes that the program we are sending
> >> the pages to is slightly well behaved.  If we think zeroing out an
> >> sk_buff is expensive wait until we are trying to do an entire 4K page.  
> >
> > Again, yes the page will be zero'ed out, but only when entering the
> > page_pool. Because they are recycled they are not cleared on every use.
> > Thus, performance does not suffer.  
> 
> So you are talking about recycling, but not clearing the page when it
> is recycled.  That right there is my problem with this.  It is fine if
> you assume the pages are used by the application only, but you are
> talking about using them for both the application and for the regular
> network path.  You can't do that.  If you are recycling you will have
> to clear the page every time you put it back onto the Rx ring,
> otherwise you can leak the recycled memory into user space and end up
> with a user space program being able to snoop data out of the skb.
> 
> > Besides clearing large mem area is not as bad as clearing small.
> > Clearing an entire page does cost something, as mentioned before 143
> > cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
> > cost 36 cycles which is only 7 bytes-per-cycle (256/36).  
> 
> What I am saying is that you are going to be clearing the 4K blocks
> each time they are recycled.  You can't have the pages shared between
> user-space and the network stack unless you have true isolation.  If
> you are allowing network stack pages to be recycled back into the
> user-space application you open up all sorts of leaks where the
> application can snoop into data it shouldn't have access to.

See later, the "Read-only packet page" mode should provide a mode where
the netstack doesn't write into the page, and thus cannot leak kernel
data. (CAP_NET_ADMIN already gives it access to other applications' data.)


> >> I think we are stuck with having to use a HW filter to split off
> >> application traffic to a specific ring, and then having to share the
> >> memory between the application and the kernel on that ring only.  Any
> >> other approach just opens us up to all sorts of security concerns
> >> since it would be possible for the application to try to read and
> >> possibly write any data it wants into the buffers.  
> >
> > This is why I wrote a document[1], trying to outline how this is possible,
> > going through all the combinations, and asking the community to find
> > faults in my idea.  Inlining it again, as nobody really replied on the
> > content of the doc.
> >
> > -
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >
> > [1] 
> > https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
> >
> > ===
> > Memory Model for Networking
> > ===
> >
> > This design describes how the page_pool changes the memory model for
> > networking in the NIC (Network Interface Card) drivers.
> >
> > .. Note:: The catch for driver developers is that, once an application
> >   requests zero-copy RX, then the driver must use a specific
> >   SKB allocation mode and might have to reconfigure the
> >   RX-ring.
> >
> >
> > Design target
> > =
> >
> > Allow the NIC to function as a normal Linux NIC and be shared in a
> > safe manner between the kernel network stack and an accelerated
> > userspace application using RX zero-copy delivery.
> >
> > Target is to provide the basis for building RX zero-copy solutions in
> > a memory-safe manner.  An efficient communication channel for userspace
> > delivery is out of scope for this document, but OOM considerations are
> > discussed below (`Userspace delivery and OOM`_).
> >
> > Background
> > ==
> >
> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> > for network packets in the Linux Kernel network stack.  It is a fairly
> > complex object and can be constructed in several ways.
> >
> > From a memory perspective there are two ways depending on
> > RX-buffer/page state:
> >
> > 1) Writable packet page
> > 2) Read-only packet page
> >
> > To take full advantage of the page_pool, the drivers must actually
> > support handling both options depending on the configuration state of
> > the page_pool.
> >
> > Writable packet page
> > 
> >
> > When the RX packet page is writable, the SKB setup is fairly
> > straightforward.  The SKB->data (and 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Alexander Duyck
On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
 wrote:
> On Wed, 14 Dec 2016 08:45:08 -0800
> Alexander Duyck  wrote:
>
>> I agree.  This is a no-go from the performance perspective as well.
>> At a minimum you would have to be zeroing out the page between uses to
>> avoid leaking data, and that assumes that the program we are sending
>> the pages to is slightly well behaved.  If we think zeroing out an
>> sk_buff is expensive wait until we are trying to do an entire 4K page.
>
> Again, yes the page will be zero'ed out, but only when entering the
> page_pool. Because they are recycled they are not cleared on every use.
> Thus, performance does not suffer.

So you are talking about recycling, but not clearing the page when it
is recycled.  That right there is my problem with this.  It is fine if
you assume the pages are used by the application only, but you are
talking about using them for both the application and for the regular
network path.  You can't do that.  If you are recycling you will have
to clear the page every time you put it back onto the Rx ring,
otherwise you can leak the recycled memory into user space and end up
with a user space program being able to snoop data out of the skb.

> Besides clearing large mem area is not as bad as clearing small.
> Clearing an entire page does cost something, as mentioned before 143
> cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
> cost 36 cycles which is only 7 bytes-per-cycle (256/36).

What I am saying is that you are going to be clearing the 4K blocks
each time they are recycled.  You can't have the pages shared between
user-space and the network stack unless you have true isolation.  If
you are allowing network stack pages to be recycled back into the
user-space application you open up all sorts of leaks where the
application can snoop into data it shouldn't have access to.

>> I think we are stuck with having to use a HW filter to split off
>> application traffic to a specific ring, and then having to share the
>> memory between the application and the kernel on that ring only.  Any
>> other approach just opens us up to all sorts of security concerns
>> since it would be possible for the application to try to read and
>> possibly write any data it wants into the buffers.
>
> This is why I wrote a document[1], trying to outline how this is possible,
> going through all the combinations, and asking the community to find
> faults in my idea.  Inlining it again, as nobody really replied on the
> content of the doc.
>
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [1] 
> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> ===
> Memory Model for Networking
> ===
>
> This design describes how the page_pool changes the memory model for
> networking in the NIC (Network Interface Card) drivers.
>
> .. Note:: The catch for driver developers is that, once an application
>   requests zero-copy RX, then the driver must use a specific
>   SKB allocation mode and might have to reconfigure the
>   RX-ring.
>
>
> Design target
> =
>
> Allow the NIC to function as a normal Linux NIC and be shared in a
> safe manner between the kernel network stack and an accelerated
> userspace application using RX zero-copy delivery.
>
> Target is to provide the basis for building RX zero-copy solutions in
> a memory-safe manner.  An efficient communication channel for userspace
> delivery is out of scope for this document, but OOM considerations are
> discussed below (`Userspace delivery and OOM`_).
>
> Background
> ==
>
> The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> for network packets in the Linux Kernel network stack.  It is a fairly
> complex object and can be constructed in several ways.
>
> From a memory perspective there are two ways depending on
> RX-buffer/page state:
>
> 1) Writable packet page
> 2) Read-only packet page
>
> To take full advantage of the page_pool, the drivers must actually
> support handling both options depending on the configuration state of
> the page_pool.
>
> Writable packet page
> 
>
> When the RX packet page is writable, the SKB setup is fairly
> straightforward.  The SKB->data (and skb->head) can point directly to the page
> data, adjusting the offset according to drivers headroom (for adding
> headers) and setting the length according to the DMA descriptor info.
>
> The page/data needs to be writable because the network stack needs to
> adjust headers (like TimeToLive and checksum) or even add or remove
> headers for encapsulation purposes.
>
> A subtle catch, which also requires a writable page, is that the SKB
> also has an accompanying "shared info" data-structure ``struct
> 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Jesper Dangaard Brouer
On Wed, 14 Dec 2016 08:45:08 -0800
Alexander Duyck  wrote:

> I agree.  This is a no-go from the performance perspective as well.
> At a minimum you would have to be zeroing out the page between uses to
> avoid leaking data, and that assumes that the program we are sending
> the pages to is slightly well behaved.  If we think zeroing out an
> sk_buff is expensive wait until we are trying to do an entire 4K page.

Again, yes, the page will be zeroed out, but only when entering the
page_pool. Because pages are recycled, they are not cleared on every use.
Thus, performance does not suffer.

Besides, clearing a large memory area is not as bad, per byte, as
clearing a small one.  Clearing an entire page does cost something, as
mentioned before: 143 cycles, which is about 28 bytes per cycle
(4096/143).  Clearing 256 bytes costs 36 cycles, which is only about
7 bytes per cycle (256/36).


> I think we are stuck with having to use a HW filter to split off
> application traffic to a specific ring, and then having to share the
> memory between the application and the kernel on that ring only.  Any
> other approach just opens us up to all sorts of security concerns
> since it would be possible for the application to try to read and
> possibly write any data it wants into the buffers.

This is why I wrote a document[1], trying to outline how this is possible,
going through all the combinations, and asking the community to find
faults in my idea.  Inlining it again, as nobody really replied on the
content of the doc.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] 
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===
Memory Model for Networking
===

This design describes how the page_pool changes the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
  requests zero-copy RX, then the driver must use a specific
  SKB allocation mode and might have to reconfigure the
  RX-ring.


Design target
=

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manner between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory-safe manner.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full advantage of the page_pool, the drivers must actually
support handling both options depending on the configuration state of
the page_pool.

Writable packet page


When the RX packet page is writable, the SKB setup is fairly
straightforward.  The SKB->data (and skb->head) can point directly to the page
data, adjusting the offset according to drivers headroom (for adding
headers) and setting the length according to the DMA descriptor info.
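
As a rough illustration (not taken from any specific driver), the
writable-page case boils down to something like the sketch below, where
``pp_page_va``, ``rx_headroom`` and ``pkt_len`` are placeholder names for
the driver's own state::

    /* Illustrative only: SKB setup over a writable RX page. */
    static struct sk_buff *rx_build_writable_skb(void *pp_page_va,
                                                 unsigned int rx_headroom,
                                                 unsigned int pkt_len)
    {
        struct sk_buff *skb;

        /* skb->head/skb->data point straight into the RX page; the
         * skb_shared_info area is placed at the end of this buffer. */
        skb = build_skb(pp_page_va, PAGE_SIZE);
        if (!skb)
            return NULL;

        skb_reserve(skb, rx_headroom);  /* driver headroom for headers   */
        skb_put(skb, pkt_len);          /* length from the DMA descriptor */

        return skb;
    }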

The page/data needs to be writable, because the network stack needs to
adjust headers (like TimeToLive and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also has an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  From a zero-copy point of view, it would be bad to leak this
kind of information.
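
A small helper makes the concern concrete: if ``skb->head`` points into the
RX page (the writable case above), the shared info, including the page
pointers in ``shinfo->frags[]``, lands inside that very page.  Illustrative
sketch only::

    static bool shared_info_in_rx_page(struct sk_buff *skb,
                                       const void *page_va)
    {
        /* skb_shinfo() is simply the area at skb_end_pointer(skb), i.e.
         * it lives in the same buffer as the packet data. */
        const void *shinfo = skb_shinfo(skb);

        return shinfo >= page_va && shinfo < page_va + PAGE_SIZE;
    }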

Read-only packet page
-

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to (above described) skb_shared_info.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of the skb_shared_info struct in the writable area.

4) Set up the pointer to the packet data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.
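
A rough sketch of steps 1-4 with existing helpers (``napi_alloc_skb`` and
``skb_add_rx_frag``); the parameter names are placeholders, not taken from
any particular driver::

    static struct sk_buff *rx_build_readonly_skb(struct napi_struct *napi,
                                                 struct page *rx_page,
                                                 unsigned int offset,
                                                 unsigned int pkt_len,
                                                 unsigned int hdr_len)
    {
        void *va = page_address(rx_page) + offset;
        struct sk_buff *skb;

        /* 1) separate writable area for headers + skb_shared_info */
        skb = napi_alloc_skb(napi, hdr_len);
        if (!skb)
            return NULL;

        /* 2) copy only the packet headers into skb->data */
        memcpy(skb_put(skb, hdr_len), va, hdr_len);

        /* 3) the relevant part of skb_shared_info was already cleared
         *    by the SKB allocation itself */

        /* 4) point a frag at the remaining packet data in the page,
         *    past the headers just copied */
        skb_add_rx_frag(skb, 0, rx_page, offset + hdr_len,
                        pkt_len - hdr_len, PAGE_SIZE);

        return skb;
    }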

It is useful (later) that the network stack has this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Christoph Lameter
On Wed, 14 Dec 2016, Hannes Frederic Sowa wrote:

> Wouldn't changing of the pages cause expensive TLB flushes?

Yes, so you would only want that feature if it's realized at the
page-table level, for debugging issues.

Once you have memory registered with the hardware device, the device
itself could also perform snooping to detect that data was changed and
thus abort the operation.


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Jesper Dangaard Brouer
On Wed, 14 Dec 2016 08:32:10 -0800
John Fastabend  wrote:

> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 13 Dec 2016 12:08:21 -0800
> > John Fastabend  wrote:
> >   
> >> On 16-12-13 11:53 AM, David Miller wrote:  
> >>> From: John Fastabend 
> >>> Date: Tue, 13 Dec 2016 09:43:59 -0800
> >>> 
>  What does "zero-copy send packet-pages to the application/socket that
>  requested this" mean? At the moment on x86 page-flipping appears to be
>  more expensive than memcpy (I can post some data shortly) and shared
>  memory was proposed and rejected for security reasons when we were
>  working on bifurcated driver.
> >>>
> >>> The whole idea is that we map all the active RX ring pages into
> >>> userspace from the start.
> >>>
> >>> And just how Jesper's page pool work will avoid DMA map/unmap,
> >>> it will also avoid changing the userspace mapping of the pages
> >>> as well.
> >>>
> >>> Thus avoiding the TLB/VM overhead altogether.
> >>> 
> > 
> > Exactly.  It is worth mentioning that pages entering the page pool need
> > to be cleared (measured cost 143 cycles), in order to not leak any
> > kernel info.  The primary focus of this design is to make sure not to
> > leak kernel info to userspace, but with an "exclusive" mode also
> > support isolation between applications.
> > 
> >   
> >> I get this but it requires applications to be isolated. The pages from
> >> a queue can not be shared between multiple applications in different
> >> trust domains. And the application has to be cooperative meaning it
> >> can't "look" at data that has not been marked by the stack as OK. In
> >> these schemes we tend to end up with something like virtio/vhost or
> >> af_packet.  
> > 
> > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> > two would require CAP_NET_ADMIN privileges.  All modes have a trust
> > domain id, that need to match e.g. when page reach the socket.  
> 
> Even mode 3 should required cap_net_admin we don't want userspace to
> grab queues off the nic without it IMO.

Good point.

> > 
> > Mode-1 "Shared": Application choose lowest isolation level, allowing
> >  multiple application to mmap VMA area.  
> 
> My only point here is applications can read each others data and all
> applications need to cooperate for example one app could try to write
> continuously to read only pages causing faults and what not. This is
> all non standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
> 
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
> 
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
> 
> > 
> > Mode-2 "Single-user": Application request it want to be the only user
> >  of the RX queue.  This blocks other application to mmap VMA area.
> >   
> 
> Assuming data is read-only sharing with the stack is possibly OK :/. I
> guess you would need to pools of memory for data and skb so you don't
> leak skb into user space.

Yes, as described in the original email and here[1]: "once an application
requests zero-copy RX, then the driver must use a specific SKB
allocation mode and might have to reconfigure the RX-ring."

The SKB allocation mode is "read-only packet page", which is the
current default mode of using skb-frags (also described in the document[1]).

[1] 
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
 
> The devils in the details here. There are lots of hooks in the kernel
> that can for example push the packet with a 'redirect' tc action for
> example. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down see all the
> hardware offload efforts.
> 
> I assume all these concerns are shared between mode-1 and mode-2
> 
> > Mode-3 "Exclusive": Application request to own RX queue.  Packets are
> >  no longer allowed for normal netstack delivery.
> >   
> 
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.

Interesting.

> > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> > still allowed to travel netstack and thus can contain packet data from
> > other normal applications.  This is part of the design, to share the
> > NIC between netstack and an accelerated userspace application using RX
> > zero-copy delivery.
> >   
> 
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Hannes Frederic Sowa
On 14.12.2016 20:43, Christoph Lameter wrote:
> On Wed, 14 Dec 2016, David Laight wrote:
> 
>> If the kernel is doing ANY validation on the frames it must copy the
>> data to memory the application cannot modify before doing the validation.
>> Otherwise the application could change the data afterwards.
> 
> The application is not allowed to change the data after a work request has
> been submitted to send the frame. Changes are possible after the
> completion request has been received.
> 
> The kernel can enforce that by making the frame(s) readonly and thus
> getting a page fault if the app would do such a thing.

As far as I remember, if you gift memory with vmsplice over a pipe to a
TCP socket, you can in fact change the user data while the data is in
transit.  So you should not touch the memory region until you have
received a SOF_TIMESTAMPING_TX_ACK error message on your socket's
error queue, or stuff might break horribly.  I don't think we have a
proper event for UDP that fires after we know the data left the hardware.
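
Roughly, the user-space side of that completion signal looks like the
sketch below (illustrative only: a real sender would enable SO_TIMESTAMPING
once before transmitting and poll() for POLLERR instead of spinning):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/net_tstamp.h>
#include <linux/errqueue.h>

static int wait_for_tx_ack(int fd)
{
    /* Normally set once, before the vmsplice()/send path is used. */
    int flags = SOF_TIMESTAMPING_TX_ACK | SOF_TIMESTAMPING_SOFTWARE |
                SOF_TIMESTAMPING_OPT_ID;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

    for (;;) {
        char ctrl[512];
        struct msghdr msg = { .msg_control = ctrl,
                              .msg_controllen = sizeof(ctrl) };
        struct cmsghdr *cm;

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            continue;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            if (cm->cmsg_level == IPPROTO_IP /* SOL_IP */ &&
                cm->cmsg_type == IP_RECVERR) {
                struct sock_extended_err err;

                memcpy(&err, CMSG_DATA(cm), sizeof(err));
                if (err.ee_origin == SO_EE_ORIGIN_TIMESTAMPING &&
                    err.ee_info == SCM_TSTAMP_ACK)
                    return 0;   /* the gifted buffer may be reused */
            }
        }
    }
}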

In my opinion this is still fine within the kernel's protection limits.
E.g. due to scatter-gather I/O you don't get access to the TCP or UDP
header and thus can't, e.g., spoof or modify the header or
administrative policies, albeit TOCTTOU races with netfilter rules that
match inside the TCP/UDP packets are very well possible on transmit.

Wouldn't changing of the pages cause expensive TLB flushes?

Bye,
Hannes



RE: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Christoph Lameter
On Wed, 14 Dec 2016, David Laight wrote:

> If the kernel is doing ANY validation on the frames it must copy the
> data to memory the application cannot modify before doing the validation.
> Otherwise the application could change the data afterwards.

The application is not allowed to change the data after a work request has
been submitted to send the frame. Changes are possible after the
completion request has been received.

The kernel can enforce that by making the frame(s) readonly and thus
getting a page fault if the app would do such a thing.



RE: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread David Laight
From: Christoph Lameter
> Sent: 14 December 2016 17:00
> On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:
> 
> > > Interesting.  So you even imagine sockets registering memory regions
> > > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > > to register the steering rule (like ibv_create_flow), this would be
> > > doable, but we don't (DPDK actually have an interesting proposal[1])
> >
> > On a side note, this is what windows does with RIO ("registered I/O").
> > Maybe you want to look at the API to get some ideas: allocating and
> > pinning down memory in user space and registering that with sockets to
> > get zero-copy IO.
> 
> Yup that is also what I think. Regarding the memory registration and flow
> steering for user space RX/TX ring please look at the qpair model
> implemented by the RDMA subsystem in the kernel. The memory semantics are
> clearly established there and have been in use for more than a decade.

Isn't there a bigger problem for transmit?
If the kernel is doing ANY validation on the frames it must copy the
data to memory the application cannot modify before doing the validation.
Otherwise the application could change the data afterwards.

David




Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Alexander Duyck
On Wed, Dec 14, 2016 at 8:32 AM, John Fastabend
 wrote:
> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
>> On Tue, 13 Dec 2016 12:08:21 -0800
>> John Fastabend  wrote:
>>
>>> On 16-12-13 11:53 AM, David Miller wrote:
 From: John Fastabend 
 Date: Tue, 13 Dec 2016 09:43:59 -0800

> What does "zero-copy send packet-pages to the application/socket that
> requested this" mean? At the moment on x86 page-flipping appears to be
> more expensive than memcpy (I can post some data shortly) and shared
> memory was proposed and rejected for security reasons when we were
> working on bifurcated driver.

 The whole idea is that we map all the active RX ring pages into
 userspace from the start.

 And just how Jesper's page pool work will avoid DMA map/unmap,
 it will also avoid changing the userspace mapping of the pages
 as well.

 Thus avoiding the TLB/VM overhead altogether.

>>
>> Exactly.  It is worth mentioning that pages entering the page pool need
>> to be cleared (measured cost 143 cycles), in order to not leak any
>> kernel info.  The primary focus of this design is to make sure not to
>> leak kernel info to userspace, but with an "exclusive" mode also
>> support isolation between applications.
>>
>>
>>> I get this but it requires applications to be isolated. The pages from
>>> a queue can not be shared between multiple applications in different
>>> trust domains. And the application has to be cooperative meaning it
>>> can't "look" at data that has not been marked by the stack as OK. In
>>> these schemes we tend to end up with something like virtio/vhost or
>>> af_packet.
>>
>> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
>> two would require CAP_NET_ADMIN privileges.  All modes have a trust
>> domain id, that need to match e.g. when page reach the socket.
>
> Even mode 3 should required cap_net_admin we don't want userspace to
> grab queues off the nic without it IMO.
>
>>
>> Mode-1 "Shared": Application choose lowest isolation level, allowing
>>  multiple application to mmap VMA area.
>
> My only point here is applications can read each others data and all
> applications need to cooperate for example one app could try to write
> continuously to read only pages causing faults and what not. This is
> all non standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
>
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
>
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
>
>>
>> Mode-2 "Single-user": Application request it want to be the only user
>>  of the RX queue.  This blocks other application to mmap VMA area.
>>
>
> Assuming data is read-only sharing with the stack is possibly OK :/. I
> guess you would need to pools of memory for data and skb so you don't
> leak skb into user space.
>
> The devils in the details here. There are lots of hooks in the kernel
> that can for example push the packet with a 'redirect' tc action for
> example. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down see all the
> hardware offload efforts.
>
> I assume all these concerns are shared between mode-1 and mode-2
>
>> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>>  no longer allowed for normal netstack delivery.
>>
>
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.
>
>> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
>> still allowed to travel netstack and thus can contain packet data from
>> other normal applications.  This is part of the design, to share the
>> NIC between netstack and an accelerated userspace application using RX
>> zero-copy delivery.
>>
>
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I agree.  This is a no-go from the performance perspective as well.
At a minimum you would have to be zeroing out the page between uses to
avoid leaking data, and that assumes that the program we are sending
the pages to is slightly well behaved.  If we think zeroing out an
sk_buff is expensive, wait until we are trying to do an entire 4K page.

I think we are stuck with having to use a HW filter to split off
application traffic to a specific ring, and then having to share the
memory between the application and the kernel on that ring only.  Any
other 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Christoph Lameter
On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:

> > Interesting.  So you even imagine sockets registering memory regions
> > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > to register the steering rule (like ibv_create_flow), this would be
> > doable, but we don't (DPDK actually have an interesting proposal[1])
>
> On a side note, this is what windows does with RIO ("registered I/O").
> Maybe you want to look at the API to get some ideas: allocating and
> pinning down memory in user space and registering that with sockets to
> get zero-copy IO.

Yup that is also what I think. Regarding the memory registration and flow
steering for user space RX/TX ring please look at the qpair model
implemented by the RDMA subsystem in the kernel. The memory semantics are
clearly established there and have been in use for more than a decade.



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread John Fastabend
On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> On Tue, 13 Dec 2016 12:08:21 -0800
> John Fastabend  wrote:
> 
>> On 16-12-13 11:53 AM, David Miller wrote:
>>> From: John Fastabend 
>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>   
 What does "zero-copy send packet-pages to the application/socket that
 requested this" mean? At the moment on x86 page-flipping appears to be
 more expensive than memcpy (I can post some data shortly) and shared
 memory was proposed and rejected for security reasons when we were
 working on bifurcated driver.  
>>>
>>> The whole idea is that we map all the active RX ring pages into
>>> userspace from the start.
>>>
>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>> it will also avoid changing the userspace mapping of the pages
>>> as well.
>>>
>>> Thus avoiding the TLB/VM overhead altogether.
>>>   
> 
> Exactly.  It is worth mentioning that pages entering the page pool need
> to be cleared (measured cost 143 cycles), in order to not leak any
> kernel info.  The primary focus of this design is to make sure not to
> leak kernel info to userspace, but with an "exclusive" mode also
> support isolation between applications.
> 
> 
>> I get this but it requires applications to be isolated. The pages from
>> a queue can not be shared between multiple applications in different
>> trust domains. And the application has to be cooperative meaning it
>> can't "look" at data that has not been marked by the stack as OK. In
>> these schemes we tend to end up with something like virtio/vhost or
>> af_packet.
> 
> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> two would require CAP_NET_ADMIN privileges.  All modes have a trust
> domain id, that need to match e.g. when page reach the socket.

Even mode 3 should require cap_net_admin; we don't want userspace to
grab queues off the NIC without it, IMO.

> 
> Mode-1 "Shared": Application choose lowest isolation level, allowing
>  multiple application to mmap VMA area.

My only point here is that applications can read each other's data, and
all applications need to cooperate; for example, one app could try to
write continuously to read-only pages, causing faults and whatnot.  This
is all non-standard and doesn't play well with cgroups and "normal"
applications.  It requires a new orchestration model.

I'm a bit skeptical of the use case but I know of a handful of reasons
to use this model. Maybe take a look at the ivshmem implementation in
DPDK.

Also this still requires a hardware filter to push "application" traffic
onto reserved queues/pages as far as I can tell.

> 
> Mode-2 "Single-user": Application request it want to be the only user
>  of the RX queue.  This blocks other application to mmap VMA area.
> 

Assuming data is read-only, sharing with the stack is possibly OK :/. I
guess you would need two pools of memory, for data and for skbs, so you
don't leak skbs into user space.

The devil's in the details here.  There are lots of hooks in the kernel
that can, for example, push the packet with a 'redirect' tc action.  And
letting an app "read" data or impact the performance of an unrelated
application is wrong IMO.  Stacked devices also provide another set of
details that are a bit difficult to track down; see all the hardware
offload efforts.

I assume all these concerns are shared between mode-1 and mode-2

> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>  no longer allowed for normal netstack delivery.
> 

I have patches for this mode already but haven't pushed them due to
an alternative solution using VFIO.

> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> still allowed to travel netstack and thus can contain packet data from
> other normal applications.  This is part of the design, to share the
> NIC between netstack and an accelerated userspace application using RX
> zero-copy delivery.
> 

I don't think this is acceptable to be honest. Letting an application
potentially read/impact other arbitrary applications on the system
seems like a non-starter even with CAP_NET_ADMIN. At least this was
the conclusion from bifurcated driver work some time ago.

> 
>> Any ACLs/filtering/switching/headers need to be done in hardware or
>> the application trust boundaries are broken.
> 
> The software solution outlined allow the application to make the choice
> of what trust boundary it wants.
> 
> The "exclusive" mode-3 make most sense together with HW filters.
> Already today, we support creating a new RX queue based on ethtool
> ntuple HW filter and then you simply attach your application that queue
> in mode-3, and have full isolation.
> 

I'm still pretty fuzzy on why mode-1 and mode-2 do not need HW filters.
Without hardware filters we have no way of knowing who/what data is
put in the page.

>  
>> If the above can not be met then a copy is needed. What I am trying
>> to tease out is the above 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Jesper Dangaard Brouer
On Tue, 13 Dec 2016 12:08:21 -0800
John Fastabend  wrote:

> On 16-12-13 11:53 AM, David Miller wrote:
> > From: John Fastabend 
> > Date: Tue, 13 Dec 2016 09:43:59 -0800
> >   
> >> What does "zero-copy send packet-pages to the application/socket that
> >> requested this" mean? At the moment on x86 page-flipping appears to be
> >> more expensive than memcpy (I can post some data shortly) and shared
> >> memory was proposed and rejected for security reasons when we were
> >> working on bifurcated driver.  
> > 
> > The whole idea is that we map all the active RX ring pages into
> > userspace from the start.
> > 
> > And just how Jesper's page pool work will avoid DMA map/unmap,
> > it will also avoid changing the userspace mapping of the pages
> > as well.
> > 
> > Thus avoiding the TLB/VM overhead altogether.
> >   

Exactly.  It is worth mentioning that pages entering the page pool need
to be cleared (measured cost 143 cycles), in order to not leak any
kernel info.  The primary focus of this design is to make sure not to
leak kernel info to userspace, but with an "exclusive" mode also
support isolation between applications.


> I get this but it requires applications to be isolated. The pages from
> a queue can not be shared between multiple applications in different
> trust domains. And the application has to be cooperative meaning it
> can't "look" at data that has not been marked by the stack as OK. In
> these schemes we tend to end up with something like virtio/vhost or
> af_packet.

I expect 3 modes when enabling RX-zero-copy on a page_pool.  The first
two would require CAP_NET_ADMIN privileges.  All modes have a trust
domain id, which needs to match, e.g., when a page reaches the socket.

Mode-1 "Shared": The application chooses the lowest isolation level,
 allowing multiple applications to mmap the VMA area.

Mode-2 "Single-user": The application requests to be the only user of
 the RX queue.  This blocks other applications from mmap'ing the VMA area.

Mode-3 "Exclusive": The application requests to own the RX queue.
 Packets are no longer allowed for normal netstack delivery.

Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
still allowed to travel the netstack and thus can contain packet data from
other normal applications.  This is part of the design, to share the
NIC between netstack and an accelerated userspace application using RX
zero-copy delivery.
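
Purely to make the three modes concrete, a hypothetical sketch of what the
request could look like from the application side.  None of these
identifiers exist in the kernel today; they are only illustration:

/* Hypothetical API sketch -- not existing kernel UAPI. */
#include <linux/types.h>

enum pp_zc_mode {
    PP_ZC_SHARED,       /* mode-1: multiple apps may mmap the VMA      */
    PP_ZC_SINGLE_USER,  /* mode-2: only this app may mmap the VMA      */
    PP_ZC_EXCLUSIVE,    /* mode-3: the app owns the RX queue entirely  */
};

struct pp_zc_request {
    __u32 ifindex;          /* NIC to attach to                        */
    __u32 rx_queue;         /* RX queue backing the page_pool          */
    __u32 mode;             /* enum pp_zc_mode                         */
    __u32 trust_domain_id;  /* must match when a page reaches a socket */
};

/* e.g. something along the lines of:
 *   setsockopt(fd, SOL_SOCKET, SO_RX_ZEROCOPY, &req, sizeof(req));
 *   area = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
 * where SO_RX_ZEROCOPY is equally hypothetical.
 */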


> Any ACLs/filtering/switching/headers need to be done in hardware or
> the application trust boundaries are broken.

The software solution outlined allows the application to make the choice
of what trust boundary it wants.

The "exclusive" mode-3 makes the most sense together with HW filters.
Already today, we support creating a new RX queue based on an ethtool
ntuple HW filter; you then simply attach your application to that queue
in mode-3 and have full isolation.
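
For reference, the existing ethtool ntuple path already looks roughly like
this (device, addresses and queue number are just examples):

  # enable ntuple filters, then steer one UDP flow to RX queue 8
  ethtool -K eth0 ntuple on
  ethtool -N eth0 flow-type udp4 dst-ip 198.51.100.10 dst-port 4242 action 8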

 
> If the above can not be met then a copy is needed. What I am trying
> to tease out is the above comment along with other statements like
> this "can be done with out HW filter features".

Does this address your concerns?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread John Fastabend
On 16-12-13 11:53 AM, David Miller wrote:
> From: John Fastabend 
> Date: Tue, 13 Dec 2016 09:43:59 -0800
> 
>> What does "zero-copy send packet-pages to the application/socket that
>> requested this" mean? At the moment on x86 page-flipping appears to be
>> more expensive than memcpy (I can post some data shortly) and shared
>> memory was proposed and rejected for security reasons when we were
>> working on bifurcated driver.
> 
> The whole idea is that we map all the active RX ring pages into
> userspace from the start.
> 
> And just how Jesper's page pool work will avoid DMA map/unmap,
> it will also avoid changing the userspace mapping of the pages
> as well.
> 
> Thus avoiding the TLB/VM overhead altogether.
> 

I get this but it requires applications to be isolated. The pages from
a queue can not be shared between multiple applications in different
trust domains. And the application has to be cooperative meaning it
can't "look" at data that has not been marked by the stack as OK. In
these schemes we tend to end up with something like virtio/vhost or
af_packet.

Any ACLs/filtering/switching/headers need to be done in hardware or
the application trust boundaries are broken.

If the above can not be met then a copy is needed. What I am trying
to tease out is the above comment along with other statements like
this "can be done with out HW filter features".

.John


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread David Miller
From: John Fastabend 
Date: Tue, 13 Dec 2016 09:43:59 -0800

> What does "zero-copy send packet-pages to the application/socket that
> requested this" mean? At the moment on x86 page-flipping appears to be
> more expensive than memcpy (I can post some data shortly) and shared
> memory was proposed and rejected for security reasons when we were
> working on bifurcated driver.

The whole idea is that we map all the active RX ring pages into
userspace from the start.

And just how Jesper's page pool work will avoid DMA map/unmap,
it will also avoid changing the userspace mapping of the pages
as well.

Thus avoiding the TLB/VM overhead altogether.


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Hannes Frederic Sowa
On 13.12.2016 17:10, Jesper Dangaard Brouer wrote:
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registered with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not support as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

On a side note, this is what Windows does with RIO ("registered I/O").
Maybe you want to look at the API to get some ideas: allocating and
pinning down memory in user space and registering that with sockets to
get zero-copy IO.



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread John Fastabend
On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter  
> wrote:
>> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>>
>>> Hmmm. If you can rely on hardware setup to give you steering and
>>> dedicated access to the RX rings.  In those cases, I guess, the "push"
>>> model could be a more direct API approach.  
>>
>> If the hardware does not support steering then one should be able to
>> provide those services in software.
> 
> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
> 
> My model pre-VMA map all the pages in the RX ring (if zero-copy gets
> enabled, by a single user).  The software step can filter and zero-copy
> send packet-pages to the application/socket that requested this. The

What does "zero-copy send packet-pages to the application/socket that
requested this" mean? At the moment on x86 page-flipping appears to be
more expensive than memcpy (I can post some data shortly) and shared
memory was proposed and rejected for security reasons when we were
working on bifurcated driver.

> disadvantage is all zero-copy application need to share this VMA
> mapping.  This is solved by configuring HW filters into a RX-queue, and
> then only attach your zero-copy application to that queue.
> 
> 
>>> I was shooting for a model that worked without hardware support.
>>> And then transparently benefit from HW support by configuring a HW
>>> filter into a specific RX queue and attaching/using to that queue.  
>>
>> The discussion here is a bit amusing since these issues have been
>> resolved a long time ago with the design of the RDMA subsystem. Zero
>> copy is already in wide use. Memory registration is used to pin down
>> memory areas. Work requests can be filed with the RDMA subsystem that
>> then send and receive packets from the registered memory regions.
>> This is not strictly remote memory access but this is a basic mode of
>> operations supported  by the RDMA subsystem. The mlx5 driver quoted
>> here supports all of that.
> 
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)
> 
> 
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registered with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not support as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
> 

Note rte_flow is in the same family of APIs as the proposed Flow API
that was rejected as well.  The features in Flow API that are not
included in the rte_flow proposal have logical extensions to support
them. In kernel we have 'tc' and multiple vendors support cls_flower
and cls_tc which offer a subset of the functionality in the DPDK
implementation.

Are you suggesting 'tc' is not a proper NIC HW filter API?

>  
>> The RX/TX ring in user space should be an additional mode of
>> operation of the socket layer. Once that is in place the "Remote
>> memory acces" can be trivially implemented on top of that and the
>> ugly RDMA sidecar subsystem can go away.
>  
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
> 
> 
> Appreciate your input, it challenged my thinking.
> 



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Christoph Lameter
On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote:

> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.

Right. But we could fall back to software. Transfer to a kernel buffer and
then move stuff over. Not much of an improvement, but it will make things
work.

> > The discussion here is a bit amusing since these issues have been
> > resolved a long time ago with the design of the RDMA subsystem. Zero
> > copy is already in wide use. Memory registration is used to pin down
> > memory areas. Work requests can be filed with the RDMA subsystem that
> > then send and receive packets from the registered memory regions.
> > This is not strictly remote memory access but this is a basic mode of
> > operations supported  by the RDMA subsystem. The mlx5 driver quoted
> > here supports all of that.
>
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)

Great.

> > What is bad about RDMA is that it is a separate kernel subsystem.
> > What I would like to see is a deeper integration with the network
> > stack so that memory regions can be registered with a network socket
> > and work requests then can be submitted and processed that directly
> > read and write in these regions. The network stack should provide the
> > services that the hardware of the NIC does not support as usual.
>
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

Well doing this would mean adding some features and that also would at
best allow general support for zero copy direct to user space with a
fallback to software if the hardware is missing some feature.

> > The RX/TX ring in user space should be an additional mode of
> > operation of the socket layer. Once that is in place the "Remote
> > memory acces" can be trivially implemented on top of that and the
> > ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).

A work request contains the user space address of the data to be sent
and/or received. The address must be in a registered memory region. This
is different from copying the packet into kernel data structures.

I think this can easily be generalized.  We need support for registering
memory regions, submission of work requests, and the processing of
completion requests.  QP (queue-pair) processing is probably the basis for
the whole scheme that is used in multiple contexts these days.
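
To make the work-request/completion flow concrete, a small user-space
verbs sketch (it assumes an MR, QP and CQ already set up as in the earlier
sketch; error handling is omitted and the busy-poll is only for brevity):

#include <stdint.h>
#include <infiniband/verbs.h>

int example_post_and_poll(struct ibv_qp *qp, struct ibv_cq *cq,
                          struct ibv_mr *mr)
{
    /* The work request points at a user-space address inside the
     * registered memory region. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    struct ibv_wc wc;

    /* Submit: "receive the next packet into this registered region". */
    if (ibv_post_recv(qp, &wr, &bad))
        return -1;

    /* Reap the completion; only now may the application touch or
     * reuse the buffer. */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;

    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}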


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Jesper Dangaard Brouer

On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter  
wrote:
> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
> 
> > Hmmm. If you can rely on hardware setup to give you steering and
> > dedicated access to the RX rings.  In those cases, I guess, the "push"
> > model could be a more direct API approach.  
> 
> If the hardware does not support steering then one should be able to
> provide those services in software.

This is the early demux problem.  With the push-mode of registering
memory, you need hardware steering support, for zero-copy support, as
the software step happens after the DMA engine has written into the memory.

My model pre-VMA-maps all the pages in the RX ring (if zero-copy gets
enabled, by a single user).  The software step can filter and zero-copy
send packet-pages to the application/socket that requested this.  The
disadvantage is that all zero-copy applications need to share this VMA
mapping.  This is solved by configuring HW filters into an RX-queue, and
then only attach your zero-copy application to that queue.


> > I was shooting for a model that worked without hardware support.
> > And then transparently benefit from HW support by configuring a HW
> > filter into a specific RX queue and attaching/using to that queue.  
> 
> The discussion here is a bit amusing since these issues have been
> resolved a long time ago with the design of the RDMA subsystem. Zero
> copy is already in wide use. Memory registration is used to pin down
> memory areas. Work requests can be filed with the RDMA subsystem that
> then send and receive packets from the registered memory regions.
> This is not strictly remote memory access but this is a basic mode of
> operations supported  by the RDMA subsystem. The mlx5 driver quoted
> here supports all of that.

I hear what you are saying.  I will look into a push-model, as it might
be a better solution.
I will read up on RDMA + verbs and learn more about their API model.  I
even plan to write a small sample program to get a feeling for the API,
and maybe we can use that as a baseline for the performance target we
can obtain on the same HW. (Thanks to Björn for already giving me some
pointers here)


> What is bad about RDMA is that it is a separate kernel subsystem.
> What I would like to see is a deeper integration with the network
> stack so that memory regions can be registred with a network socket
> and work requests then can be submitted and processed that directly
> read and write in these regions. The network stack should provide the
> services that the hardware of the NIC does not suppport as usual.

Interesting.  So you even imagine sockets registering memory regions
with the NIC.  If we had a proper NIC HW filter API across the drivers,
to register the steering rule (like ibv_create_flow), this would be
doable, but we don't (DPDK actually has an interesting proposal[1])
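
For comparison, the verbs-style steering registration looks roughly like
this (sketch only; some HW additionally wants the full eth/ipv4 spec chain
in the rule):

#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

struct udp_flow_rule {
        struct ibv_flow_attr         attr;
        struct ibv_flow_spec_tcp_udp udp;
} __attribute__((packed));

static struct ibv_flow *steer_udp_dport(struct ibv_qp *qp, uint16_t dport)
{
        struct udp_flow_rule rule;

        memset(&rule, 0, sizeof(rule));
        rule.attr.type         = IBV_FLOW_ATTR_NORMAL;
        rule.attr.size         = sizeof(rule);
        rule.attr.num_of_specs = 1;
        rule.attr.port         = 1;             /* physical port number */

        rule.udp.type          = IBV_FLOW_SPEC_UDP;
        rule.udp.size          = sizeof(rule.udp);
        rule.udp.val.dst_port  = htons(dport);
        rule.udp.mask.dst_port = 0xffff;        /* match the full port */

        /* Matching packets are steered to the RX queue behind this QP */
        return ibv_create_flow(qp, &rule.attr);
}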

 
> The RX/TX ring in user space should be an additional mode of
> operation of the socket layer. Once that is in place the "Remote
> memory acces" can be trivially implemented on top of that and the
> ugly RDMA sidecar subsystem can go away.
 
I cannot follow that 100%, but I guess you are saying we also need a
more efficient mode of handing over pages/packets to userspace (than
going through the normal socket API calls).


Appreciate your input, it challenged my thinking.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://rawgit.com/6WIND/rte_flow/master/rte_flow.html


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Mike Rapoport
On Mon, Dec 12, 2016 at 06:49:03AM -0800, John Fastabend wrote:
> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> >>
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.
> 
> Interesting this was one of the original ideas behind the macvlan
> offload mode. iirc Vlad also was interested in this.
> 
> I'm guessing this was used because of the ability to push macvlan onto
> its own queue?

Yes, with a queue dedicated to a virtual NIC we only need to ensure that
guest memory is used for RX buffers. 
 
> >>
> >>> Have you considered using "push" model for setting the NIC's RX memory?
> >>
> >> I don't understand what you mean by a "push" model?
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".
> 
> I prefer the ndo op. This matches up well with AF_PACKET model where we
> have "slots" and offload is just a transparent "push" of these "slots"
> to the driver. Below we have a snippet of our proposed API,
> 
> (https://patchwork.ozlabs.org/patch/396714/ note the descriptor mapping
> bits will be dropped)
> 
> + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> + *                                   struct net_device *dev)
> + *   Called to map queue pair range from split_queue_pairs into
> + *   mmap region.
> +
> 
> > +
> > +static int
> > +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> > +{
> > +        struct ixgbe_adapter *adapter = netdev_priv(dev);
> > +        phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> > +        unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> > +        unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> > +        unsigned long dummy_page_phy;
> > +        pgprot_t pre_vm_page_prot;
> > +        unsigned long start;
> > +        unsigned int i;
> > +        int err;
> > +
> > +        /* Poison page backing every VMA page that is not a descriptor page */
> > +        if (!dummy_page_buf) {
> > +                dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> > +                if (!dummy_page_buf)
> > +                        return -ENOMEM;
> > +
> > +                for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> > +                        dummy_page_buf[i] = 0xdeadbeef;
> > +        }
> > +
> > +        dummy_page_phy = virt_to_phys(dummy_page_buf);
> > +        pre_vm_page_prot = vma->vm_page_prot;
> > +        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > +
> > +        /* assume the vm_start is 4K aligned address */
> > +        for (start = vma->vm_start;
> > +             start < vma->vm_end;
> > +             start += PAGE_SIZE_4K) {
> > +                if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> > +                        /* Map the RX descriptor range (uncached) */
> > +                        err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> > +                                              vma->vm_page_prot);
> > +                        if (err)
> > +                                return -EAGAIN;
> > +                } else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> > +                        /* Map the TX descriptor range (uncached) */
> > +                        err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> > +                                              vma->vm_page_prot);
> > +                        if (err)
> > +                                return -EAGAIN;
> > +                } else {
> > +                        unsigned long addr = dummy_page_phy >> PAGE_SHIFT;
> > +
> > +                        err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> > +                                              pre_vm_page_prot);
> > +                        if (err)
> > +                                return -EAGAIN;
> > +                }
> > +        }
> > +        return 0;
> > +}
> > +
> 
> Any thoughts on something like the above? We could push it when net-next
> opens. One piece that fits naturally into vhost/macvtap is that the kicks
> and queue splicing are already there, so there is no need to implement
> them, making the above patch much simpler.

Sorry, but I don't quite follow you here. The vhost does not use vma
mappings, it just sees a bunch of pages pointed to by the vring descriptors...
 
> .John
 
--
Sincerely yours,
Mike.



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Mike Rapoport
On Mon, Dec 12, 2016 at 04:10:26PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 12 Dec 2016 16:14:33 +0200
> Mike Rapoport  wrote:
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> 
> Thanks for explaining that. It seems like a lot of overhead. I have to
> wrap my head around this... so, the hardware NIC is receiving the
> packet/page, in the RX ring, and after converting it to IOVs, it is
> conceptually transmitted into the guest, and then the guest-side have a
> RX-function to handle this packet. Correctly understood?

Almost :)
For the hardware NIC driver, the receive just follows the "normal" path.
It creates an skb for the packet and passes it to the net core RX. Then the
skb is delivered to tap/macvtap. The latter converts the skb to IOVs, and
the IOVs are pushed to the guest address space.

On the guest side, virtio-net sees these IOVs as part of its RX ring; it
creates an skb for the packet and passes the skb to the net core of the
guest.

> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.
> 
> XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
> modify the packet-page data, but I don't think it is needed for your
> use-case.  View XDP (primarily) as an early (demux) filter.
> 
> XDP is missing a feature your need, which is TX packet into another
> net_device (I actually imagine a port mapping table, that point to a
> net_device).  This require a new "TX-raw" NDO that takes a page (+
> offset and length). 
> 
> I imagine, the virtio driver (virtio_net or a new driver?) getting
> extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
>  Whether zero-copy is possible is determined by checking if page
> originates from a page_pool that have enabled zero-copy (and likely
> matching against a "protection domain" id number).
 
That could be quite a few drivers that will need to implement "TX-raw" then
:)
In the general case, the virtual NIC may be connected to the physical network
via a long chain of virtual devices such as bridges, veth and ovs.
Actually, because of that we wanted to concentrate on macvtap...
 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.
> 
> You don't need an XDP filter, if you can make the HW do the early demux
> binding into a queue.  The check for if memory is zero-copy enabled
> would be the same.
> 
> > >   
> > > > Have you considered using "push" model for setting the NIC's RX memory? 
> > > >  
> > > 
> > > I don't understand what you mean by a "push" model?  
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".
> 
> As you might have guessed, I'm not into the "push" model, because this
> means I cannot share the queue with the normal network stack, which I
> believe is possible as outlined (in email and [2]) and can be done
> without HW filter features (like macvlan).

I think I should sleep on it a bit more :)
Probably we can add page_pool "backend" implementation to vhost...

--
Sincerely yours,
Mike. 



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Christoph Lameter
On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:

> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings.  In those cases, I guess, the "push"
> model could be a more direct API approach.

If the hardware does not support steering then one should be able to
provide those services in software.

> I was shooting for a model that worked without hardware support.  And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.

The discussion here is a bit amusing since these issues have been resolved
a long time ago with the design of the RDMA subsystem. Zero copy is
already in wide use. Memory registration is used to pin down memory areas.
Work requests can be filed with the RDMA subsystem, which then sends and
receives packets from the registered memory regions. This is not strictly
remote memory access, but it is a basic mode of operation supported by
the RDMA subsystem. The mlx5 driver quoted here supports all of that.

What is bad about RDMA is that it is a separate kernel subsystem. What I
would like to see is a deeper integration with the network stack, so that
memory regions can be registered with a network socket and work requests
can then be submitted and processed that directly read and write in these
regions. The network stack should provide the services that the hardware
of the NIC does not support, as usual.

The RX/TX ring in user space should be an additional mode of operation of
the socket layer. Once that is in place, the "Remote memory access" can be
trivially implemented on top of that and the ugly RDMA sidecar subsystem
can go away.
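
Purely as a hypothetical illustration of what that could look like from
userspace (SO_REGISTER_MR and struct sock_mr do not exist; they stand in
for whatever API the stack would grow):

#include <sys/socket.h>
#include <stdint.h>
#include <stdlib.h>

struct sock_mr {                        /* hypothetical, does not exist */
        void    *addr;                  /* start of the region to pin */
        size_t   length;                /* region length in bytes */
        uint32_t flags;                 /* e.g. read/write access bits */
};

#define SO_REGISTER_MR  99              /* hypothetical socket option */

static int register_rx_region(int sock, size_t len)
{
        struct sock_mr mr;

        mr.addr = aligned_alloc(4096, len);
        if (!mr.addr)
                return -1;
        mr.length = len;
        mr.flags  = 0;

        /* Work requests submitted later would reference offsets into this
         * region, instead of the data being copied through kernel buffers */
        return setsockopt(sock, SOL_SOCKET, SO_REGISTER_MR, &mr, sizeof(mr));
}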



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Jesper Dangaard Brouer
On Mon, 12 Dec 2016 06:49:03 -0800
John Fastabend  wrote:

> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:  
> >>
> >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport  
> >> wrote:
> >>  
> >>> Hello Jesper,
> >>>
> >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
>  Hi all,
> 
>  This is my design for how to safely handle RX zero-copy in the network
>  stack, by using page_pool[1] and modifying NIC drivers.  Safely means
>  not leaking kernel info in pages mapped to userspace and resilience
>  so a malicious userspace app cannot crash the kernel.
> 
>  Design target
>  =
> 
>  Allow the NIC to function as a normal Linux NIC and be shared in a
>  safe manor, between the kernel network stack and an accelerated
>  userspace application using RX zero-copy delivery.
> 
>  Target is to provide the basis for building RX zero-copy solutions in
>  a memory safe manor.  An efficient communication channel for userspace
>  delivery is out of scope for this document, but OOM considerations are
>  discussed below (`Userspace delivery and OOM`_).
> >>>
> >>> Sorry, if this reply is a bit off-topic.  
> >>
> >> It is very much on topic IMHO :-)
> >>  
> >>> I'm working on implementation of RX zero-copy for virtio and I've 
> >>> dedicated
> >>> some thought about making guest memory available for physical NIC DMAs.
> >>> I believe this is quite related to your page_pool proposal, at least from
> >>> the NIC driver perspective, so I'd like to share some thoughts here.  
> >>
> >> Seems quite related. I'm very interested in cooperating with you! I'm
> >> not very familiar with virtio, and how packets/pages gets channeled
> >> into virtio.  
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> >   
> 
> Great I'm also doing something similar.
> 
> My plan was to embed the zero copy as an AF_PACKET mode and then push
> a AF_PACKET backend into vhost. I'll post a patch later this week.
> 
> >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> >>> using macvtap, and then propagate guest RX memory allocations to the NIC
> >>> using something like new .ndo_set_rx_buffers method.  
> >>
> >> I believe the page_pool API/design aligns with this idea/use-case.
> >>  
> >>> What is your view about interface between the page_pool and the NIC
> >>> drivers?  
> >>
> >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> >> a page_pool per RX queue.  This is done for two reasons (1) performance
> >> and (2) for supporting use-cases where only one single RX-ring queue is
> >> (re)configured to support RX-zero-copy.  There are some associated
> >> extra cost of enabling this mode, thus it makes sense to only enable it
> >> when needed.
> >>
> >> I've not decided how this gets enabled, maybe some new driver NDO.  It
> >> could also happen when a XDP program gets loaded, which request this
> >> feature.
> >>
> >> The macvtap solution is nice and we should support it, but it requires
> >> VM to have their MAC-addr registered on the physical switch.  This
> >> design is about adding flexibility. Registering an XDP eBPF filter
> >> provides the maximum flexibility for matching the destination VM.  
> > 
> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.  
> 
> I don't think XDP has much to do with this code and they should be done
> separately. XDP runs eBPF code on received packets after the DMA engine
> has already placed the packet in memory so its too late in the process.

It does not have to be connected to XDP.  My idea should support RX
zero-copy into normal sockets, without XDP.

My idea was to pre-VMA map the RX ring, when zero-copy is requested,
thus it is not too late in the process.  When frames travel the normal
network stack, this requires the SKB-read-only-page mode (skb-frags).
If the SKB reaches a socket that supports zero-copy, then we can do RX
zero-copy on normal sockets.

 
> The other piece here is enabling XDP in vhost but that is again separate
> IMO.
> 
> Notice that ixgbe supports pushing packets into a macvlan via 'tc'
> traffic steering commands so even though macvlan gets an L2 address it
> doesn't mean it can't use other criteria to steer traffic to it.

This sounds interesting, as it allows much more flexible macvlan
matching, which I like, but it still depends on HW support.

 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Jesper Dangaard Brouer
On Mon, 12 Dec 2016 16:14:33 +0200
Mike Rapoport  wrote:

> On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
> > 
> > On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport  
> > wrote:
> >   
> > > Hello Jesper,
> > > 
> > > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
> > > > Hi all,
> > > > 
> > > > This is my design for how to safely handle RX zero-copy in the network
> > > > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > > > not leaking kernel info in pages mapped to userspace and resilience
> > > > so a malicious userspace app cannot crash the kernel.
> > > > 
> > > > Design target
> > > > =
> > > > 
> > > > Allow the NIC to function as a normal Linux NIC and be shared in a
> > > > safe manor, between the kernel network stack and an accelerated
> > > > userspace application using RX zero-copy delivery.
> > > > 
> > > > Target is to provide the basis for building RX zero-copy solutions in
> > > > a memory safe manor.  An efficient communication channel for userspace
> > > > delivery is out of scope for this document, but OOM considerations are
> > > > discussed below (`Userspace delivery and OOM`_).
> > > 
> > > Sorry, if this reply is a bit off-topic.  
> > 
> > It is very much on topic IMHO :-)
> >   
> > > I'm working on implementation of RX zero-copy for virtio and I've 
> > > dedicated
> > > some thought about making guest memory available for physical NIC DMAs.
> > > I believe this is quite related to your page_pool proposal, at least from
> > > the NIC driver perspective, so I'd like to share some thoughts here.  
> > 
> > Seems quite related. I'm very interested in cooperating with you! I'm
> > not very familiar with virtio, and how packets/pages gets channeled
> > into virtio.  
> 
> They are copied :-)
> Presuming we are dealing only with vhost backend, the received skb
> eventually gets converted to IOVs, which in turn are copied to the guest
> memory. The IOVs point to the guest memory that is allocated by virtio-net
> running in the guest.

Thanks for explaining that. It seems like a lot of overhead. I have to
wrap my head around this... so, the hardware NIC is receiving the
packet/page in the RX ring, and after converting it to IOVs, it is
conceptually transmitted into the guest, and then the guest side has an
RX function to handle this packet. Correctly understood?

 
> > > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> > > using macvtap, and then propagate guest RX memory allocations to the NIC
> > > using something like new .ndo_set_rx_buffers method.  
> > 
> > I believe the page_pool API/design aligns with this idea/use-case.
> >   
> > > What is your view about interface between the page_pool and the NIC
> > > drivers?  
> > 
> > In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> > a page_pool per RX queue.  This is done for two reasons (1) performance
> > and (2) for supporting use-cases where only one single RX-ring queue is
> > (re)configured to support RX-zero-copy.  There are some associated
> > extra cost of enabling this mode, thus it makes sense to only enable it
> > when needed.
> > 
> > I've not decided how this gets enabled, maybe some new driver NDO.  It
> > could also happen when a XDP program gets loaded, which request this
> > feature.
> > 
> > The macvtap solution is nice and we should support it, but it requires
> > VM to have their MAC-addr registered on the physical switch.  This
> > design is about adding flexibility. Registering an XDP eBPF filter
> > provides the maximum flexibility for matching the destination VM.  
> 
> I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> what needs to be done in BPF program to do proper conversion of skb to the
> virtio descriptors.

XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
modify the packet-page data, but I don't think it is needed for your
use-case.  View XDP (primarily) as an early (demux) filter.

XDP is missing a feature you need, which is TX of a packet into another
net_device (I actually imagine a port mapping table that points to a
net_device).  This requires a new "TX-raw" NDO that takes a page (+
offset and length).

I imagine the virtio driver (virtio_net or a new driver?) getting
extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
Whether zero-copy is possible is determined by checking if the page
originates from a page_pool that has enabled zero-copy (and likely
matching against a "protection domain" id number).
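
A rough sketch of the NDO shape I have in mind (ndo_xmit_raw is
hypothetical; it is shown as a standalone ops table so the sketch stays
self-contained, while a real patch would add it to struct net_device_ops):

#include <linux/netdevice.h>
#include <linux/mm_types.h>

struct raw_tx_ops {
        int (*ndo_xmit_raw)(struct net_device *dev, struct page *page,
                            unsigned int offset, unsigned int len);
};

static int xdp_forward_page(const struct raw_tx_ops *ops,
                            struct net_device *dst, struct page *page,
                            unsigned int offset, unsigned int len,
                            bool page_is_zc)
{
        /* Zero-copy is only allowed when the page comes from a page_pool
         * that enabled it (and the "protection domain" matches) */
        if (!page_is_zc)
                return -EOPNOTSUPP;

        return ops->ndo_xmit_raw(dst, page, offset, len);
}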


> We were not considered using XDP yet, so we've decided to limit the initial
> implementation to macvtap because we can ensure correspondence between a
> NIC queue and virtual NIC, which is not the case with more generic tap
> device. It could be that use of XDP will allow for a generic solution for
> virtio case as well.

You don't need an XDP 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread John Fastabend
On 16-12-12 06:14 AM, Mike Rapoport wrote:
> On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
>>
>> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport  
>> wrote:
>>
>>> Hello Jesper,
>>>
>>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
 Hi all,

 This is my design for how to safely handle RX zero-copy in the network
 stack, by using page_pool[1] and modifying NIC drivers.  Safely means
 not leaking kernel info in pages mapped to userspace and resilience
 so a malicious userspace app cannot crash the kernel.

 Design target
 =

 Allow the NIC to function as a normal Linux NIC and be shared in a
 safe manor, between the kernel network stack and an accelerated
 userspace application using RX zero-copy delivery.

 Target is to provide the basis for building RX zero-copy solutions in
 a memory safe manor.  An efficient communication channel for userspace
 delivery is out of scope for this document, but OOM considerations are
 discussed below (`Userspace delivery and OOM`_).  
>>>
>>> Sorry, if this reply is a bit off-topic.
>>
>> It is very much on topic IMHO :-)
>>
>>> I'm working on implementation of RX zero-copy for virtio and I've dedicated
>>> some thought about making guest memory available for physical NIC DMAs.
>>> I believe this is quite related to your page_pool proposal, at least from
>>> the NIC driver perspective, so I'd like to share some thoughts here.
>>
>> Seems quite related. I'm very interested in cooperating with you! I'm
>> not very familiar with virtio, and how packets/pages gets channeled
>> into virtio.
> 
> They are copied :-)
> Presuming we are dealing only with vhost backend, the received skb
> eventually gets converted to IOVs, which in turn are copied to the guest
> memory. The IOVs point to the guest memory that is allocated by virtio-net
> running in the guest.
> 

Great, I'm also doing something similar.

My plan was to embed the zero copy as an AF_PACKET mode and then push
a AF_PACKET backend into vhost. I'll post a patch later this week.

>>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
>>> using macvtap, and then propagate guest RX memory allocations to the NIC
>>> using something like new .ndo_set_rx_buffers method.
>>
>> I believe the page_pool API/design aligns with this idea/use-case.
>>
>>> What is your view about interface between the page_pool and the NIC
>>> drivers?
>>
>> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
>> a page_pool per RX queue.  This is done for two reasons (1) performance
>> and (2) for supporting use-cases where only one single RX-ring queue is
>> (re)configured to support RX-zero-copy.  There are some associated
>> extra cost of enabling this mode, thus it makes sense to only enable it
>> when needed.
>>
>> I've not decided how this gets enabled, maybe some new driver NDO.  It
>> could also happen when a XDP program gets loaded, which request this
>> feature.
>>
>> The macvtap solution is nice and we should support it, but it requires
>> VM to have their MAC-addr registered on the physical switch.  This
>> design is about adding flexibility. Registering an XDP eBPF filter
>> provides the maximum flexibility for matching the destination VM.
> 
> I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> what needs to be done in BPF program to do proper conversion of skb to the
> virtio descriptors.

I don't think XDP has much to do with this code and they should be done
separately. XDP runs eBPF code on received packets after the DMA engine
has already placed the packet in memory, so it's too late in the process.

The other piece here is enabling XDP in vhost but that is again separate
IMO.

Notice that ixgbe supports pushing packets into a macvlan via 'tc'
traffic steering commands, so even though macvlan gets an L2 address it
doesn't mean it can't use other criteria to steer traffic to it.

> 
> We were not considered using XDP yet, so we've decided to limit the initial
> implementation to macvtap because we can ensure correspondence between a
> NIC queue and virtual NIC, which is not the case with more generic tap
> device. It could be that use of XDP will allow for a generic solution for
> virtio case as well.

Interesting, this was one of the original ideas behind the macvlan
offload mode. IIRC Vlad was also interested in this.

I'm guessing this was used because of the ability to push macvlan onto
its own queue?

>  
>>
>>> Have you considered using "push" model for setting the NIC's RX memory?
>>
>> I don't understand what you mean by a "push" model?
> 
> Currently, memory allocation in NIC drivers boils down to alloc_page with
> some wrapping code. I see two possible ways to make NIC use of some
> preallocated pages: either NIC driver will call an API (probably different
> from alloc_page) to obtain that memory, or there 

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Mike Rapoport
On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport  
> wrote:
> 
> > Hello Jesper,
> > 
> > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> > > Hi all,
> > > 
> > > This is my design for how to safely handle RX zero-copy in the network
> > > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > > not leaking kernel info in pages mapped to userspace and resilience
> > > so a malicious userspace app cannot crash the kernel.
> > > 
> > > Design target
> > > =
> > > 
> > > Allow the NIC to function as a normal Linux NIC and be shared in a
> > > safe manor, between the kernel network stack and an accelerated
> > > userspace application using RX zero-copy delivery.
> > > 
> > > Target is to provide the basis for building RX zero-copy solutions in
> > > a memory safe manor.  An efficient communication channel for userspace
> > > delivery is out of scope for this document, but OOM considerations are
> > > discussed below (`Userspace delivery and OOM`_).  
> > 
> > Sorry, if this reply is a bit off-topic.
> 
> It is very much on topic IMHO :-)
> 
> > I'm working on implementation of RX zero-copy for virtio and I've dedicated
> > some thought about making guest memory available for physical NIC DMAs.
> > I believe this is quite related to your page_pool proposal, at least from
> > the NIC driver perspective, so I'd like to share some thoughts here.
> 
> Seems quite related. I'm very interested in cooperating with you! I'm
> not very familiar with virtio, and how packets/pages gets channeled
> into virtio.

They are copied :-)
Presuming we are dealing only with vhost backend, the received skb
eventually gets converted to IOVs, which in turn are copied to the guest
memory. The IOVs point to the guest memory that is allocated by virtio-net
running in the guest.

> > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> > using macvtap, and then propagate guest RX memory allocations to the NIC
> > using something like new .ndo_set_rx_buffers method.
> 
> I believe the page_pool API/design aligns with this idea/use-case.
> 
> > What is your view about interface between the page_pool and the NIC
> > drivers?
> 
> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> a page_pool per RX queue.  This is done for two reasons (1) performance
> and (2) for supporting use-cases where only one single RX-ring queue is
> (re)configured to support RX-zero-copy.  There are some associated
> extra cost of enabling this mode, thus it makes sense to only enable it
> when needed.
> 
> I've not decided how this gets enabled, maybe some new driver NDO.  It
> could also happen when a XDP program gets loaded, which request this
> feature.
> 
> The macvtap solution is nice and we should support it, but it requires
> VM to have their MAC-addr registered on the physical switch.  This
> design is about adding flexibility. Registering an XDP eBPF filter
> provides the maximum flexibility for matching the destination VM.

I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
what needs to be done in BPF program to do proper conversion of skb to the
virtio descriptors.

We had not considered using XDP yet, so we've decided to limit the initial
implementation to macvtap, because we can ensure correspondence between a
NIC queue and a virtual NIC, which is not the case with the more generic
tap device. It could be that use of XDP will allow for a generic solution
for the virtio case as well.
 
> 
> > Have you considered using "push" model for setting the NIC's RX memory?
> 
> I don't understand what you mean by a "push" model?

Currently, memory allocation in NIC drivers boils down to alloc_page with
some wrapping code. I see two possible ways to make the NIC use some
preallocated pages: either the NIC driver will call an API (probably
different from alloc_page) to obtain that memory, or there will be an NDO
API that allows setting the NIC's RX buffers. I named the latter case
"push".
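
To make the distinction concrete, a hypothetical "push" NDO could look
roughly like this (none of these names exist yet; they only illustrate
the split):

#include <linux/netdevice.h>
#include <linux/mm_types.h>

/* Hypothetical "push" NDO: an external entity (e.g. macvtap backed by
 * guest memory) hands a set of pages to a specific RX queue up front,
 * instead of the driver pulling pages from alloc_page()/a page allocator */
struct rx_push_ops {
        int (*ndo_set_rx_buffers)(struct net_device *dev, u16 queue,
                                  struct page **pages, unsigned int nr);
};

static int push_guest_pages(const struct rx_push_ops *ops,
                            struct net_device *dev, u16 queue,
                            struct page **pages, unsigned int nr)
{
        /* The driver fills the RX descriptors of "queue" from these pages */
        return ops->ndo_set_rx_buffers(dev, queue, pages, nr);
}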
 
--
Sincerely yours,
Mike.



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Jesper Dangaard Brouer

On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport  
wrote:

> Hello Jesper,
> 
> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> > Hi all,
> > 
> > This is my design for how to safely handle RX zero-copy in the network
> > stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> > not leaking kernel info in pages mapped to userspace and resilience
> > so a malicious userspace app cannot crash the kernel.
> > 
> > Design target
> > =
> > 
> > Allow the NIC to function as a normal Linux NIC and be shared in a
> > safe manor, between the kernel network stack and an accelerated
> > userspace application using RX zero-copy delivery.
> > 
> > Target is to provide the basis for building RX zero-copy solutions in
> > a memory safe manor.  An efficient communication channel for userspace
> > delivery is out of scope for this document, but OOM considerations are
> > discussed below (`Userspace delivery and OOM`_).  
> 
> Sorry, if this reply is a bit off-topic.

It is very much on topic IMHO :-)

> I'm working on implementation of RX zero-copy for virtio and I've dedicated
> some thought about making guest memory available for physical NIC DMAs.
> I believe this is quite related to your page_pool proposal, at least from
> the NIC driver perspective, so I'd like to share some thoughts here.

Seems quite related. I'm very interested in cooperating with you! I'm
not very familiar with virtio, or with how packets/pages get channeled
into virtio.

> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> using macvtap, and then propagate guest RX memory allocations to the NIC
> using something like new .ndo_set_rx_buffers method.

I believe the page_pool API/design aligns with this idea/use-case.

> What is your view about interface between the page_pool and the NIC
> drivers?

In my Proof-of-Concept implementation, the NIC driver (mlx5) registers
a page_pool per RX queue.  This is done for two reasons: (1) performance,
and (2) to support use-cases where only one single RX-ring queue is
(re)configured to support RX zero-copy.  There is some associated
extra cost of enabling this mode, thus it makes sense to only enable it
when needed.

I've not decided how this gets enabled, maybe some new driver NDO.  It
could also happen when an XDP program gets loaded which requests this
feature.

The macvtap solution is nice and we should support it, but it requires
VMs to have their MAC addresses registered on the physical switch.  This
design is about adding flexibility. Registering an XDP eBPF filter
provides the maximum flexibility for matching the destination VM.


> Have you considered using "push" model for setting the NIC's RX memory?

I don't understand what you mean by a "push" model?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Mike Rapoport
Hello Jesper,

On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:
> Hi all,
> 
> This is my design for how to safely handle RX zero-copy in the network
> stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> not leaking kernel info in pages mapped to userspace and resilience
> so a malicious userspace app cannot crash the kernel.
> 
> Design target
> =
> 
> Allow the NIC to function as a normal Linux NIC and be shared in a
> safe manor, between the kernel network stack and an accelerated
> userspace application using RX zero-copy delivery.
> 
> Target is to provide the basis for building RX zero-copy solutions in
> a memory safe manor.  An efficient communication channel for userspace
> delivery is out of scope for this document, but OOM considerations are
> discussed below (`Userspace delivery and OOM`_).

Sorry, if this reply is a bit off-topic.

I'm working on an implementation of RX zero-copy for virtio and I've given
some thought to making guest memory available for physical NIC DMAs.
I believe this is quite related to your page_pool proposal, at least from
the NIC driver perspective, so I'd like to share some thoughts here.
The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
using macvtap, and then propagate guest RX memory allocations to the NIC
using something like a new .ndo_set_rx_buffers method.

What is your view about interface between the page_pool and the NIC
drivers?
Have you considered using "push" model for setting the NIC's RX memory?

> 
> --
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> Above document is taken at GitHub commit 47fa7c844f48fab8b
>  https://github.com/netoptimizer/prototype-kernel/commit/47fa7c844f48fab8b
> 

--
Sincerely yours,
Mike.



Designing a safe RX-zero-copy Memory Model for Networking

2016-12-05 Thread Jesper Dangaard Brouer
Hi all,

This is my design for how to safely handle RX zero-copy in the network
stack, by using page_pool[1] and modifying NIC drivers.  Safely means
not leaking kernel info in pages mapped to userspace and resilience
so a malicious userspace app cannot crash the kernel.

It is only a design, and thus the purpose is for you to find any holes
in this design ;-)  Below text is also available as html see[2].

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/design.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool changes the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
  requests zero-copy RX, the driver must use a specific
  SKB allocation mode and might have to reconfigure the
  RX-ring.

Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manner between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory safe manner.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full advantage of the page_pool, the drivers must actually
support handling both options, depending on the configuration state of
the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly
straightforward.  The SKB->data (and skb->head) can point directly to the
page data, adjusting the offset according to the driver's headroom (for
adding headers) and setting the length according to the DMA descriptor info.

The page/data need to be writable, because the network stack needs to
adjust headers (like Time-To-Live and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also has an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  It would be bad, from a zero-copy point of view, to leak this
kind of information.
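
A minimal sketch of the writable-page SKB setup, using the existing
build_skb() helper (parameter names are placeholders for what the driver
reads from its DMA descriptor and configuration)::

  #include <linux/skbuff.h>

  static struct sk_buff *rx_build_writable_skb(void *page_va,
                                               unsigned int headroom,
                                               unsigned int pkt_len,
                                               unsigned int truesize)
  {
          struct sk_buff *skb;

          /* build_skb() reuses the page memory for both the packet data
           * and the trailing struct skb_shared_info */
          skb = build_skb(page_va, truesize);
          if (!skb)
                  return NULL;

          skb_reserve(skb, headroom);     /* DMA wrote at page_va + headroom */
          skb_put(skb, pkt_len);          /* length from the DMA descriptor */
          return skb;
  }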

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to the skb_shared_info described above.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of skb_shared_info struct in writable-area.

4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.
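
A minimal sketch of these four steps (error handling trimmed, names are
placeholders)::

  #include <linux/skbuff.h>
  #include <linux/mm.h>

  static struct sk_buff *rx_build_readonly_skb(struct napi_struct *napi,
                                               struct page *page,
                                               unsigned int offset,
                                               unsigned int pkt_len,
                                               unsigned int hlen,
                                               unsigned int truesize)
  {
          void *va = page_address(page) + offset;
          struct sk_buff *skb;

          /* 1) new writable area for the headers (plus skb_shared_info,
           *    which napi_alloc_skb() already clears: step 3) */
          skb = napi_alloc_skb(napi, hlen);
          if (!skb)
                  return NULL;

          /* 2) copy only the packet headers into the writable head */
          memcpy(__skb_put(skb, hlen), va, hlen);

          /* 4) reference the remaining payload read-only in the page,
           *    with the page_offset adjusted past the copied headers */
          skb_add_rx_frag(skb, 0, page, offset + hlen, pkt_len - hlen,
                          truesize);
          return skb;
  }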

It is useful (later) that the network stack has this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but bad from a
performance perspective.


NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
involves costly mapping and unmapping operations in the address space
of the userspace process.  Plus, for doing this safely, the page memory
needs to be cleared before use, to avoid leaking kernel information to
userspace, which is also a costly operation.  The page_pool's base
"class" of optimization is moving these kinds of operations out of the
fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool has been configured for zero-copy into
userspace, can packets still be allowed to travel the normal stack?

Yes, this should be possible, because the driver can use the
SKB-read-only mode, which avoids polluting the page data with
kernel-side sensitive data.  This implies that when a driver RX-queue
switches its page_pool to RX-zero-copy mode, it MUST also switch to
SKB-read-only mode (for normal stack delivery on this RXq).

XDP can be used for controlling which pages get RX zero-copied
to userspace.  The page is still writable for the XDP program, but
read-only for normal stack delivery.
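
As an illustration, an XDP demux filter selecting a single UDP flow could
look like the sketch below.  The XDP_ZC_USERSPACE verdict is a placeholder,
not an existing XDP action; it stands in for whatever mechanism would hand
the matched page to the zero-copy consumer::

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/in.h>
  #include <linux/ip.h>
  #include <linux/udp.h>
  #include <bpf/bpf_endian.h>
  #include <bpf/bpf_helpers.h>

  #define XDP_ZC_USERSPACE XDP_PASS       /* placeholder verdict, see text */

  SEC("xdp")
  int rx_zc_demux(struct xdp_md *ctx)
  {
          void *data     = (void *)(long)ctx->data;
          void *data_end = (void *)(long)ctx->data_end;
          struct ethhdr *eth = data;
          struct iphdr *iph;
          struct udphdr *udp;

          if ((void *)(eth + 1) > data_end ||
              eth->h_proto != bpf_htons(ETH_P_IP))
                  return XDP_PASS;

          iph = (void *)(eth + 1);
          if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_UDP)
                  return XDP_PASS;

          udp = (void *)(iph + 1);        /* ignores IP options for brevity */
          if ((void *)(udp + 1) > data_end)
                  return XDP_PASS;

          /* The flow that requested zero-copy delivery (example port) */
          if (udp->dest == bpf_htons(4242))
                  return XDP_ZC_USERSPACE;

          return XDP_PASS;                /* normal stack, read-only SKB */
  }

  char _license[] SEC("license") = "GPL";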