Re: Designing a safe RX-zero-copy Memory Model for Networking
On Thu, 15 Dec 2016, Jesper Dangaard Brouer wrote: > > It sounds like Christoph's RDMA approach might be the way to go. > > I'm getting more and more fond of Christoph's RDMA approach. I do > think we will end-up with something close to that approach. I just > wanted to get review on my idea first. > > IMHO the major blocker for the RDMA approach is not HW filters > themselves, but a common API that applications can call to register > what goes into the HW queues in the driver. I suspect it will be a > long project agreeing between vendors. And agreeing on semantics. Some of the methods from the RDMA subsystem (like queue pairs, the various queues, etc.) could be extracted and used here. Multiple vendors already support these features, and some devices operate both in an RDMA and a network stack mode. Having that all supported by the network stack would reduce overhead for those vendors. Multiple new vendors are coming up in the RDMA subsystem because the regular network stack does not have the right performance for high-speed networking. I would rather see them have a way to get that functionality from the regular network stack. Please add some extensions so that RDMA-style I/O can be made to work. Even the hardware of the new NICs is already prepared to work with the data structures of the RDMA subsystem. That provides an area of standardization that we could hook into, but do so properly and in a nice way in the context of mainstream network support.
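For readers who have not used the verbs API, the "queue pairs and memory registration" model referred to above maps to a userspace flow roughly like the sketch below (libibverbs; device selection, sizes and error handling are trimmed or arbitrary, and a raw-packet QP type is chosen only for illustration):

#include <stdlib.h>
#include <infiniband/verbs.h>

/* Sketch: open an RDMA-capable device, register (pin) a user buffer as a
 * Memory Region, and create a Queue Pair to post work requests to.  This
 * is the "registered memory + queue pair" model the RDMA subsystem
 * already implements.  Error handling and device selection are omitted.
 */
int setup_verbs_sketch(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

	void *buf = aligned_alloc(4096, 1 << 20);
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, 1 << 20,
				       IBV_ACCESS_LOCAL_WRITE);

	struct ibv_qp_init_attr qpia = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap = {
			.max_send_wr  = 64,
			.max_recv_wr  = 64,
			.max_send_sge = 1,
			.max_recv_sge = 1,
		},
		.qp_type = IBV_QPT_RAW_PACKET,	/* Ethernet-level QP */
	};
	struct ibv_qp *qp = ibv_create_qp(pd, &qpia);

	return (mr && qp) ? 0 : -1;
}

The property relevant to this thread is that the buffer is pinned and known to the hardware up front, so the per-packet path involves no allocation or mapping work.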
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Thu, Dec 15, 2016 at 12:28 AM, Jesper Dangaard Brouerwrote: > On Wed, 14 Dec 2016 14:45:00 -0800 > Alexander Duyck wrote: > >> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer >> wrote: >> > On Wed, 14 Dec 2016 08:45:08 -0800 >> > Alexander Duyck wrote: >> > >> >> I agree. This is a no-go from the performance perspective as well. >> >> At a minimum you would have to be zeroing out the page between uses to >> >> avoid leaking data, and that assumes that the program we are sending >> >> the pages to is slightly well behaved. If we think zeroing out an >> >> sk_buff is expensive wait until we are trying to do an entire 4K page. >> > >> > Again, yes the page will be zero'ed out, but only when entering the >> > page_pool. Because they are recycled they are not cleared on every use. >> > Thus, performance does not suffer. >> >> So you are talking about recycling, but not clearing the page when it >> is recycled. That right there is my problem with this. It is fine if >> you assume the pages are used by the application only, but you are >> talking about using them for both the application and for the regular >> network path. You can't do that. If you are recycling you will have >> to clear the page every time you put it back onto the Rx ring, >> otherwise you can leak the recycled memory into user space and end up >> with a user space program being able to snoop data out of the skb. >> >> > Besides clearing large mem area is not as bad as clearing small. >> > Clearing an entire page does cost something, as mentioned before 143 >> > cycles, which is 28 bytes-per-cycle (4096/143). And clearing 256 bytes >> > cost 36 cycles which is only 7 bytes-per-cycle (256/36). >> >> What I am saying is that you are going to be clearing the 4K blocks >> each time they are recycled. You can't have the pages shared between >> user-space and the network stack unless you have true isolation. If >> you are allowing network stack pages to be recycled back into the >> user-space application you open up all sorts of leaks where the >> application can snoop into data it shouldn't have access to. > > See later, the "Read-only packet page" mode should provide a mode where > the netstack doesn't write into the page, and thus cannot leak kernel > data. (CAP_NET_ADMIN already give it access to other applications data.) I think you are kind of missing the point. The device is writing to the page on the kernel's behalf. Therefore the page isn't "Read-only" and you have an issue since you are talking about sharing a ring between kernel and userspace. >> >> I think we are stuck with having to use a HW filter to split off >> >> application traffic to a specific ring, and then having to share the >> >> memory between the application and the kernel on that ring only. Any >> >> other approach just opens us up to all sorts of security concerns >> >> since it would be possible for the application to try to read and >> >> possibly write any data it wants into the buffers. >> > >> > This is why I wrote a document[1], trying to outline how this is possible, >> > going through all the combinations, and asking the community to find >> > faults in my idea. Inlining it again, as nobody really replied on the >> > content of the doc. 
>> > >> > - >> > Best regards, >> > Jesper Dangaard Brouer >> > MSc.CS, Principal Kernel Engineer at Red Hat >> > LinkedIn: http://www.linkedin.com/in/brouer >> > >> > [1] >> > https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html >> > >> > === >> > Memory Model for Networking >> > === >> > >> > This design describes how the page_pool change the memory model for >> > networking in the NIC (Network Interface Card) drivers. >> > >> > .. Note:: The catch for driver developers is that, once an application >> > request zero-copy RX, then the driver must use a specific >> > SKB allocation mode and might have to reconfigure the >> > RX-ring. >> > >> > >> > Design target >> > = >> > >> > Allow the NIC to function as a normal Linux NIC and be shared in a >> > safe manor, between the kernel network stack and an accelerated >> > userspace application using RX zero-copy delivery. >> > >> > Target is to provide the basis for building RX zero-copy solutions in >> > a memory safe manor. An efficient communication channel for userspace >> > delivery is out of scope for this document, but OOM considerations are >> > discussed below (`Userspace delivery and OOM`_). >> > >> > Background >> > == >> > >> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure >> > for network packets in the Linux Kernel network stack. It is a fairly >> > complex object and can be constructed in several ways. >> > >> > From a memory perspective there are two ways depending on >> >
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016 14:45:00 -0800 Alexander Duyckwrote: > On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer > wrote: > > On Wed, 14 Dec 2016 08:45:08 -0800 > > Alexander Duyck wrote: > > > >> I agree. This is a no-go from the performance perspective as well. > >> At a minimum you would have to be zeroing out the page between uses to > >> avoid leaking data, and that assumes that the program we are sending > >> the pages to is slightly well behaved. If we think zeroing out an > >> sk_buff is expensive wait until we are trying to do an entire 4K page. > > > > Again, yes the page will be zero'ed out, but only when entering the > > page_pool. Because they are recycled they are not cleared on every use. > > Thus, performance does not suffer. > > So you are talking about recycling, but not clearing the page when it > is recycled. That right there is my problem with this. It is fine if > you assume the pages are used by the application only, but you are > talking about using them for both the application and for the regular > network path. You can't do that. If you are recycling you will have > to clear the page every time you put it back onto the Rx ring, > otherwise you can leak the recycled memory into user space and end up > with a user space program being able to snoop data out of the skb. > > > Besides clearing large mem area is not as bad as clearing small. > > Clearing an entire page does cost something, as mentioned before 143 > > cycles, which is 28 bytes-per-cycle (4096/143). And clearing 256 bytes > > cost 36 cycles which is only 7 bytes-per-cycle (256/36). > > What I am saying is that you are going to be clearing the 4K blocks > each time they are recycled. You can't have the pages shared between > user-space and the network stack unless you have true isolation. If > you are allowing network stack pages to be recycled back into the > user-space application you open up all sorts of leaks where the > application can snoop into data it shouldn't have access to. See later, the "Read-only packet page" mode should provide a mode where the netstack doesn't write into the page, and thus cannot leak kernel data. (CAP_NET_ADMIN already give it access to other applications data.) > >> I think we are stuck with having to use a HW filter to split off > >> application traffic to a specific ring, and then having to share the > >> memory between the application and the kernel on that ring only. Any > >> other approach just opens us up to all sorts of security concerns > >> since it would be possible for the application to try to read and > >> possibly write any data it wants into the buffers. > > > > This is why I wrote a document[1], trying to outline how this is possible, > > going through all the combinations, and asking the community to find > > faults in my idea. Inlining it again, as nobody really replied on the > > content of the doc. > > > > - > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > LinkedIn: http://www.linkedin.com/in/brouer > > > > [1] > > https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html > > > > === > > Memory Model for Networking > > === > > > > This design describes how the page_pool change the memory model for > > networking in the NIC (Network Interface Card) drivers. > > > > .. Note:: The catch for driver developers is that, once an application > > request zero-copy RX, then the driver must use a specific > > SKB allocation mode and might have to reconfigure the > > RX-ring. 
> > > > > > Design target > > = > > > > Allow the NIC to function as a normal Linux NIC and be shared in a > > safe manor, between the kernel network stack and an accelerated > > userspace application using RX zero-copy delivery. > > > > Target is to provide the basis for building RX zero-copy solutions in > > a memory safe manor. An efficient communication channel for userspace > > delivery is out of scope for this document, but OOM considerations are > > discussed below (`Userspace delivery and OOM`_). > > > > Background > > == > > > > The SKB or ``struct sk_buff`` is the fundamental meta-data structure > > for network packets in the Linux Kernel network stack. It is a fairly > > complex object and can be constructed in several ways. > > > > From a memory perspective there are two ways depending on > > RX-buffer/page state: > > > > 1) Writable packet page > > 2) Read-only packet page > > > > To take full potential of the page_pool, the drivers must actually > > support handling both options depending on the configuration state of > > the page_pool. > > > > Writable packet page > > > > > > When the RX packet page is writable, the SKB setup is fairly straight > > forward. The SKB->data (and
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouerwrote: > On Wed, 14 Dec 2016 08:45:08 -0800 > Alexander Duyck wrote: > >> I agree. This is a no-go from the performance perspective as well. >> At a minimum you would have to be zeroing out the page between uses to >> avoid leaking data, and that assumes that the program we are sending >> the pages to is slightly well behaved. If we think zeroing out an >> sk_buff is expensive wait until we are trying to do an entire 4K page. > > Again, yes the page will be zero'ed out, but only when entering the > page_pool. Because they are recycled they are not cleared on every use. > Thus, performance does not suffer. So you are talking about recycling, but not clearing the page when it is recycled. That right there is my problem with this. It is fine if you assume the pages are used by the application only, but you are talking about using them for both the application and for the regular network path. You can't do that. If you are recycling you will have to clear the page every time you put it back onto the Rx ring, otherwise you can leak the recycled memory into user space and end up with a user space program being able to snoop data out of the skb. > Besides clearing large mem area is not as bad as clearing small. > Clearing an entire page does cost something, as mentioned before 143 > cycles, which is 28 bytes-per-cycle (4096/143). And clearing 256 bytes > cost 36 cycles which is only 7 bytes-per-cycle (256/36). What I am saying is that you are going to be clearing the 4K blocks each time they are recycled. You can't have the pages shared between user-space and the network stack unless you have true isolation. If you are allowing network stack pages to be recycled back into the user-space application you open up all sorts of leaks where the application can snoop into data it shouldn't have access to. >> I think we are stuck with having to use a HW filter to split off >> application traffic to a specific ring, and then having to share the >> memory between the application and the kernel on that ring only. Any >> other approach just opens us up to all sorts of security concerns >> since it would be possible for the application to try to read and >> possibly write any data it wants into the buffers. > > This is why I wrote a document[1], trying to outline how this is possible, > going through all the combinations, and asking the community to find > faults in my idea. Inlining it again, as nobody really replied on the > content of the doc. > > - > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer > > [1] > https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html > > === > Memory Model for Networking > === > > This design describes how the page_pool change the memory model for > networking in the NIC (Network Interface Card) drivers. > > .. Note:: The catch for driver developers is that, once an application > request zero-copy RX, then the driver must use a specific > SKB allocation mode and might have to reconfigure the > RX-ring. > > > Design target > = > > Allow the NIC to function as a normal Linux NIC and be shared in a > safe manor, between the kernel network stack and an accelerated > userspace application using RX zero-copy delivery. > > Target is to provide the basis for building RX zero-copy solutions in > a memory safe manor. 
An efficient communication channel for userspace > delivery is out of scope for this document, but OOM considerations are > discussed below (`Userspace delivery and OOM`_). > > Background > == > > The SKB or ``struct sk_buff`` is the fundamental meta-data structure > for network packets in the Linux Kernel network stack. It is a fairly > complex object and can be constructed in several ways. > > From a memory perspective there are two ways depending on > RX-buffer/page state: > > 1) Writable packet page > 2) Read-only packet page > > To take full potential of the page_pool, the drivers must actually > support handling both options depending on the configuration state of > the page_pool. > > Writable packet page > > > When the RX packet page is writable, the SKB setup is fairly straight > forward. The SKB->data (and skb->head) can point directly to the page > data, adjusting the offset according to drivers headroom (for adding > headers) and setting the length according to the DMA descriptor info. > > The page/data need to be writable, because the network stack need to > adjust headers (like TimeToLive and checksum) or even add or remove > headers for encapsulation purposes. > > A subtle catch, which also requires a writable page, is that the SKB > also have an accompanying "shared info" data-structure ``struct >
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016 08:45:08 -0800 Alexander Duyckwrote: > I agree. This is a no-go from the performance perspective as well. > At a minimum you would have to be zeroing out the page between uses to > avoid leaking data, and that assumes that the program we are sending > the pages to is slightly well behaved. If we think zeroing out an > sk_buff is expensive wait until we are trying to do an entire 4K page. Again, yes the page will be zero'ed out, but only when entering the page_pool. Because they are recycled they are not cleared on every use. Thus, performance does not suffer. Besides clearing large mem area is not as bad as clearing small. Clearing an entire page does cost something, as mentioned before 143 cycles, which is 28 bytes-per-cycle (4096/143). And clearing 256 bytes cost 36 cycles which is only 7 bytes-per-cycle (256/36). > I think we are stuck with having to use a HW filter to split off > application traffic to a specific ring, and then having to share the > memory between the application and the kernel on that ring only. Any > other approach just opens us up to all sorts of security concerns > since it would be possible for the application to try to read and > possibly write any data it wants into the buffers. This is why I wrote a document[1], trying to outline how this is possible, going through all the combinations, and asking the community to find faults in my idea. Inlining it again, as nobody really replied on the content of the doc. - Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html === Memory Model for Networking === This design describes how the page_pool change the memory model for networking in the NIC (Network Interface Card) drivers. .. Note:: The catch for driver developers is that, once an application request zero-copy RX, then the driver must use a specific SKB allocation mode and might have to reconfigure the RX-ring. Design target = Allow the NIC to function as a normal Linux NIC and be shared in a safe manor, between the kernel network stack and an accelerated userspace application using RX zero-copy delivery. Target is to provide the basis for building RX zero-copy solutions in a memory safe manor. An efficient communication channel for userspace delivery is out of scope for this document, but OOM considerations are discussed below (`Userspace delivery and OOM`_). Background == The SKB or ``struct sk_buff`` is the fundamental meta-data structure for network packets in the Linux Kernel network stack. It is a fairly complex object and can be constructed in several ways. >From a memory perspective there are two ways depending on RX-buffer/page state: 1) Writable packet page 2) Read-only packet page To take full potential of the page_pool, the drivers must actually support handling both options depending on the configuration state of the page_pool. Writable packet page When the RX packet page is writable, the SKB setup is fairly straight forward. The SKB->data (and skb->head) can point directly to the page data, adjusting the offset according to drivers headroom (for adding headers) and setting the length according to the DMA descriptor info. The page/data need to be writable, because the network stack need to adjust headers (like TimeToLive and checksum) or even add or remove headers for encapsulation purposes. 
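As a point of reference, the "writable packet page" setup described in the document corresponds roughly to this pattern in a driver RX path (a sketch only; the headroom handling and the rx_page/rx_len parameters are placeholders, not taken from any particular driver):

#include <linux/mm.h>
#include <linux/skbuff.h>

/* Sketch of the "writable packet page" case: skb->data points straight
 * into the RX page, so the stack can rewrite headers in place.  Assumes
 * the device DMA'ed the frame at an offset of NET_SKB_PAD into the page.
 */
static struct sk_buff *rx_build_writable_skb(struct page *rx_page,
					     unsigned int rx_len)
{
	void *va = page_address(rx_page);
	struct sk_buff *skb;

	/* build_skb() places skb_shared_info at the end of the buffer,
	 * which is one reason the page must be writable.
	 */
	skb = build_skb(va, PAGE_SIZE);
	if (!skb)
		return NULL;

	skb_reserve(skb, NET_SKB_PAD);	/* driver headroom */
	skb_put(skb, rx_len);		/* length from the DMA descriptor */
	return skb;
}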
A subtle catch, which also requires a writable page, is that the SKB also have an accompanying "shared info" data-structure ``struct skb_shared_info``. This "skb_shared_info" is written into the skb->data memory area at the end (skb->end) of the (header) data. The skb_shared_info contains semi-sensitive information, like kernel memory pointers to other pages (which might be pointers to more packet data). This would be bad from a zero-copy point of view to leak this kind of information. Read-only packet page - When the RX packet page is read-only, the construction of the SKB is significantly more complicated and even involves one more memory allocation. 1) Allocate a new separate writable memory area, and point skb->data here. This is needed due to (above described) skb_shared_info. 2) Memcpy packet headers into this (skb->data) area. 3) Clear part of skb_shared_info struct in writable-area. 4) Setup pointer to packet-data in the page (in skb_shared_info->frags) and adjust the page_offset to be past the headers just copied. It is useful (later) that the network stack have this notion that part of the packet and a page can be read-only. This implies that the kernel will not "pollute" this memory with any sensitive information. This is good from a zero-copy point of view, but
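The read-only construction steps listed above correspond roughly to the following sketch (RX_HDR_COPY and the function parameters are placeholders; step 3 is handled by the SKB allocator, which clears skb_shared_info in the newly allocated writable area):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

#define RX_HDR_COPY 128	/* bytes of headers copied to the writable area */

/* Sketch of the "read-only packet page" case: headers go into a small
 * writable allocation, the payload stays in the read-only page as a frag.
 */
static struct sk_buff *rx_build_readonly_skb(struct napi_struct *napi,
					     struct page *rx_page,
					     unsigned int offset,
					     unsigned int rx_len)
{
	void *va = page_address(rx_page) + offset;
	unsigned int hdr = min_t(unsigned int, rx_len, RX_HDR_COPY);
	struct sk_buff *skb;

	skb = napi_alloc_skb(napi, RX_HDR_COPY);	/* step 1 (+ step 3) */
	if (!skb)
		return NULL;

	memcpy(skb_put(skb, hdr), va, hdr);		/* step 2 */

	if (rx_len > hdr)				/* step 4 */
		skb_add_rx_frag(skb, 0, rx_page, offset + hdr,
				rx_len - hdr, PAGE_SIZE);
	return skb;
}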
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016, Hannes Frederic Sowa wrote: > Wouldn't changing of the pages cause expensive TLB flushes? Yes, so you would only want that feature if it's realized at the page-table level for debugging issues. Once you have memory registered with the hardware device, the device itself could also perform snooping to detect that data was changed and thus abort the operation.
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016 08:32:10 -0800 John Fastabendwrote: > On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote: > > On Tue, 13 Dec 2016 12:08:21 -0800 > > John Fastabend wrote: > > > >> On 16-12-13 11:53 AM, David Miller wrote: > >>> From: John Fastabend > >>> Date: Tue, 13 Dec 2016 09:43:59 -0800 > >>> > What does "zero-copy send packet-pages to the application/socket that > requested this" mean? At the moment on x86 page-flipping appears to be > more expensive than memcpy (I can post some data shortly) and shared > memory was proposed and rejected for security reasons when we were > working on bifurcated driver. > >>> > >>> The whole idea is that we map all the active RX ring pages into > >>> userspace from the start. > >>> > >>> And just how Jesper's page pool work will avoid DMA map/unmap, > >>> it will also avoid changing the userspace mapping of the pages > >>> as well. > >>> > >>> Thus avoiding the TLB/VM overhead altogether. > >>> > > > > Exactly. It is worth mentioning that pages entering the page pool need > > to be cleared (measured cost 143 cycles), in order to not leak any > > kernel info. The primary focus of this design is to make sure not to > > leak kernel info to userspace, but with an "exclusive" mode also > > support isolation between applications. > > > > > >> I get this but it requires applications to be isolated. The pages from > >> a queue can not be shared between multiple applications in different > >> trust domains. And the application has to be cooperative meaning it > >> can't "look" at data that has not been marked by the stack as OK. In > >> these schemes we tend to end up with something like virtio/vhost or > >> af_packet. > > > > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first > > two would require CAP_NET_ADMIN privileges. All modes have a trust > > domain id, that need to match e.g. when page reach the socket. > > Even mode 3 should required cap_net_admin we don't want userspace to > grab queues off the nic without it IMO. Good point. > > > > Mode-1 "Shared": Application choose lowest isolation level, allowing > > multiple application to mmap VMA area. > > My only point here is applications can read each others data and all > applications need to cooperate for example one app could try to write > continuously to read only pages causing faults and what not. This is > all non standard and doesn't play well with cgroups and "normal" > applications. It requires a new orchestration model. > > I'm a bit skeptical of the use case but I know of a handful of reasons > to use this model. Maybe take a look at the ivshmem implementation in > DPDK. > > Also this still requires a hardware filter to push "application" traffic > onto reserved queues/pages as far as I can tell. > > > > > Mode-2 "Single-user": Application request it want to be the only user > > of the RX queue. This blocks other application to mmap VMA area. > > > > Assuming data is read-only sharing with the stack is possibly OK :/. I > guess you would need to pools of memory for data and skb so you don't > leak skb into user space. Yes, as describe in orig email and here[1]: "once an application request zero-copy RX, then the driver must use a specific SKB allocation mode and might have to reconfigure the RX-ring." The SKB allocation mode is "read-only packet page", which is the current default mode (also desc in document[1]) of using skb-frags. [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html > The devils in the details here. 
There are lots of hooks in the kernel > that can for example push the packet with a 'redirect' tc action for > example. And letting an app "read" data or impact performance of an > unrelated application is wrong IMO. Stacked devices also provide another > set of details that are a bit difficult to track down see all the > hardware offload efforts. > > I assume all these concerns are shared between mode-1 and mode-2 > > > Mode-3 "Exclusive": Application request to own RX queue. Packets are > > no longer allowed for normal netstack delivery. > > > > I have patches for this mode already but haven't pushed them due to > an alternative solution using VFIO. Interesting. > > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are > > still allowed to travel netstack and thus can contain packet data from > > other normal applications. This is part of the design, to share the > > NIC between netstack and an accelerated userspace application using RX > > zero-copy delivery. > > > > I don't think this is acceptable to be honest. Letting an application > potentially read/impact other arbitrary applications on the system > seems like a non-starter even with CAP_NET_ADMIN. At least this was > the conclusion from bifurcated driver work some time ago. I
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 14.12.2016 20:43, Christoph Lameter wrote: > On Wed, 14 Dec 2016, David Laight wrote: > >> If the kernel is doing ANY validation on the frames it must copy the >> data to memory the application cannot modify before doing the validation. >> Otherwise the application could change the data afterwards. > > The application is not allowed to change the data after a work request has > been submitted to send the frame. Changes are possible after the > completion request has been received. > > The kernel can enforce that by making the frame(s) readonly and thus > getting a page fault if the app would do such a thing. As far as I remember right now, if you gift with vmsplice the memory over a pipe to a tcp socket, you can in fact change the user data while the data is in transmit. So you should not touch the memory region until you received a SOF_TIMESTAMPING_TX_ACK error message in your sockets error queue or stuff might break horribly. I don't think we have a proper event for UDP that fires after we know the data left the hardware. In my opinion this is still fine within the kernel protection limits. E.g. due to scatter gather I/O you don't get access to the TCP header nor UDP header and thus can't e.g. spoof or modify the header or administration policies, albeit TOCTTOU races with netfilter which matches inside the TCP/UDP packets are very well possible on transmit. Wouldn't changing of the pages cause expensive TLB flushes? Bye, Hannes
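For the transmit-side point above, a userspace sketch of waiting for the ACK notification on a connected TCP socket, using the existing SO_TIMESTAMPING error-queue machinery, could look like this (error handling trimmed; the flags would normally be set once, before sending):

#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/net_tstamp.h>
#include <linux/errqueue.h>

/* Sketch: ask for a notification when the peer has ACKed the data, then
 * poll the socket error queue for it.  Only after this fires is it safe
 * to reuse pages that were gifted to a TCP socket (e.g. via vmsplice).
 */
static int wait_for_tx_ack(int fd)
{
	unsigned int flags = SOF_TIMESTAMPING_TX_ACK |
			     SOF_TIMESTAMPING_SOFTWARE |
			     SOF_TIMESTAMPING_OPT_ID;
	char control[256];
	struct msghdr msg = { 0 };
	struct cmsghdr *cm;

	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

	for (;;) {
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
			if (errno == EAGAIN)
				continue;	/* or poll() for POLLERR */
			return -1;
		}
		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			if (cm->cmsg_level == IPPROTO_IP &&	/* SOL_IP */
			    cm->cmsg_type == IP_RECVERR) {
				struct sock_extended_err *serr =
					(struct sock_extended_err *)CMSG_DATA(cm);

				if (serr->ee_origin == SO_EE_ORIGIN_TIMESTAMPING &&
				    serr->ee_info == SCM_TSTAMP_ACK)
					return 0;	/* safe to reuse buffer */
			}
		}
	}
}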
RE: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016, David Laight wrote: > If the kernel is doing ANY validation on the frames it must copy the > data to memory the application cannot modify before doing the validation. > Otherwise the application could change the data afterwards. The application is not allowed to change the data after a work request has been submitted to send the frame. Changes are possible after the completion request has been received. The kernel can enforce that by making the frame(s) readonly and thus getting a page fault if the app would do such a thing.
RE: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter > Sent: 14 December 2016 17:00 > On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote: > > > > Interesting. So you even imagine sockets registering memory regions > > > with the NIC. If we had a proper NIC HW filter API across the drivers, > > > to register the steering rule (like ibv_create_flow), this would be > > > doable, but we don't (DPDK actually have an interesting proposal[1]) > > > > On a side note, this is what windows does with RIO ("registered I/O"). > > Maybe you want to look at the API to get some ideas: allocating and > > pinning down memory in user space and registering that with sockets to > > get zero-copy IO. > > Yup that is also what I think. Regarding the memory registration and flow > steering for user space RX/TX ring please look at the qpair model > implemented by the RDMA subsystem in the kernel. The memory semantics are > clearly established there and have been in use for more than a decade. Isn't there a bigger problem for transmit? If the kernel is doing ANY validation on the frames it must copy the data to memory the application cannot modify before doing the validation. Otherwise the application could change the data afterwards. David
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, Dec 14, 2016 at 8:32 AM, John Fastabendwrote: > On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote: >> On Tue, 13 Dec 2016 12:08:21 -0800 >> John Fastabend wrote: >> >>> On 16-12-13 11:53 AM, David Miller wrote: From: John Fastabend Date: Tue, 13 Dec 2016 09:43:59 -0800 > What does "zero-copy send packet-pages to the application/socket that > requested this" mean? At the moment on x86 page-flipping appears to be > more expensive than memcpy (I can post some data shortly) and shared > memory was proposed and rejected for security reasons when we were > working on bifurcated driver. The whole idea is that we map all the active RX ring pages into userspace from the start. And just how Jesper's page pool work will avoid DMA map/unmap, it will also avoid changing the userspace mapping of the pages as well. Thus avoiding the TLB/VM overhead altogether. >> >> Exactly. It is worth mentioning that pages entering the page pool need >> to be cleared (measured cost 143 cycles), in order to not leak any >> kernel info. The primary focus of this design is to make sure not to >> leak kernel info to userspace, but with an "exclusive" mode also >> support isolation between applications. >> >> >>> I get this but it requires applications to be isolated. The pages from >>> a queue can not be shared between multiple applications in different >>> trust domains. And the application has to be cooperative meaning it >>> can't "look" at data that has not been marked by the stack as OK. In >>> these schemes we tend to end up with something like virtio/vhost or >>> af_packet. >> >> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first >> two would require CAP_NET_ADMIN privileges. All modes have a trust >> domain id, that need to match e.g. when page reach the socket. > > Even mode 3 should required cap_net_admin we don't want userspace to > grab queues off the nic without it IMO. > >> >> Mode-1 "Shared": Application choose lowest isolation level, allowing >> multiple application to mmap VMA area. > > My only point here is applications can read each others data and all > applications need to cooperate for example one app could try to write > continuously to read only pages causing faults and what not. This is > all non standard and doesn't play well with cgroups and "normal" > applications. It requires a new orchestration model. > > I'm a bit skeptical of the use case but I know of a handful of reasons > to use this model. Maybe take a look at the ivshmem implementation in > DPDK. > > Also this still requires a hardware filter to push "application" traffic > onto reserved queues/pages as far as I can tell. > >> >> Mode-2 "Single-user": Application request it want to be the only user >> of the RX queue. This blocks other application to mmap VMA area. >> > > Assuming data is read-only sharing with the stack is possibly OK :/. I > guess you would need to pools of memory for data and skb so you don't > leak skb into user space. > > The devils in the details here. There are lots of hooks in the kernel > that can for example push the packet with a 'redirect' tc action for > example. And letting an app "read" data or impact performance of an > unrelated application is wrong IMO. Stacked devices also provide another > set of details that are a bit difficult to track down see all the > hardware offload efforts. > > I assume all these concerns are shared between mode-1 and mode-2 > >> Mode-3 "Exclusive": Application request to own RX queue. Packets are >> no longer allowed for normal netstack delivery. 
>> > > I have patches for this mode already but haven't pushed them due to > an alternative solution using VFIO. > >> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are >> still allowed to travel netstack and thus can contain packet data from >> other normal applications. This is part of the design, to share the >> NIC between netstack and an accelerated userspace application using RX >> zero-copy delivery. >> > > I don't think this is acceptable to be honest. Letting an application > potentially read/impact other arbitrary applications on the system > seems like a non-starter even with CAP_NET_ADMIN. At least this was > the conclusion from bifurcated driver work some time ago. I agree. This is a no-go from the performance perspective as well. At a minimum you would have to be zeroing out the page between uses to avoid leaking data, and that assumes that the program we are sending the pages to is slightly well behaved. If we think zeroing out an sk_buff is expensive wait until we are trying to do an entire 4K page. I think we are stuck with having to use a HW filter to split off application traffic to a specific ring, and then having to share the memory between the application and the kernel on that ring only. Any other
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote: > > Interesting. So you even imagine sockets registering memory regions > > with the NIC. If we had a proper NIC HW filter API across the drivers, > > to register the steering rule (like ibv_create_flow), this would be > > doable, but we don't (DPDK actually have an interesting proposal[1]) > > On a side note, this is what windows does with RIO ("registered I/O"). > Maybe you want to look at the API to get some ideas: allocating and > pinning down memory in user space and registering that with sockets to > get zero-copy IO. Yup that is also what I think. Regarding the memory registration and flow steering for user space RX/TX ring please look at the qpair model implemented by the RDMA subsystem in the kernel. The memory semantics are clearly established there and have been in use for more than a decade.
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote: > On Tue, 13 Dec 2016 12:08:21 -0800 > John Fastabendwrote: > >> On 16-12-13 11:53 AM, David Miller wrote: >>> From: John Fastabend >>> Date: Tue, 13 Dec 2016 09:43:59 -0800 >>> What does "zero-copy send packet-pages to the application/socket that requested this" mean? At the moment on x86 page-flipping appears to be more expensive than memcpy (I can post some data shortly) and shared memory was proposed and rejected for security reasons when we were working on bifurcated driver. >>> >>> The whole idea is that we map all the active RX ring pages into >>> userspace from the start. >>> >>> And just how Jesper's page pool work will avoid DMA map/unmap, >>> it will also avoid changing the userspace mapping of the pages >>> as well. >>> >>> Thus avoiding the TLB/VM overhead altogether. >>> > > Exactly. It is worth mentioning that pages entering the page pool need > to be cleared (measured cost 143 cycles), in order to not leak any > kernel info. The primary focus of this design is to make sure not to > leak kernel info to userspace, but with an "exclusive" mode also > support isolation between applications. > > >> I get this but it requires applications to be isolated. The pages from >> a queue can not be shared between multiple applications in different >> trust domains. And the application has to be cooperative meaning it >> can't "look" at data that has not been marked by the stack as OK. In >> these schemes we tend to end up with something like virtio/vhost or >> af_packet. > > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first > two would require CAP_NET_ADMIN privileges. All modes have a trust > domain id, that need to match e.g. when page reach the socket. Even mode 3 should required cap_net_admin we don't want userspace to grab queues off the nic without it IMO. > > Mode-1 "Shared": Application choose lowest isolation level, allowing > multiple application to mmap VMA area. My only point here is applications can read each others data and all applications need to cooperate for example one app could try to write continuously to read only pages causing faults and what not. This is all non standard and doesn't play well with cgroups and "normal" applications. It requires a new orchestration model. I'm a bit skeptical of the use case but I know of a handful of reasons to use this model. Maybe take a look at the ivshmem implementation in DPDK. Also this still requires a hardware filter to push "application" traffic onto reserved queues/pages as far as I can tell. > > Mode-2 "Single-user": Application request it want to be the only user > of the RX queue. This blocks other application to mmap VMA area. > Assuming data is read-only sharing with the stack is possibly OK :/. I guess you would need to pools of memory for data and skb so you don't leak skb into user space. The devils in the details here. There are lots of hooks in the kernel that can for example push the packet with a 'redirect' tc action for example. And letting an app "read" data or impact performance of an unrelated application is wrong IMO. Stacked devices also provide another set of details that are a bit difficult to track down see all the hardware offload efforts. I assume all these concerns are shared between mode-1 and mode-2 > Mode-3 "Exclusive": Application request to own RX queue. Packets are > no longer allowed for normal netstack delivery. > I have patches for this mode already but haven't pushed them due to an alternative solution using VFIO. 
> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are > still allowed to travel netstack and thus can contain packet data from > other normal applications. This is part of the design, to share the > NIC between netstack and an accelerated userspace application using RX > zero-copy delivery. > I don't think this is acceptable to be honest. Letting an application potentially read/impact other arbitrary applications on the system seems like a non-starter even with CAP_NET_ADMIN. At least this was the conclusion from bifurcated driver work some time ago. > >> Any ACLs/filtering/switching/headers need to be done in hardware or >> the application trust boundaries are broken. > > The software solution outlined allow the application to make the choice > of what trust boundary it wants. > > The "exclusive" mode-3 make most sense together with HW filters. > Already today, we support creating a new RX queue based on ethtool > ntuple HW filter and then you simply attach your application that queue > in mode-3, and have full isolation. > Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters? Without hardware filters we have no way of knowing who/what data is put in the page. > >> If the above can not be met then a copy is needed. What I am trying >> to tease out is the above
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Tue, 13 Dec 2016 12:08:21 -0800 John Fastabendwrote: > On 16-12-13 11:53 AM, David Miller wrote: > > From: John Fastabend > > Date: Tue, 13 Dec 2016 09:43:59 -0800 > > > >> What does "zero-copy send packet-pages to the application/socket that > >> requested this" mean? At the moment on x86 page-flipping appears to be > >> more expensive than memcpy (I can post some data shortly) and shared > >> memory was proposed and rejected for security reasons when we were > >> working on bifurcated driver. > > > > The whole idea is that we map all the active RX ring pages into > > userspace from the start. > > > > And just how Jesper's page pool work will avoid DMA map/unmap, > > it will also avoid changing the userspace mapping of the pages > > as well. > > > > Thus avoiding the TLB/VM overhead altogether. > > Exactly. It is worth mentioning that pages entering the page pool need to be cleared (measured cost 143 cycles), in order to not leak any kernel info. The primary focus of this design is to make sure not to leak kernel info to userspace, but with an "exclusive" mode also support isolation between applications. > I get this but it requires applications to be isolated. The pages from > a queue can not be shared between multiple applications in different > trust domains. And the application has to be cooperative meaning it > can't "look" at data that has not been marked by the stack as OK. In > these schemes we tend to end up with something like virtio/vhost or > af_packet. I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first two would require CAP_NET_ADMIN privileges. All modes have a trust domain id, that need to match e.g. when page reach the socket. Mode-1 "Shared": Application choose lowest isolation level, allowing multiple application to mmap VMA area. Mode-2 "Single-user": Application request it want to be the only user of the RX queue. This blocks other application to mmap VMA area. Mode-3 "Exclusive": Application request to own RX queue. Packets are no longer allowed for normal netstack delivery. Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are still allowed to travel netstack and thus can contain packet data from other normal applications. This is part of the design, to share the NIC between netstack and an accelerated userspace application using RX zero-copy delivery. > Any ACLs/filtering/switching/headers need to be done in hardware or > the application trust boundaries are broken. The software solution outlined allow the application to make the choice of what trust boundary it wants. The "exclusive" mode-3 make most sense together with HW filters. Already today, we support creating a new RX queue based on ethtool ntuple HW filter and then you simply attach your application that queue in mode-3, and have full isolation. > If the above can not be met then a copy is needed. What I am trying > to tease out is the above comment along with other statements like > this "can be done with out HW filter features". Does this address your concerns? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
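For concreteness, the ethtool ntuple filter referred to above can be created today with something along the lines of the command below (interface, port and queue number are examples only, and the NIC must support ntuple filters, e.g. enabled via "ethtool -K eth0 ntuple on"):

  ethtool -N eth0 flow-type tcp4 dst-port 9000 action 7

which asks the hardware to steer TCP packets with destination port 9000 to RX queue 7; an application in mode-3 would then attach to that queue.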
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 16-12-13 11:53 AM, David Miller wrote: > From: John Fastabend> Date: Tue, 13 Dec 2016 09:43:59 -0800 > >> What does "zero-copy send packet-pages to the application/socket that >> requested this" mean? At the moment on x86 page-flipping appears to be >> more expensive than memcpy (I can post some data shortly) and shared >> memory was proposed and rejected for security reasons when we were >> working on bifurcated driver. > > The whole idea is that we map all the active RX ring pages into > userspace from the start. > > And just how Jesper's page pool work will avoid DMA map/unmap, > it will also avoid changing the userspace mapping of the pages > as well. > > Thus avoiding the TLB/VM overhead altogether. > I get this but it requires applications to be isolated. The pages from a queue can not be shared between multiple applications in different trust domains. And the application has to be cooperative meaning it can't "look" at data that has not been marked by the stack as OK. In these schemes we tend to end up with something like virtio/vhost or af_packet. Any ACLs/filtering/switching/headers need to be done in hardware or the application trust boundaries are broken. If the above can not be met then a copy is needed. What I am trying to tease out is the above comment along with other statements like this "can be done with out HW filter features". .John
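For context on the af_packet comparison: the existing packet-mmap interface already gives userspace a kernel-filled RX ring, but with a copy into the ring; a minimal sketch (ring geometry is arbitrary, error handling omitted):

#include <sys/mman.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

/* Sketch of the classic AF_PACKET mmap'ed RX ring: the kernel *copies*
 * each frame into a ring shared with userspace.  The zero-copy schemes
 * discussed in this thread aim to avoid that copy by exposing the NIC's
 * own RX pages instead.  Needs CAP_NET_RAW.
 */
static void *setup_packet_rx_ring(int *out_fd)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	struct tpacket_req req = {
		.tp_block_size	= 4096,
		.tp_block_nr	= 64,
		.tp_frame_size	= 2048,
		.tp_frame_nr	= 128,	/* = block_nr * frames per block */
	};
	void *ring;

	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
	ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	*out_fd = fd;
	return ring;	/* frames are prefixed by struct tpacket_hdr */
}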
Re: Designing a safe RX-zero-copy Memory Model for Networking
From: John Fastabend Date: Tue, 13 Dec 2016 09:43:59 -0800 > What does "zero-copy send packet-pages to the application/socket that > requested this" mean? At the moment on x86 page-flipping appears to be > more expensive than memcpy (I can post some data shortly) and shared > memory was proposed and rejected for security reasons when we were > working on bifurcated driver. The whole idea is that we map all the active RX ring pages into userspace from the start. And just how Jesper's page pool work will avoid DMA map/unmap, it will also avoid changing the userspace mapping of the pages as well. Thus avoiding the TLB/VM overhead altogether.
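Purely to illustrate "map all the active RX ring pages into userspace from the start": the userspace side could be a single one-time mmap(). The device node and interface below are hypothetical; no such interface exists today:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* HYPOTHETICAL sketch - no such device node or interface exists today.
 * The point is only that the mapping is set up once, so the steady-state
 * RX path needs no mmap/munmap and no TLB shootdowns.
 */
static void *map_rx_ring_pages(size_t ring_bytes)
{
	int fd = open("/dev/hypothetical-pagepool0", O_RDONLY);	/* hypothetical */
	void *base = mmap(NULL, ring_bytes, PROT_READ, MAP_SHARED, fd, 0);

	close(fd);	/* the mapping itself stays valid */
	return base == MAP_FAILED ? NULL : base;
}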
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 13.12.2016 17:10, Jesper Dangaard Brouer wrote: >> What is bad about RDMA is that it is a separate kernel subsystem. >> What I would like to see is a deeper integration with the network >> stack so that memory regions can be registred with a network socket >> and work requests then can be submitted and processed that directly >> read and write in these regions. The network stack should provide the >> services that the hardware of the NIC does not suppport as usual. > > Interesting. So you even imagine sockets registering memory regions > with the NIC. If we had a proper NIC HW filter API across the drivers, > to register the steering rule (like ibv_create_flow), this would be > doable, but we don't (DPDK actually have an interesting proposal[1]) On a side note, this is what windows does with RIO ("registered I/O"). Maybe you want to look at the API to get some ideas: allocating and pinning down memory in user space and registering that with sockets to get zero-copy IO.
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote: > > On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter> wrote: >> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote: >> >>> Hmmm. If you can rely on hardware setup to give you steering and >>> dedicated access to the RX rings. In those cases, I guess, the "push" >>> model could be a more direct API approach. >> >> If the hardware does not support steering then one should be able to >> provide those services in software. > > This is the early demux problem. With the push-mode of registering > memory, you need hardware steering support, for zero-copy support, as > the software step happens after DMA engine have written into the memory. > > My model pre-VMA map all the pages in the RX ring (if zero-copy gets > enabled, by a single user). The software step can filter and zero-copy > send packet-pages to the application/socket that requested this. The What does "zero-copy send packet-pages to the application/socket that requested this" mean? At the moment on x86 page-flipping appears to be more expensive than memcpy (I can post some data shortly) and shared memory was proposed and rejected for security reasons when we were working on bifurcated driver. > disadvantage is all zero-copy application need to share this VMA > mapping. This is solved by configuring HW filters into a RX-queue, and > then only attach your zero-copy application to that queue. > > >>> I was shooting for a model that worked without hardware support. >>> And then transparently benefit from HW support by configuring a HW >>> filter into a specific RX queue and attaching/using to that queue. >> >> The discussion here is a bit amusing since these issues have been >> resolved a long time ago with the design of the RDMA subsystem. Zero >> copy is already in wide use. Memory registration is used to pin down >> memory areas. Work requests can be filed with the RDMA subsystem that >> then send and receive packets from the registered memory regions. >> This is not strictly remote memory access but this is a basic mode of >> operations supported by the RDMA subsystem. The mlx5 driver quoted >> here supports all of that. > > I hear what you are saying. I will look into a push-model, as it might > be a better solution. > I will read up on RDMA + verbs and learn more about their API model. I > even plan to write a small sample program to get a feeling for the API, > and maybe we can use that as a baseline for the performance target we > can obtain on the same HW. (Thanks to Björn for already giving me some > pointer here) > > >> What is bad about RDMA is that it is a separate kernel subsystem. >> What I would like to see is a deeper integration with the network >> stack so that memory regions can be registred with a network socket >> and work requests then can be submitted and processed that directly >> read and write in these regions. The network stack should provide the >> services that the hardware of the NIC does not suppport as usual. > > Interesting. So you even imagine sockets registering memory regions > with the NIC. If we had a proper NIC HW filter API across the drivers, > to register the steering rule (like ibv_create_flow), this would be > doable, but we don't (DPDK actually have an interesting proposal[1]) > Note rte_flow is in the same family of APIs as the proposed Flow API that was rejected as well. The features in Flow API that are not included in the rte_flow proposal have logical extensions to support them. 
In kernel we have 'tc' and multiple vendors support cls_flower and cls_tc which offer a subset of the functionality in the DPDK implementation. Are you suggesting 'tc' is not a proper NIC HW filter API? > >> The RX/TX ring in user space should be an additional mode of >> operation of the socket layer. Once that is in place the "Remote >> memory acces" can be trivially implemented on top of that and the >> ugly RDMA sidecar subsystem can go away. > > I cannot follow that 100%, but I guess you are saying we also need a > more efficient mode of handing over pages/packet to userspace (than > going through the normal socket API calls). > > > Appreciate your input, it challenged my thinking. >
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote: > This is the early demux problem. With the push-mode of registering > memory, you need hardware steering support, for zero-copy support, as > the software step happens after DMA engine have written into the memory. Right. But we could fall back to software. Transfer to a kernel buffer and then move stuff over. Not much of an improvment but it will make things work. > > The discussion here is a bit amusing since these issues have been > > resolved a long time ago with the design of the RDMA subsystem. Zero > > copy is already in wide use. Memory registration is used to pin down > > memory areas. Work requests can be filed with the RDMA subsystem that > > then send and receive packets from the registered memory regions. > > This is not strictly remote memory access but this is a basic mode of > > operations supported by the RDMA subsystem. The mlx5 driver quoted > > here supports all of that. > > I hear what you are saying. I will look into a push-model, as it might > be a better solution. > I will read up on RDMA + verbs and learn more about their API model. I > even plan to write a small sample program to get a feeling for the API, > and maybe we can use that as a baseline for the performance target we > can obtain on the same HW. (Thanks to Björn for already giving me some > pointer here) Great. > > What is bad about RDMA is that it is a separate kernel subsystem. > > What I would like to see is a deeper integration with the network > > stack so that memory regions can be registred with a network socket > > and work requests then can be submitted and processed that directly > > read and write in these regions. The network stack should provide the > > services that the hardware of the NIC does not suppport as usual. > > Interesting. So you even imagine sockets registering memory regions > with the NIC. If we had a proper NIC HW filter API across the drivers, > to register the steering rule (like ibv_create_flow), this would be > doable, but we don't (DPDK actually have an interesting proposal[1]) Well doing this would mean adding some features and that also would at best allow general support for zero copy direct to user space with a fallback to software if the hardware is missing some feature. > > The RX/TX ring in user space should be an additional mode of > > operation of the socket layer. Once that is in place the "Remote > > memory acces" can be trivially implemented on top of that and the > > ugly RDMA sidecar subsystem can go away. > > I cannot follow that 100%, but I guess you are saying we also need a > more efficient mode of handing over pages/packet to userspace (than > going through the normal socket API calls). A work request contains the user space address of the data to be sent and/or received. The address must be in a registered memory region. This is different from copying the packet into kernel data structures. I think this can easily be generalized. We need support for registering memory regions, submissions of work request and the processing of completion requets. QP (queue-pair) processing is probably the basis for the whole scheme that is used in multiple context these days.
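To put the work-request/completion vocabulary into code, the per-buffer receive path in the verbs model is roughly as follows (a sketch; qp, cq and mr are assumed to have been created as in the earlier registration sketch, with the QP in a state that accepts receives):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: post a receive work request pointing into registered memory,
 * then reap its completion.  The completion says which buffer the HW
 * filled and how many bytes arrived - no per-packet allocation or copy.
 */
static int rx_one(struct ibv_qp *qp, struct ibv_cq *cq,
		  struct ibv_mr *mr, void *buf, uint32_t len)
{
	struct ibv_sge sge = {
		.addr	= (uintptr_t)buf,	/* must lie inside mr */
		.length	= len,
		.lkey	= mr->lkey,
	};
	struct ibv_recv_wr wr = {
		.wr_id	 = (uintptr_t)buf,
		.sg_list = &sge,
		.num_sge = 1,
	};
	struct ibv_recv_wr *bad;
	struct ibv_wc wc;
	int n;

	if (ibv_post_recv(qp, &wr, &bad))
		return -1;

	do {					/* busy-poll for the sketch */
		n = ibv_poll_cq(cq, 1, &wc);
	} while (n == 0);

	if (n < 0 || wc.status != IBV_WC_SUCCESS)
		return -1;
	return (int)wc.byte_len;
}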
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameterwrote: > On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote: > > > Hmmm. If you can rely on hardware setup to give you steering and > > dedicated access to the RX rings. In those cases, I guess, the "push" > > model could be a more direct API approach. > > If the hardware does not support steering then one should be able to > provide those services in software. This is the early demux problem. With the push-mode of registering memory, you need hardware steering support, for zero-copy support, as the software step happens after DMA engine have written into the memory. My model pre-VMA map all the pages in the RX ring (if zero-copy gets enabled, by a single user). The software step can filter and zero-copy send packet-pages to the application/socket that requested this. The disadvantage is all zero-copy application need to share this VMA mapping. This is solved by configuring HW filters into a RX-queue, and then only attach your zero-copy application to that queue. > > I was shooting for a model that worked without hardware support. > > And then transparently benefit from HW support by configuring a HW > > filter into a specific RX queue and attaching/using to that queue. > > The discussion here is a bit amusing since these issues have been > resolved a long time ago with the design of the RDMA subsystem. Zero > copy is already in wide use. Memory registration is used to pin down > memory areas. Work requests can be filed with the RDMA subsystem that > then send and receive packets from the registered memory regions. > This is not strictly remote memory access but this is a basic mode of > operations supported by the RDMA subsystem. The mlx5 driver quoted > here supports all of that. I hear what you are saying. I will look into a push-model, as it might be a better solution. I will read up on RDMA + verbs and learn more about their API model. I even plan to write a small sample program to get a feeling for the API, and maybe we can use that as a baseline for the performance target we can obtain on the same HW. (Thanks to Björn for already giving me some pointer here) > What is bad about RDMA is that it is a separate kernel subsystem. > What I would like to see is a deeper integration with the network > stack so that memory regions can be registred with a network socket > and work requests then can be submitted and processed that directly > read and write in these regions. The network stack should provide the > services that the hardware of the NIC does not suppport as usual. Interesting. So you even imagine sockets registering memory regions with the NIC. If we had a proper NIC HW filter API across the drivers, to register the steering rule (like ibv_create_flow), this would be doable, but we don't (DPDK actually have an interesting proposal[1]) > The RX/TX ring in user space should be an additional mode of > operation of the socket layer. Once that is in place the "Remote > memory acces" can be trivially implemented on top of that and the > ugly RDMA sidecar subsystem can go away. I cannot follow that 100%, but I guess you are saying we also need a more efficient mode of handing over pages/packet to userspace (than going through the normal socket API calls). Appreciate your input, it challenged my thinking. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer [1] https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
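Since ibv_create_flow is the reference point here, a sketch of installing such a steering rule with the existing verbs API follows (the port number is arbitrary; on real hardware an ibv_flow_spec_eth entry is usually required in front of the TCP spec and is omitted here for brevity; the exact struct layout should be checked against ibv_create_flow(3)):

#include <arpa/inet.h>
#include <infiniband/verbs.h>

/* Sketch: steer TCP dst-port 9000 to a given (raw packet) QP.  Flow
 * specs are laid out back-to-back after ibv_flow_attr, hence the packed
 * wrapper struct.
 */
struct flow_rule {
	struct ibv_flow_attr		attr;
	struct ibv_flow_spec_tcp_udp	tcp;
} __attribute__((packed));

static struct ibv_flow *steer_port_to_qp(struct ibv_qp *qp)
{
	struct flow_rule rule = {
		.attr = {
			.type		= IBV_FLOW_ATTR_NORMAL,
			.size		= sizeof(rule),
			.num_of_specs	= 1,
			.port		= 1,
		},
		.tcp = {
			.type	= IBV_FLOW_SPEC_TCP,
			.size	= sizeof(struct ibv_flow_spec_tcp_udp),
			.val	= { .dst_port = htons(9000) },
			.mask	= { .dst_port = 0xffff },
		},
	};

	return ibv_create_flow(qp, &rule.attr);
}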
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, Dec 12, 2016 at 06:49:03AM -0800, John Fastabend wrote: > On 16-12-12 06:14 AM, Mike Rapoport wrote: > >> > > We were not considered using XDP yet, so we've decided to limit the initial > > implementation to macvtap because we can ensure correspondence between a > > NIC queue and virtual NIC, which is not the case with more generic tap > > device. It could be that use of XDP will allow for a generic solution for > > virtio case as well. > > Interesting this was one of the original ideas behind the macvlan > offload mode. iirc Vlad also was interested in this. > > I'm guessing this was used because of the ability to push macvlan onto > its own queue? Yes, with a queue dedicated to a virtual NIC we only need to ensure that guest memory is used for RX buffers. > >> > >>> Have you considered using "push" model for setting the NIC's RX memory? > >> > >> I don't understand what you mean by a "push" model? > > > > Currently, memory allocation in NIC drivers boils down to alloc_page with > > some wrapping code. I see two possible ways to make NIC use of some > > preallocated pages: either NIC driver will call an API (probably different > > from alloc_page) to obtain that memory, or there will be NDO API that > > allows to set the NIC's RX buffers. I named the later case "push". > > I prefer the ndo op. This matches up well with AF_PACKET model where we > have "slots" and offload is just a transparent "push" of these "slots" > to the driver. Below we have a snippet of our proposed API, > > (https://patchwork.ozlabs.org/patch/396714/ note the descriptor mapping > bits will be dropped) > > + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma, > + *struct net_device *dev) > + * Called to map queue pair range from split_queue_pairs into > + * mmap region. 
> +
> +static int
> +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> +	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long dummy_page_phy;
> +	pgprot_t pre_vm_page_prot;
> +	unsigned long start;
> +	unsigned int i;
> +	int err;
> +
> +	if (!dummy_page_buf) {
> +		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> +		if (!dummy_page_buf)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> +			dummy_page_buf[i] = 0xdeadbeef;
> +	}
> +
> +	dummy_page_phy = virt_to_phys(dummy_page_buf);
> +	pre_vm_page_prot = vma->vm_page_prot;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +	/* assume the vm_start is 4K aligned address */
> +	for (start = vma->vm_start;
> +	     start < vma->vm_end;
> +	     start += PAGE_SIZE_4K) {
> +		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else {
> +			unsigned long addr = dummy_page_phy >> PAGE_SHIFT;
> +
> +			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> +					      pre_vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		}
> +	}
> +	return 0;
> +}
> +
>
> Any thoughts on something like the above? We could push it when net-next
> opens. One piece that fits naturally into vhost/macvtap is the kicks and
> queue splicing are already there so no need to implement this making the
> above patch much simpler.

Sorry, but I don't quite follow you here. The vhost does not use vma mappings, it just sees a bunch of pages pointed by the vring descriptors...

> .John

--
Sincerely yours, Mike.
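For orientation, a hypothetical userspace counterpart to the qpair mmap above might look as follows. The setsockopt step, region size and offsets below are invented placeholders (the real ones are defined by the referenced AF_PACKET patch), so this is only a sketch of how the mapped window would be consumed.

#include <err.h>
#include <arpa/inet.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_ether.h>

#define QPAIR_REGION_SIZE   (16 * 4096)	/* invented: size of the mapped window */
#define RX_DESC_WINDOW_OFF  (1 * 4096)	/* invented: mirrors RX_DESC_ADDR_OFFSET */

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		err(1, "socket");

	/* ... bind to the device and request a dedicated queue pair here ... */

	void *win = mmap(NULL, QPAIR_REGION_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (win == MAP_FAILED)
		err(1, "mmap");

	/* RX descriptors would then be visible at win + RX_DESC_WINDOW_OFF,
	 * with every other page backed by the 0xdeadbeef dummy page. */
	return 0;
}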
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, Dec 12, 2016 at 04:10:26PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 12 Dec 2016 16:14:33 +0200 > Mike Rapoportwrote: > > > > They are copied :-) > > Presuming we are dealing only with vhost backend, the received skb > > eventually gets converted to IOVs, which in turn are copied to the guest > > memory. The IOVs point to the guest memory that is allocated by virtio-net > > running in the guest. > > Thanks for explaining that. It seems like a lot of overhead. I have to > wrap my head around this... so, the hardware NIC is receiving the > packet/page, in the RX ring, and after converting it to IOVs, it is > conceptually transmitted into the guest, and then the guest-side have a > RX-function to handle this packet. Correctly understood? Almost :) For the hardware NIC driver, the receive just follows the "normal" path. It creates an skb for the packet and passes it to the net core RX. Then the skb is delivered to tap/macvtap. The later converts the skb to IOVs and IOVs are pushed to the guest address space. On the guest side, virtio-net sees these IOVs as a part of its RX ring, it creates an skb for the packet and passes the skb to the net core of the guest. > > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate > > what needs to be done in BPF program to do proper conversion of skb to the > > virtio descriptors. > > XDP is a step _before_ the SKB is allocated. The XDP eBPF program can > modify the packet-page data, but I don't think it is needed for your > use-case. View XDP (primarily) as an early (demux) filter. > > XDP is missing a feature your need, which is TX packet into another > net_device (I actually imagine a port mapping table, that point to a > net_device). This require a new "TX-raw" NDO that takes a page (+ > offset and length). > > I imagine, the virtio driver (virtio_net or a new driver?) getting > extended with this new "TX-raw" NDO, that takes "raw" packet-pages. > Whether zero-copy is possible is determined by checking if page > originates from a page_pool that have enabled zero-copy (and likely > matching against a "protection domain" id number). That could be quite a few drivers that will need to implement "TX-raw" then :) In general case, the virtual NIC may be connected to the physical network via long chain of virtual devices such as bridge, veth and ovs. Actually, because of that we wanted to concentrate on macvtap... > > We were not considered using XDP yet, so we've decided to limit the initial > > implementation to macvtap because we can ensure correspondence between a > > NIC queue and virtual NIC, which is not the case with more generic tap > > device. It could be that use of XDP will allow for a generic solution for > > virtio case as well. > > You don't need an XDP filter, if you can make the HW do the early demux > binding into a queue. The check for if memory is zero-copy enabled > would be the same. > > > > > > > > Have you considered using "push" model for setting the NIC's RX memory? > > > > > > > > > > I don't understand what you mean by a "push" model? > > > > Currently, memory allocation in NIC drivers boils down to alloc_page with > > some wrapping code. I see two possible ways to make NIC use of some > > preallocated pages: either NIC driver will call an API (probably different > > from alloc_page) to obtain that memory, or there will be NDO API that > > allows to set the NIC's RX buffers. I named the later case "push". 
> > As you might have guessed, I'm not into the "push" model, because this > means I cannot share the queue with the normal network stack. Which I > believe is possible as outlined (in email and [2]) and can be done without HW filter features (like macvlan). I think I should sleep on it a bit more :) Probably we can add a page_pool "backend" implementation to vhost... -- Sincerely yours, Mike.
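As a thought experiment, the "TX-raw" NDO that is sketched in words in the quoted mail above might look roughly like the following. Neither the ndo, its signature, nor the page_pool helper exist; the names are placeholders for the idea of transmitting a raw packet-page into another net_device without building an SKB.

#include <linux/netdevice.h>
#include <linux/mm.h>

/* Placeholder check: "does this page come from a zero-copy enabled
 * page_pool in the right protection domain?" (hypothetical helper) */
static bool page_is_zc_capable(struct page *page, u32 pd_id)
{
	return false;	/* real logic would consult the page's page_pool */
}

/* Hypothetical "TX-raw" ndo: hand a raw packet-page (+ offset/len) to a
 * device, e.g. a virtio backend, without allocating an SKB. */
static int virtio_ndo_xmit_raw_page(struct net_device *dev,
				    struct page *page,
				    unsigned int offset, unsigned int len,
				    u32 pd_id)
{
	if (!page_is_zc_capable(page, pd_id))
		return -EOPNOTSUPP;	/* caller must fall back to copying */

	/* ... post page + offset/len directly into the guest-visible ring ... */
	return 0;
}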
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:

> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings. In those cases, I guess, the "push"
> model could be a more direct API approach.

If the hardware does not support steering then one should be able to provide those services in software.

> I was shooting for a model that worked without hardware support. And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.

The discussion here is a bit amusing since these issues have been resolved a long time ago with the design of the RDMA subsystem. Zero copy is already in wide use. Memory registration is used to pin down memory areas. Work requests can be filed with the RDMA subsystem that then send and receive packets from the registered memory regions. This is not strictly remote memory access but this is a basic mode of operations supported by the RDMA subsystem. The mlx5 driver quoted here supports all of that.

What is bad about RDMA is that it is a separate kernel subsystem. What I would like to see is a deeper integration with the network stack so that memory regions can be registered with a network socket and work requests then can be submitted and processed that directly read and write in these regions. The network stack should provide the services that the hardware of the NIC does not support as usual.

The RX/TX ring in user space should be an additional mode of operation of the socket layer. Once that is in place the "Remote memory access" can be trivially implemented on top of that and the ugly RDMA sidecar subsystem can go away.
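For readers unfamiliar with the RDMA model described here, the basic registration-plus-work-request flow in userspace looks roughly like this: a minimal sketch using the existing verbs API, with PD/QP creation and completion polling omitted.

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Pin a buffer and post it as a receive work request on an existing QP. */
static int post_zero_copy_rx(struct ibv_pd *pd, struct ibv_qp *qp, size_t len)
{
	void *buf = aligned_alloc(4096, len);
	if (!buf)
		return -1;

	/* Memory registration: pins the pages and hands the NIC a key. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return -1;

	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_recv_wr wr = {
		.wr_id   = (uintptr_t)buf,
		.sg_list = &sge,
		.num_sge = 1,
	};
	struct ibv_recv_wr *bad_wr;

	/* The NIC DMAs the next matching packet straight into buf. */
	return ibv_post_recv(qp, &wr, &bad_wr);
}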
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016 06:49:03 -0800 John Fastabendwrote: > On 16-12-12 06:14 AM, Mike Rapoport wrote: > > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote: > >> > >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport > >> wrote: > >> > >>> Hello Jesper, > >>> > >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: > Hi all, > > This is my design for how to safely handle RX zero-copy in the network > stack, by using page_pool[1] and modifying NIC drivers. Safely means > not leaking kernel info in pages mapped to userspace and resilience > so a malicious userspace app cannot crash the kernel. > > Design target > = > > Allow the NIC to function as a normal Linux NIC and be shared in a > safe manor, between the kernel network stack and an accelerated > userspace application using RX zero-copy delivery. > > Target is to provide the basis for building RX zero-copy solutions in > a memory safe manor. An efficient communication channel for userspace > delivery is out of scope for this document, but OOM considerations are > discussed below (`Userspace delivery and OOM`_). > >>> > >>> Sorry, if this reply is a bit off-topic. > >> > >> It is very much on topic IMHO :-) > >> > >>> I'm working on implementation of RX zero-copy for virtio and I've > >>> dedicated > >>> some thought about making guest memory available for physical NIC DMAs. > >>> I believe this is quite related to your page_pool proposal, at least from > >>> the NIC driver perspective, so I'd like to share some thoughts here. > >> > >> Seems quite related. I'm very interested in cooperating with you! I'm > >> not very familiar with virtio, and how packets/pages gets channeled > >> into virtio. > > > > They are copied :-) > > Presuming we are dealing only with vhost backend, the received skb > > eventually gets converted to IOVs, which in turn are copied to the guest > > memory. The IOVs point to the guest memory that is allocated by virtio-net > > running in the guest. > > > > Great I'm also doing something similar. > > My plan was to embed the zero copy as an AF_PACKET mode and then push > a AF_PACKET backend into vhost. I'll post a patch later this week. > > >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. > >>> using macvtap, and then propagate guest RX memory allocations to the NIC > >>> using something like new .ndo_set_rx_buffers method. > >> > >> I believe the page_pool API/design aligns with this idea/use-case. > >> > >>> What is your view about interface between the page_pool and the NIC > >>> drivers? > >> > >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register > >> a page_pool per RX queue. This is done for two reasons (1) performance > >> and (2) for supporting use-cases where only one single RX-ring queue is > >> (re)configured to support RX-zero-copy. There are some associated > >> extra cost of enabling this mode, thus it makes sense to only enable it > >> when needed. > >> > >> I've not decided how this gets enabled, maybe some new driver NDO. It > >> could also happen when a XDP program gets loaded, which request this > >> feature. > >> > >> The macvtap solution is nice and we should support it, but it requires > >> VM to have their MAC-addr registered on the physical switch. This > >> design is about adding flexibility. Registering an XDP eBPF filter > >> provides the maximum flexibility for matching the destination VM. 
> > > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate > > what needs to be done in BPF program to do proper conversion of skb to the > > virtio descriptors. > > I don't think XDP has much to do with this code and they should be done > separately. XDP runs eBPF code on received packets after the DMA engine > has already placed the packet in memory so its too late in the process. It does not have to be connected to XDP. My idea should support RX zero-copy into normal sockets, without XDP. My idea was to pre-VMA map the RX ring, when zero-copy is requested, thus it is not too late in the process. When frames travel the normal network stack, this requires the SKB-read-only-page mode (skb-frags). If the SKB reaches a socket that supports zero-copy, then we can do RX zero-copy on normal sockets. > The other piece here is enabling XDP in vhost but that is again separate > IMO. > > Notice that ixgbe supports pushing packets into a macvlan via 'tc' > traffic steering commands so even though macvlan gets an L2 address it > doesn't mean it can't use other criteria to steer traffic to it. This sounds interesting, as it allows much more flexible macvlan matching, which I like, but it still depends on HW support. > > We were not considered using XDP yet, so we've decided to limit the initial > > implementation
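For context on the AF_PACKET mode mentioned in this sub-thread: the existing mmap'ed PACKET_RX_RING already gives userspace a ring of fixed-size "slots". A minimal sketch of its setup follows (error handling and the poll loop are omitted); note that this existing ring is still copy-based on RX, which is exactly the copy a zero-copy AF_PACKET mode would remove.

#include <err.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		err(1, "socket");

	int ver = TPACKET_V2;
	if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver)))
		err(1, "PACKET_VERSION");

	/* Describe the ring of fixed-size "slots" the kernel fills for us. */
	struct tpacket_req req = {
		.tp_block_size = 1 << 16,
		.tp_block_nr   = 64,
		.tp_frame_size = 1 << 11,
		.tp_frame_nr   = ((1 << 16) / (1 << 11)) * 64,
	};
	if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)))
		err(1, "PACKET_RX_RING");

	/* Map the ring; each slot starts with a struct tpacket2_hdr whose
	 * tp_status flips to TP_STATUS_USER when a packet has landed. */
	size_t sz = (size_t)req.tp_block_size * req.tp_block_nr;
	void *ring = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ring == MAP_FAILED)
		err(1, "mmap");

	/* ... poll(fd), walk frames, hand slots back with TP_STATUS_KERNEL ... */
	return 0;
}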
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016 16:14:33 +0200 Mike Rapoportwrote: > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote: > > > > On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport > > wrote: > > > > > Hello Jesper, > > > > > > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: > > > > Hi all, > > > > > > > > This is my design for how to safely handle RX zero-copy in the network > > > > stack, by using page_pool[1] and modifying NIC drivers. Safely means > > > > not leaking kernel info in pages mapped to userspace and resilience > > > > so a malicious userspace app cannot crash the kernel. > > > > > > > > Design target > > > > = > > > > > > > > Allow the NIC to function as a normal Linux NIC and be shared in a > > > > safe manor, between the kernel network stack and an accelerated > > > > userspace application using RX zero-copy delivery. > > > > > > > > Target is to provide the basis for building RX zero-copy solutions in > > > > a memory safe manor. An efficient communication channel for userspace > > > > delivery is out of scope for this document, but OOM considerations are > > > > discussed below (`Userspace delivery and OOM`_). > > > > > > Sorry, if this reply is a bit off-topic. > > > > It is very much on topic IMHO :-) > > > > > I'm working on implementation of RX zero-copy for virtio and I've > > > dedicated > > > some thought about making guest memory available for physical NIC DMAs. > > > I believe this is quite related to your page_pool proposal, at least from > > > the NIC driver perspective, so I'd like to share some thoughts here. > > > > Seems quite related. I'm very interested in cooperating with you! I'm > > not very familiar with virtio, and how packets/pages gets channeled > > into virtio. > > They are copied :-) > Presuming we are dealing only with vhost backend, the received skb > eventually gets converted to IOVs, which in turn are copied to the guest > memory. The IOVs point to the guest memory that is allocated by virtio-net > running in the guest. Thanks for explaining that. It seems like a lot of overhead. I have to wrap my head around this... so, the hardware NIC is receiving the packet/page, in the RX ring, and after converting it to IOVs, it is conceptually transmitted into the guest, and then the guest-side have a RX-function to handle this packet. Correctly understood? > > > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. > > > using macvtap, and then propagate guest RX memory allocations to the NIC > > > using something like new .ndo_set_rx_buffers method. > > > > I believe the page_pool API/design aligns with this idea/use-case. > > > > > What is your view about interface between the page_pool and the NIC > > > drivers? > > > > In my Prove-of-Concept implementation, the NIC driver (mlx5) register > > a page_pool per RX queue. This is done for two reasons (1) performance > > and (2) for supporting use-cases where only one single RX-ring queue is > > (re)configured to support RX-zero-copy. There are some associated > > extra cost of enabling this mode, thus it makes sense to only enable it > > when needed. > > > > I've not decided how this gets enabled, maybe some new driver NDO. It > > could also happen when a XDP program gets loaded, which request this > > feature. > > > > The macvtap solution is nice and we should support it, but it requires > > VM to have their MAC-addr registered on the physical switch. This > > design is about adding flexibility. 
Registering an XDP eBPF filter > > provides the maximum flexibility for matching the destination VM. > > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate > what needs to be done in BPF program to do proper conversion of skb to the > virtio descriptors. XDP is a step _before_ the SKB is allocated. The XDP eBPF program can modify the packet-page data, but I don't think it is needed for your use-case. View XDP (primarily) as an early (demux) filter. XDP is missing a feature you need, which is TX of a packet into another net_device (I actually imagine a port mapping table that points to a net_device). This requires a new "TX-raw" NDO that takes a page (+ offset and length). I imagine the virtio driver (virtio_net or a new driver?) getting extended with this new "TX-raw" NDO, that takes "raw" packet-pages. Whether zero-copy is possible is determined by checking if the page originates from a page_pool that has zero-copy enabled (and likely matching against a "protection domain" id number). > We were not considered using XDP yet, so we've decided to limit the initial > implementation to macvtap because we can ensure correspondence between a > NIC queue and virtual NIC, which is not the case with more generic tap > device. It could be that use of XDP will allow for a generic solution for > virtio case as well. You don't need an XDP
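To illustrate the "early (demux) filter" role described above: an XDP program runs on the raw packet page before any SKB exists and returns a verdict. A minimal sketch follows, using only the existing PASS/DROP verdicts; a zero-copy path would need a new return action, which is not shown.

/* Minimal XDP demux filter: drop everything that is not IPv4, pass the
 * rest to the normal (SKB) stack. Built with clang -target bpf. */
#include <linux/bpf.h>
#include <linux/if_ether.h>

#define SEC(name) __attribute__((section(name), used))
#define bpf_htons(x) __builtin_bswap16(x)	/* little-endian build assumed */

SEC("xdp")
int xdp_demux(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;

	if ((void *)(eth + 1) > data_end)	/* verifier-mandated bounds check */
		return XDP_DROP;

	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_DROP;

	return XDP_PASS;	/* continue into the normal RX path */
}

char _license[] SEC("license") = "GPL";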
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 16-12-12 06:14 AM, Mike Rapoport wrote: > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote: >> >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport>> wrote: >> >>> Hello Jesper, >>> >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: Hi all, This is my design for how to safely handle RX zero-copy in the network stack, by using page_pool[1] and modifying NIC drivers. Safely means not leaking kernel info in pages mapped to userspace and resilience so a malicious userspace app cannot crash the kernel. Design target = Allow the NIC to function as a normal Linux NIC and be shared in a safe manor, between the kernel network stack and an accelerated userspace application using RX zero-copy delivery. Target is to provide the basis for building RX zero-copy solutions in a memory safe manor. An efficient communication channel for userspace delivery is out of scope for this document, but OOM considerations are discussed below (`Userspace delivery and OOM`_). >>> >>> Sorry, if this reply is a bit off-topic. >> >> It is very much on topic IMHO :-) >> >>> I'm working on implementation of RX zero-copy for virtio and I've dedicated >>> some thought about making guest memory available for physical NIC DMAs. >>> I believe this is quite related to your page_pool proposal, at least from >>> the NIC driver perspective, so I'd like to share some thoughts here. >> >> Seems quite related. I'm very interested in cooperating with you! I'm >> not very familiar with virtio, and how packets/pages gets channeled >> into virtio. > > They are copied :-) > Presuming we are dealing only with vhost backend, the received skb > eventually gets converted to IOVs, which in turn are copied to the guest > memory. The IOVs point to the guest memory that is allocated by virtio-net > running in the guest. > Great I'm also doing something similar. My plan was to embed the zero copy as an AF_PACKET mode and then push a AF_PACKET backend into vhost. I'll post a patch later this week. >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. >>> using macvtap, and then propagate guest RX memory allocations to the NIC >>> using something like new .ndo_set_rx_buffers method. >> >> I believe the page_pool API/design aligns with this idea/use-case. >> >>> What is your view about interface between the page_pool and the NIC >>> drivers? >> >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register >> a page_pool per RX queue. This is done for two reasons (1) performance >> and (2) for supporting use-cases where only one single RX-ring queue is >> (re)configured to support RX-zero-copy. There are some associated >> extra cost of enabling this mode, thus it makes sense to only enable it >> when needed. >> >> I've not decided how this gets enabled, maybe some new driver NDO. It >> could also happen when a XDP program gets loaded, which request this >> feature. >> >> The macvtap solution is nice and we should support it, but it requires >> VM to have their MAC-addr registered on the physical switch. This >> design is about adding flexibility. Registering an XDP eBPF filter >> provides the maximum flexibility for matching the destination VM. > > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate > what needs to be done in BPF program to do proper conversion of skb to the > virtio descriptors. I don't think XDP has much to do with this code and they should be done separately. 
XDP runs eBPF code on received packets after the DMA engine has already placed the packet in memory so its too late in the process. The other piece here is enabling XDP in vhost but that is again separate IMO. Notice that ixgbe supports pushing packets into a macvlan via 'tc' traffic steering commands so even though macvlan gets an L2 address it doesn't mean it can't use other criteria to steer traffic to it. > > We were not considered using XDP yet, so we've decided to limit the initial > implementation to macvtap because we can ensure correspondence between a > NIC queue and virtual NIC, which is not the case with more generic tap > device. It could be that use of XDP will allow for a generic solution for > virtio case as well. Interesting this was one of the original ideas behind the macvlan offload mode. iirc Vlad also was interested in this. I'm guessing this was used because of the ability to push macvlan onto its own queue? > >> >>> Have you considered using "push" model for setting the NIC's RX memory? >> >> I don't understand what you mean by a "push" model? > > Currently, memory allocation in NIC drivers boils down to alloc_page with > some wrapping code. I see two possible ways to make NIC use of some > preallocated pages: either NIC driver will call an API (probably different > from alloc_page) to obtain that memory, or there
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote: > > On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport> wrote: > > > Hello Jesper, > > > > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: > > > Hi all, > > > > > > This is my design for how to safely handle RX zero-copy in the network > > > stack, by using page_pool[1] and modifying NIC drivers. Safely means > > > not leaking kernel info in pages mapped to userspace and resilience > > > so a malicious userspace app cannot crash the kernel. > > > > > > Design target > > > = > > > > > > Allow the NIC to function as a normal Linux NIC and be shared in a > > > safe manor, between the kernel network stack and an accelerated > > > userspace application using RX zero-copy delivery. > > > > > > Target is to provide the basis for building RX zero-copy solutions in > > > a memory safe manor. An efficient communication channel for userspace > > > delivery is out of scope for this document, but OOM considerations are > > > discussed below (`Userspace delivery and OOM`_). > > > > Sorry, if this reply is a bit off-topic. > > It is very much on topic IMHO :-) > > > I'm working on implementation of RX zero-copy for virtio and I've dedicated > > some thought about making guest memory available for physical NIC DMAs. > > I believe this is quite related to your page_pool proposal, at least from > > the NIC driver perspective, so I'd like to share some thoughts here. > > Seems quite related. I'm very interested in cooperating with you! I'm > not very familiar with virtio, and how packets/pages gets channeled > into virtio. They are copied :-) Presuming we are dealing only with vhost backend, the received skb eventually gets converted to IOVs, which in turn are copied to the guest memory. The IOVs point to the guest memory that is allocated by virtio-net running in the guest. > > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. > > using macvtap, and then propagate guest RX memory allocations to the NIC > > using something like new .ndo_set_rx_buffers method. > > I believe the page_pool API/design aligns with this idea/use-case. > > > What is your view about interface between the page_pool and the NIC > > drivers? > > In my Prove-of-Concept implementation, the NIC driver (mlx5) register > a page_pool per RX queue. This is done for two reasons (1) performance > and (2) for supporting use-cases where only one single RX-ring queue is > (re)configured to support RX-zero-copy. There are some associated > extra cost of enabling this mode, thus it makes sense to only enable it > when needed. > > I've not decided how this gets enabled, maybe some new driver NDO. It > could also happen when a XDP program gets loaded, which request this > feature. > > The macvtap solution is nice and we should support it, but it requires > VM to have their MAC-addr registered on the physical switch. This > design is about adding flexibility. Registering an XDP eBPF filter > provides the maximum flexibility for matching the destination VM. I'm not very familiar with XDP eBPF, and it's difficult for me to estimate what needs to be done in BPF program to do proper conversion of skb to the virtio descriptors. We were not considered using XDP yet, so we've decided to limit the initial implementation to macvtap because we can ensure correspondence between a NIC queue and virtual NIC, which is not the case with more generic tap device. It could be that use of XDP will allow for a generic solution for virtio case as well. 
> > > Have you considered using "push" model for setting the NIC's RX memory? > > I don't understand what you mean by a "push" model? Currently, memory allocation in NIC drivers boils down to alloc_page with some wrapping code. I see two possible ways to make NIC use of some preallocated pages: either the NIC driver will call an API (probably different from alloc_page) to obtain that memory, or there will be an NDO API that allows setting the NIC's RX buffers. I named the latter case "push". -- Sincerely yours, Mike.
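Purely to illustrate the "push" alternative described above (nothing like this exists in the kernel; the struct, names and signature are invented), the consumer would hand a set of pre-allocated pages to the driver for one RX queue through an NDO roughly like this:

#include <linux/netdevice.h>
#include <linux/mm.h>

/* Invented for illustration: describe pages the consumer (e.g. macvtap/
 * vhost on behalf of a guest) wants the NIC to DMA RX packets into. */
struct rx_buffer_push {
	unsigned int  queue_index;	/* HW RX queue dedicated to the consumer */
	unsigned int  nr_pages;
	struct page **pages;		/* pre-allocated, consumer-owned pages  */
};

/* Hypothetical ndo, after the .ndo_set_rx_buffers idea above: the driver
 * replaces its own alloc_page()-based refill for that queue with buffers
 * taken from this set. */
struct net_device_ops_push_sketch {
	int (*ndo_set_rx_buffers)(struct net_device *dev,
				  struct rx_buffer_push *bufs);
};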
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoportwrote: > Hello Jesper, > > On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: > > Hi all, > > > > This is my design for how to safely handle RX zero-copy in the network > > stack, by using page_pool[1] and modifying NIC drivers. Safely means > > not leaking kernel info in pages mapped to userspace and resilience > > so a malicious userspace app cannot crash the kernel. > > > > Design target > > = > > > > Allow the NIC to function as a normal Linux NIC and be shared in a > > safe manor, between the kernel network stack and an accelerated > > userspace application using RX zero-copy delivery. > > > > Target is to provide the basis for building RX zero-copy solutions in > > a memory safe manor. An efficient communication channel for userspace > > delivery is out of scope for this document, but OOM considerations are > > discussed below (`Userspace delivery and OOM`_). > > Sorry, if this reply is a bit off-topic. It is very much on topic IMHO :-) > I'm working on implementation of RX zero-copy for virtio and I've dedicated > some thought about making guest memory available for physical NIC DMAs. > I believe this is quite related to your page_pool proposal, at least from > the NIC driver perspective, so I'd like to share some thoughts here. Seems quite related. I'm very interested in cooperating with you! I'm not very familiar with virtio, and how packets/pages gets channeled into virtio. > The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. > using macvtap, and then propagate guest RX memory allocations to the NIC > using something like new .ndo_set_rx_buffers method. I believe the page_pool API/design aligns with this idea/use-case. > What is your view about interface between the page_pool and the NIC > drivers? In my Prove-of-Concept implementation, the NIC driver (mlx5) register a page_pool per RX queue. This is done for two reasons (1) performance and (2) for supporting use-cases where only one single RX-ring queue is (re)configured to support RX-zero-copy. There are some associated extra cost of enabling this mode, thus it makes sense to only enable it when needed. I've not decided how this gets enabled, maybe some new driver NDO. It could also happen when a XDP program gets loaded, which request this feature. The macvtap solution is nice and we should support it, but it requires VM to have their MAC-addr registered on the physical switch. This design is about adding flexibility. Registering an XDP eBPF filter provides the maximum flexibility for matching the destination VM. > Have you considered using "push" model for setting the NIC's RX memory? I don't understand what you mean by a "push" model? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
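A rough sketch of the per-RX-queue page_pool registration described above; field and function names follow the page_pool API that later landed upstream, the 2016 prototype-kernel API may differ, and the zero-copy mode itself is not shown.

#include <net/page_pool.h>
#include <linux/numa.h>
#include <linux/dma-mapping.h>

/* Sketch: one page_pool per RX queue, as in the mlx5 proof-of-concept. */
static struct page_pool *rxq_create_pool(struct device *dma_dev,
					 unsigned int ring_size)
{
	struct page_pool_params pp = {
		.flags     = PP_FLAG_DMA_MAP,	/* pool keeps pages DMA-mapped */
		.order     = 0,
		.pool_size = ring_size,
		.nid       = NUMA_NO_NODE,
		.dev       = dma_dev,
		.dma_dir   = DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp);	/* ERR_PTR() on failure */
}

/* RX refill: recycled pages come back without being re-mapped, which is
 * where the fast-path saving comes from. */
static struct page *rxq_get_page(struct page_pool *pool)
{
	return page_pool_dev_alloc_pages(pool);
}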
Re: Designing a safe RX-zero-copy Memory Model for Networking
Hello Jesper, On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: > Hi all, > > This is my design for how to safely handle RX zero-copy in the network > stack, by using page_pool[1] and modifying NIC drivers. Safely means > not leaking kernel info in pages mapped to userspace and resilience > so a malicious userspace app cannot crash the kernel. > > Design target > = > > Allow the NIC to function as a normal Linux NIC and be shared in a > safe manor, between the kernel network stack and an accelerated > userspace application using RX zero-copy delivery. > > Target is to provide the basis for building RX zero-copy solutions in > a memory safe manor. An efficient communication channel for userspace > delivery is out of scope for this document, but OOM considerations are > discussed below (`Userspace delivery and OOM`_). Sorry, if this reply is a bit off-topic. I'm working on implementation of RX zero-copy for virtio and I've dedicated some thought about making guest memory available for physical NIC DMAs. I believe this is quite related to your page_pool proposal, at least from the NIC driver perspective, so I'd like to share some thoughts here. The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. using macvtap, and then propagate guest RX memory allocations to the NIC using something like new .ndo_set_rx_buffers method. What is your view about interface between the page_pool and the NIC drivers? Have you considered using "push" model for setting the NIC's RX memory? > > -- > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer > > Above document is taken at GitHub commit 47fa7c844f48fab8b > https://github.com/netoptimizer/prototype-kernel/commit/47fa7c844f48fab8b > -- Sincerely yours, Mike.
Designing a safe RX-zero-copy Memory Model for Networking
Hi all,

This is my design for how to safely handle RX zero-copy in the network stack, by using page_pool[1] and modifying NIC drivers. Safely means not leaking kernel info in pages mapped to userspace and resilience so a malicious userspace app cannot crash the kernel.

It is only a design, and thus the purpose is for you to find any holes in this design ;-)

Below text is also available as html, see[2].

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/design.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool changes the memory model for networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application requests zero-copy RX, then the driver must use a specific SKB allocation mode and might have to reconfigure the RX-ring.

Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a safe manner, between the kernel network stack and an accelerated userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in a memory-safe manner. An efficient communication channel for userspace delivery is out of scope for this document, but OOM considerations are discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure for network packets in the Linux Kernel network stack. It is a fairly complex object and can be constructed in several ways.

From a memory perspective there are two ways, depending on RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full advantage of the page_pool, the drivers must actually support handling both options, depending on the configuration state of the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly straightforward. The SKB->data (and skb->head) can point directly to the page data, adjusting the offset according to the driver's headroom (for adding headers) and setting the length according to the DMA descriptor info.

The page/data needs to be writable, because the network stack needs to adjust headers (like TimeToLive and checksum) or even add or remove headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB also has an accompanying "shared info" data-structure ``struct skb_shared_info``. This "skb_shared_info" is written into the skb->data memory area at the end (skb->end) of the (header) data. The skb_shared_info contains semi-sensitive information, like kernel memory pointers to other pages (which might be pointers to more packet data). Leaking this kind of information would be bad from a zero-copy point of view.

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is significantly more complicated and even involves one more memory allocation.

1) Allocate a new separate writable memory area, and point skb->data here. This is needed due to the (above described) skb_shared_info.
2) Memcpy packet headers into this (skb->data) area.
3) Clear part of the skb_shared_info struct in the writable area.
4) Setup pointer to packet-data in the page (in skb_shared_info->frags) and adjust the page_offset to be past the headers just copied.

It is useful (later) that the network stack has this notion that part of the packet and a page can be read-only.
This implies that the kernel will not "pollute" this memory with any sensitive information. This is good from a zero-copy point of view, but bad from a performance perspective.

NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace. This involves costly mapping and unmapping operations in the address space of the userspace process. Plus, for doing this safely, the page memory needs to be cleared before using it, to avoid leaking kernel information to userspace, which is also a costly operation. The page_pool base "class" of optimization is moving these kinds of operations out of the fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool has been configured for zero-copy into userspace, can packets still be allowed to travel the normal stack?

Yes, this should be possible, because the driver can use the SKB-read-only mode, which avoids polluting the page data with kernel-side sensitive data. This implies that when a driver RX-queue switches its page_pool to RX-zero-copy mode, it MUST also switch to SKB-read-only mode (for normal stack delivery for this RXq).

XDP can be used to control which pages get RX zero-copied to userspace. The page is still writable for the XDP program, but read-only for normal stack delivery.
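To make the read-only packet page mode concrete, a rough driver-side sketch of the four construction steps, using existing kernel helpers; page reference counting and DMA sync are omitted, and real drivers differ in the details.

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/skbuff.h>

/* Build an SKB for a read-only RX page: headers are copied into a small
 * writable area, the payload stays in the (read-only) page as a frag.
 * Assumes a lowmem page and len > hdr_len. */
static struct sk_buff *build_read_only_skb(struct napi_struct *napi,
					   struct page *page,
					   unsigned int offset,
					   unsigned int len,
					   unsigned int hdr_len,
					   unsigned int truesize)
{
	/* 1) separate writable area for skb->data + skb_shared_info */
	struct sk_buff *skb = napi_alloc_skb(napi, hdr_len);
	if (!skb)
		return NULL;

	/* 2) copy only the packet headers into the writable area */
	memcpy(skb_put(skb, hdr_len), page_address(page) + offset, hdr_len);

	/* 3) skb_shared_info in the writable area was already initialized
	 *    (cleared) by the SKB allocation above */

	/* 4) point a frag at the payload in the page, past the copied headers */
	skb_add_rx_frag(skb, 0, page, offset + hdr_len, len - hdr_len,
			truesize);
	return skb;
}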