Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-23 Thread David Wei
On 17/08/2023 15:18, Mina Almasry wrote:
> On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov wrote:
>>
>> On 8/14/23 02:12, David Ahern wrote:
>>> On 8/9/23 7:57 PM, Mina Almasry wrote:
 Changes in RFC v2:
 --
>> ...
 ** Test Setup

 Kernel: net-next with this RFC and memory provider API cherry-picked
 locally.

 Hardware: Google Cloud A3 VMs.

 NIC: GVE with header split & RSS & flow steering support.
>>>
>>> This set seems to depend on Jakub's memory provider patches and a netdev
>>> driver change which is not included. For the testing mentioned here, you
>>> must have a tree + branch with all of the patches. Is it publicly available?
>>>
>>> It would be interesting to see how well (easy) this integrates with
>>> io_uring. Besides avoiding all of the syscalls for receiving the iov and
>>> releasing the buffers back to the pool, io_uring also brings in the
>>> ability to seed a page_pool with registered buffers which provides a
>>> means to get simpler Rx ZC for host memory.
>>
>> The patchset sounds pretty interesting. I've been working with David Wei
>> (CC'ing) on io_uring zc rx (currently at the prototype polishing stage),
>> which takes a similar approach based on allocating an rx queue. It targets
>> host memory, with device memory as an extra feature; the uapi is different,
>> and lifetimes are managed/bound to io_uring. Completions/buffers are
>> returned to the user via a separate queue instead of cmsg, and pushed back
>> to the kernel granularly via another queue. I'll leave it to David to
>> elaborate.
>>
>> It sounds like we have space for collaboration here, if not merging then
>> reusing internals as much as we can, but we'd need to look into the
>> details deeper.
>>
> 
> I'm happy to look at your implementation and collaborate on something
> that works for both use cases. Feel free to share the unpolished prototype
> so I can start forming a general idea, if possible.

Hi, I'm David and I'm working with Pavel on this. We will have something to
share with you on the mailing list before the end of the week.

I'm also preparing a submission for NetDev conf. I wonder if you and others at
Google plan to present there as well? If so, then we may want to coordinate our
submissions and talks (if accepted).

Please let me know this week, thanks!

> 
>>> Overall I like the intent and possibilities for extensions, but a lot of
>>> details are missing - perhaps some are answered by seeing an end-to-end
>>> implementation.
>>
>> --
>> Pavel Begunkov
> 
> 
> 


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-23 Thread David Ahern
On 8/23/23 3:52 PM, David Wei wrote:
> I'm also preparing a submission for NetDev conf. I wonder if you and others at
> Google plan to present there as well? If so, then we may want to coordinate
> our submissions and talks (if accepted).

Personally, I see them as related but separate topics: Mina's proposal is
infra that io_uring builds on. Both are interesting and needed
discussions.


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-18 Thread Pavel Begunkov

On 8/14/23 02:12, David Ahern wrote:

On 8/9/23 7:57 PM, Mina Almasry wrote:

Changes in RFC v2:
--

...

** Test Setup

Kernel: net-next with this RFC and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.


This set seems to depend on Jakub's memory provider patches and a netdev
driver change which is not included. For the testing mentioned here, you
must have a tree + branch with all of the patches. Is it publicly available?

It would be interesting to see how well (easy) this integrates with
io_uring. Besides avoiding all of the syscalls for receiving the iov and
releasing the buffers back to the pool, io_uring also brings in the
ability to seed a page_pool with registered buffers which provides a
means to get simpler Rx ZC for host memory.


The patchset sounds pretty interesting. I've been working with David Wei
(CC'ing) on io_uring zc rx (currently at the prototype polishing stage),
which takes a similar approach based on allocating an rx queue. It targets
host memory, with device memory as an extra feature; the uapi is different,
and lifetimes are managed/bound to io_uring. Completions/buffers are returned
to the user via a separate queue instead of cmsg, and pushed back to the
kernel granularly via another queue. I'll leave it to David to elaborate.

It sounds like we have space for collaboration here, if not merging then
reusing internals as much as we can, but we'd need to look into the
details deeper.
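
To make the shape of that concrete, a purely illustrative sketch of what such
user-visible queue entries could look like (the names below are hypothetical
and not the actual io_uring uapi):

#include <linux/types.h>

/*
 * Purely illustrative, not the real io_uring uapi: the kernel posts one
 * entry per received payload chunk into a shared completion ring, and the
 * application returns consumed buffers through a second (refill) ring.
 */
struct zcrx_cqe {
        __u64 buf_id;   /* which provided buffer holds the payload */
        __u32 offset;   /* payload offset inside that buffer */
        __u32 len;      /* payload length in bytes */
};

struct zcrx_refill_entry {
        __u64 buf_id;   /* buffer handed back to the kernel for reuse */
};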


Overall I like the intent and possibilities for extensions, but a lot of
details are missing - perhaps some are answered by seeing an end-to-end
implementation.


--
Pavel Begunkov


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-17 Thread Mina Almasry
On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov  wrote:
>
> On 8/14/23 02:12, David Ahern wrote:
> > On 8/9/23 7:57 PM, Mina Almasry wrote:
> >> Changes in RFC v2:
> >> --
> ...
> >> ** Test Setup
> >>
> >> Kernel: net-next with this RFC and memory provider API cherry-picked
> >> locally.
> >>
> >> Hardware: Google Cloud A3 VMs.
> >>
> >> NIC: GVE with header split & RSS & flow steering support.
> >
> > This set seems to depend on Jakub's memory provider patches and a netdev
> > driver change which is not included. For the testing mentioned here, you
> > must have a tree + branch with all of the patches. Is it publicly available?
> >
> > It would be interesting to see how well (easy) this integrates with
> > io_uring. Besides avoiding all of the syscalls for receiving the iov and
> > releasing the buffers back to the pool, io_uring also brings in the
> > ability to seed a page_pool with registered buffers which provides a
> > means to get simpler Rx ZC for host memory.
>
> The patchset sounds pretty interesting. I've been working with David Wei
> (CC'ing) on io_uring zc rx (currently at the prototype polishing stage),
> which takes a similar approach based on allocating an rx queue. It targets
> host memory, with device memory as an extra feature; the uapi is different,
> and lifetimes are managed/bound to io_uring. Completions/buffers are returned
> to the user via a separate queue instead of cmsg, and pushed back to the
> kernel granularly via another queue. I'll leave it to David to elaborate.
>
> It sounds like we have space for collaboration here, if not merging then
> reusing internals as much as we can, but we'd need to look into the
> details deeper.
>

I'm happy to look at your implementation and collaborate on something
that works for both use cases. Feel free to share the unpolished prototype
so I can start forming a general idea, if possible.

> > Overall I like the intent and possibilities for extensions, but a lot of
> > details are missing - perhaps some are answered by seeing an end-to-end
> > implementation.
>
> --
> Pavel Begunkov



-- 
Thanks,
Mina


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-16 Thread Willem de Bruijn
On Tue, Aug 15, 2023 at 9:38 AM David Laight  wrote:
>
> From: Mina Almasry
> > Sent: 10 August 2023 02:58
> ...
> > * TL;DR:
> >
> > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> > from device memory efficiently, without bouncing the data to a host memory
> > buffer.
>
> Doesn't that really require peer-to-peer PCIe transfers?
> IIRC these aren't supported by many root hubs and have
> fundamental flow control and/or TLP credit problems.
>
> I'd guess they are also pretty incompatible with IOMMU?

Yes, this is a form of PCI_P2PDMA and all the limitations of that apply.

> I can see how you might manage to transmit frames from
> some external memory (eg after encryption) but surely
> processing receive data that way needs the packets
> be filtered by both IP addresses and port numbers before
> being redirected to the (presumably limited) external
> memory.

This feature depends on NIC receive header split. The TCP/IP headers
are stored to host memory, the payload to device memory.

Optionally, on devices that do not support explicit header-split, but
do support scatter-gather I/O, if the header size is constant and
known, that can be used as a weak substitute. This has additional
caveats wrt unexpected traffic for which payload must be host visible
(e.g., ICMP).

> OTOH isn't the kernel going to need to run code before
> the packet is actually sent and just after it is received?
> So all you might gain is a bit of latency?
> And a bit less utilisation of host memory??
> But if your system is really limited by cpu-memory bandwidth
> you need more cache :-)
>
>
> So how much benefit is there over efficient use of host
> memory bounce buffers??

Among other things, on a PCIe tree this makes it possible to load up
machines with many NICs + GPUs.


RE: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-15 Thread David Laight
From: Mina Almasry
> Sent: 10 August 2023 02:58
...
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.

Doesn't that really require peer-to-peer PCIe transfers?
IIRC these aren't supported by many root hubs and have
fundamental flow control and/or TLP credit problems.

I'd guess they are also pretty incompatible with IOMMU?

I can see how you might manage to transmit frames from
some external memory (eg after encryption) but surely
processing receive data that way needs the packets
be filtered by both IP addresses and port numbers before
being redirected to the (presumably limited) external
memory.

OTOH isn't the kernel going to need to run code before
the packet is actually sent and just after it is received?
So all you might gain is a bit of latency?
And a bit less utilisation of host memory??
But if your system is really limited by cpu-memory bandwidth
you need more cache :-)

So how much benefit is there over efficient use of host
memory bounce buffers??

David



Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-13 Thread Mina Almasry
On Sun, Aug 13, 2023 at 6:12 PM David Ahern  wrote:
>
> On 8/9/23 7:57 PM, Mina Almasry wrote:
> > Changes in RFC v2:
> > --
> >
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly. This is the approach proposed at a high level here[2].
> >
> > Detailed changes:
> > 1. Replaced dma-buf pages approach with importing scatterlist into the
> >page pool.
> > 2. Replace the dma-buf pages centric API with a netlink API.
> > 3. Removed the TX path implementation - there is no issue with
> >implementing the TX path with scatterlist approach, but leaving
> >out the TX path makes it easier to review.
> > 4. Functionality is tested with this proposal, but I have not conducted
> >perf testing yet. I'm not sure there are regressions, but I removed
> >perf claims from the cover letter until they can be re-confirmed.
> > 5. Added Signed-off-by: contributors to the implementation.
> > 6. Fixed some bugs with the RX path since RFC v1.
> >
> > Any feedback welcome, but specifically the biggest pending questions
> > needing feedback IMO are:
> >
> > 1. Feedback on the scatterlist-based approach in general.
> > 2. Netlink API (Patch 1 & 2).
> > 3. Approach to handle all the drivers that expect to receive pages from
> >the page pool (Patch 6).
> >
> > [1] 
> > https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb...@gmail.com/T/
> > [2] 
> > https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=h...@mail.gmail.com/
> >
> > --
> >
> > * TL;DR:
> >
> > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> > from device memory efficiently, without bouncing the data to a host memory
> > buffer.
> >
> > * Problem:
> >
> > A large number of data transfers have device memory as the source and/or
> > destination. Accelerators have drastically increased the volume of such
> > transfers.
> > Some examples include:
> > - ML accelerators transferring large amounts of training data from storage
> >   into GPU/TPU memory. In some cases ML training setup time can be as long
> >   as 50% of TPU compute time; improving data transfer throughput &
> >   efficiency can help improve GPU/TPU utilization.
> >
> > - Distributed training, where ML accelerators, such as GPUs on different
> >   hosts, exchange data among themselves.
> >
> > - Distributed raw block storage applications transfer large amounts of data
> >   with remote SSDs; much of this data does not require host processing.
> >
> > Today, the majority of Device-to-Device data transfers over the network are
> > implemented as the following low level operations: Device-to-Host copy,
> > Host-to-Host network transfer, and Host-to-Device copy.
> >
> > The implementation is suboptimal, especially for bulk data transfers, and 
> > can
> > put significant strains on system resources, such as host memory bandwidth,
> > PCIe bandwidth, etc. One important reason behind the current state is the
> > kernel’s lack of semantics to express device to network transfers.
> >
> > * Proposal:
> >
> > In this patch series we attempt to optimize this use case by implementing
> > socket APIs that enable the user to:
> >
> > 1. send device memory across the network directly, and
> > 2. receive incoming network packets directly into device memory.
> >
> > Packet _payloads_ go directly from the NIC to device memory for receive and
> > from device memory to the NIC for transmit.
> > Packet _headers_ go to/from host memory and are processed by the TCP/IP
> > stack normally. The NIC _must_ support header split to achieve this.
> >
> > Advantages:
> >
> > - Alleviate host memory bandwidth pressure, compared to existing
> >  network-transfer + device-copy semantics.
> >
> > - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
> >   of the PCIe tree, compared to the traditional path which sends data
> >   through the root complex.
> >
> > * Patch overview:
> >
> > ** Part 1: netlink API
> >
> > Gives user ability to bind dma-buf to an RX queue.
> >
> > ** Part 2: scatterlist support
> >
> > Currently the standard for device memory sharing is DMABUF, which doesn't
> > generate struct pages. On the other hand, the networking stack (skbs,
> > drivers, and page pool) operates on pages. We have 2 options:
> >
> > 1. Generate struct pages for dmabuf device memory, or,
> > 2. Modify the networking stack to process scatterlist.
> >
> > Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
> >
> > ** part 3: page pool support
> >
> > We piggyback on the page pool memory providers proposal:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > It allows the page pool to define a memory 

Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-13 Thread David Ahern
On 8/9/23 7:57 PM, Mina Almasry wrote:
> Changes in RFC v2:
> --
> 
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly. This is the approach proposed at a high level here[2].
> 
> Detailed changes:
> 1. Replaced dma-buf pages approach with importing scatterlist into the
>page pool.
> 2. Replace the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>implementing the TX path with scatterlist approach, but leaving
>out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>perf testing yet. I'm not sure there are regressions, but I removed
>perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
> 
> Any feedback welcome, but specifically the biggest pending questions
> needing feedback IMO are:
> 
> 1. Feedback on the scatterlist-based approach in general.
> 2. Netlink API (Patch 1 & 2).
> 3. Approach to handle all the drivers that expect to receive pages from
>the page pool (Patch 6).
> 
> [1] 
> https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb...@gmail.com/T/
> [2] 
> https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=h...@mail.gmail.com/
> 
> --
> 
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
> 
> * Problem:
> 
> A large number of data transfers have device memory as the source and/or
> destination. Accelerators have drastically increased the volume of such transfers.
> Some examples include:
> - ML accelerators transferring large amounts of training data from storage
>   into GPU/TPU memory. In some cases ML training setup time can be as long as
>   50% of TPU compute time; improving data transfer throughput & efficiency
>   can help improve GPU/TPU utilization.
> 
> - Distributed training, where ML accelerators, such as GPUs on different
>   hosts, exchange data among themselves.
> 
> - Distributed raw block storage applications transfer large amounts of data
>   with remote SSDs; much of this data does not require host processing.
> 
> Today, the majority of Device-to-Device data transfers over the network are
> implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
> 
> The implementation is suboptimal, especially for bulk data transfers, and can
> put significant strains on system resources, such as host memory bandwidth,
> PCIe bandwidth, etc. One important reason behind the current state is the
> kernel’s lack of semantics to express device to network transfers.
> 
> * Proposal:
> 
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
> 
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
> 
> Packet _payloads_ go directly from the NIC to device memory for receive and
> from device memory to the NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> normally. The NIC _must_ support header split to achieve this.
> 
> Advantages:
> 
> - Alleviate host memory bandwidth pressure, compared to existing
>  network-transfer + device-copy semantics.
> 
> - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>   of the PCIe tree, compared to the traditional path which sends data through the
>   root complex.
> 
> * Patch overview:
> 
> ** Part 1: netlink API
> 
> Gives user ability to bind dma-buf to an RX queue.
> 
> ** Part 2: scatterlist support
> 
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, the networking stack (skbs, drivers,
> and page pool) operates on pages. We have 2 options:
> 
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlist.
> 
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
> 
> ** part 3: page pool support
> 
> We piggyback on the page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> It allows the page pool to define a memory provider that provides the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
> 
> ** part 4: support for unreadable skb frags
> 
> Page pool iovs are not accessible by the host; we implement changes
> throughout the networking stack to 

Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-11 Thread Christian König

Am 10.08.23 um 20:44 schrieb Mina Almasry:

On Thu, Aug 10, 2023 at 3:29 AM Christian König wrote:

Am 10.08.23 um 03:57 schrieb Mina Almasry:

Changes in RFC v2:
--

The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly.

Impressive work, I didn't think this would be possible that "easily".

Please note that we have considered replacing scatterlists with simple
arrays of DMA-addresses in the DMA-buf framework to avoid people trying
to access the struct page inside the scatterlist.


FWIW, I'm not doing anything with the struct pages inside the
scatterlist. All I need from the scatterlist are the
sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array
you're describing will provide exactly those, but let me know if I
misunderstood.


Your understanding is perfectly correct.




It might be a good idea to push for that first before this here is
finally implemented.

GPU drivers already convert the scatterlist used to arrays of
DMA-addresses as soon as they get them. This leaves RDMA and V4L as the
other two main users which would need to be converted.


   This is the approach proposed at a high level here[2].

Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
 page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
 implementing the TX path with scatterlist approach, but leaving
 out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
 perf testing yet. I'm not sure there are regressions, but I removed
 perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.

Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:

1. Feedback on the scatterlist-based approach in general.

As far as I can see this sounds like the right thing to do in general.

The question is rather whether we should stick with scatterlists, use arrays of
DMA-addresses, or maybe even come up with a completely new structure.


As far as I can tell, it should be trivial to switch this device
memory TCP implementation to anything that provides:

1. DMA-addresses (sg_dma_address() equivalent)
2. lengths (sg_dma_len() equivalent)

if you go that route. Specifically, I think it will be pretty much a
localized change to netdev_bind_dmabuf_to_queue() implemented in this
patch:
https://lore.kernel.org/netdev/znulidzuvvyfy...@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f


Thanks, that's exactly what I wanted to hear.




2. Netlink API (Patch 1 & 2).

How does netlink manage the lifetime of objects?


Netlink itself doesn't handle the lifetime of the binding. However,
the API I implemented unbinds the dma-buf when the netlink socket is
destroyed. I do this so that even if the user process crashes or
forgets to unbind, the dma-buf will still be unbound once the netlink
socket is closed on the process exit. Details in this patch:
https://lore.kernel.org/netdev/znulidzuvvyfy...@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f


I need to double-check the details, but at least offhand that sounds
sufficient for the lifetime requirements of DMA-buf.


Thanks,
Christian.



On Thu, Aug 10, 2023 at 9:07 AM Jason Gunthorpe  wrote:

On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:

Am 10.08.23 um 03:57 schrieb Mina Almasry:

Changes in RFC v2:
--

The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly.

Impressive work, I didn't think this would be possible that "easily".

Please note that we have considered replacing scatterlists with simple
arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
access the struct page inside the scatterlist.

It might be a good idea to push for that first before this here is finally
implemented.

GPU drivers already convert the scatterlist used to arrays of DMA-addresses
as soon as they get them. This leaves RDMA and V4L as the other two main
users which would need to be converted.

Oh that would be a nightmare for RDMA.

We need a standards-based way to have scalable lists of DMA addresses :(


2. Netlink API (Patch 1 & 2).

How does netlink manage the lifetime of objects?

And access control..


Someone will correct me if I'm wrong but I'm not sure netlink itself
will do (sufficient) access control. 

Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-10 Thread Mina Almasry
On Thu, Aug 10, 2023 at 11:58 AM Jason Gunthorpe  wrote:
>
> On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote:
>
> > Someone will correct me if I'm wrong but I'm not sure netlink itself
> > will do (sufficient) access control. However I meant for the netlink
> > API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> > this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> > check in netdev_bind_dmabuf_to_queue() in the next iteration.
>
> Can some other process that does not have the netlink fd manage to
> recv packets that were stored into the dmabuf?
>

The process needs to have the dma-buf fd to receive packets, and not
necessarily the netlink fd. It should be possible for:

1. a CAP_NET_ADMIN process to create a dma-buf, bind it using a
netlink fd, then share the dma-buf with another normal process that
receives packets on it.
2. a normal process to create a dma-buf, share it with a privileged
CAP_NET_ADMIN process that creates the binding via the netlink fd,
then the owner of the dma-buf can receive data on the dma-buf fd.
3. a CAP_NET_ADMIN creates the dma-buf and creates the binding itself
and receives data.

We in particular plan to use devmem TCP in the first mode, but this
detail is specific to us so I've largely neglected to describe it
in the cover letter. If our setup is interesting:
the CAP_NET_ADMIN process I describe in #1 is a 'tcpdevmem daemon'
which allocates the GPU memory, creates a dma-buf, creates an RX queue
binding, and shares the dma-buf with the ML application(s) running on
our instance. The ML application receives data on the dma-buf via
recvmsg().

The 'tcpdevmem daemon' takes care of the binding but also configures
RSS & flow steering. The dma-buf fd sharing happens over a unix domain
socket.
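
For reference, that fd handoff is ordinary SCM_RIGHTS file-descriptor passing
over the unix domain socket; a minimal sketch of the sending side (error
handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an already-open dma-buf fd to a peer over a connected AF_UNIX socket. */
static int send_dmabuf_fd(int unix_sock, int dmabuf_fd)
{
        char byte = 'D';                          /* one dummy payload byte */
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;             /* force proper alignment */
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;             /* fd-passing control message */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &dmabuf_fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}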

-- 
Thanks,
Mina


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-10 Thread Jason Gunthorpe
On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote:

> Someone will correct me if I'm wrong but I'm not sure netlink itself
> will do (sufficient) access control. However I meant for the netlink
> API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add
> this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN
> check in netdev_bind_dmabuf_to_queue() in the next iteration.

Can some other process that does not have the netlink fd manage to
recv packets that were stored into the dmabuf?

Jason


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-10 Thread Mina Almasry
On Thu, Aug 10, 2023 at 3:29 AM Christian König wrote:
>
> Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > Changes in RFC v2:
> > --
> >
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly.
>
> Impressive work, I didn't think this would be possible that "easily".
>
> Please note that we have considered replacing scatterlists with simple
> arrays of DMA-addresses in the DMA-buf framework to avoid people trying
> to access the struct page inside the scatterlist.
>

FWIW, I'm not doing anything with the struct pages inside the
scatterlist. All I need from the scatterlist are the
sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array
you're describing will provide exactly those, but let me know if I
misunderstood.

> It might be a good idea to push for that first before this here is
> finally implemented.
>
> GPU drivers already convert the scatterlist used to arrays of
> DMA-addresses as soon as they get them. This leaves RDMA and V4L as the
> other two main users which would need to be converted.
>
> >   This is the approach proposed at a high level here[2].
> >
> > Detailed changes:
> > 1. Replaced dma-buf pages approach with importing scatterlist into the
> > page pool.
> > 2. Replace the dma-buf pages centric API with a netlink API.
> > 3. Removed the TX path implementation - there is no issue with
> > implementing the TX path with scatterlist approach, but leaving
> > out the TX path makes it easier to review.
> > 4. Functionality is tested with this proposal, but I have not conducted
> > perf testing yet. I'm not sure there are regressions, but I removed
> > perf claims from the cover letter until they can be re-confirmed.
> > 5. Added Signed-off-by: contributors to the implementation.
> > 6. Fixed some bugs with the RX path since RFC v1.
> >
> > Any feedback welcome, but specifically the biggest pending questions
> > needing feedback IMO are:
> >
> > 1. Feedback on the scatterlist-based approach in general.
>
> As far as I can see this sounds like the right thing to do in general.
>
> The question is rather whether we should stick with scatterlists, use arrays of
> DMA-addresses, or maybe even come up with a completely new structure.
>

As far as I can tell, it should be trivial to switch this device
memory TCP implementation to anything that provides:

1. DMA-addresses (sg_dma_address() equivalent)
2. lengths (sg_dma_len() equivalent)

if you go that route. Specifically, I think it will be pretty much a
localized change to netdev_bind_dmabuf_to_queue() implemented in this
patch:
https://lore.kernel.org/netdev/znulidzuvvyfy...@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f
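
For reference, the only things the binding needs from the scatterlist are the
per-entry DMA address and length; a minimal kernel-side sketch of that walk,
assuming an already DMA-mapped struct sg_table:

#include <linux/scatterlist.h>

/*
 * The devmem binding only needs the DMA address and length of each mapped
 * entry; walk a DMA-mapped sg_table and hand each pair to a consumer.
 */
static void walk_dma_ranges(struct sg_table *sgt,
                            void (*consume)(dma_addr_t addr, unsigned int len))
{
        struct scatterlist *sg;
        int i;

        for_each_sgtable_dma_sg(sgt, sg, i)
                consume(sg_dma_address(sg), sg_dma_len(sg));
}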

> > 2. Netlink API (Patch 1 & 2).
>
> How does netlink manage the lifetime of objects?
>

Netlink itself doesn't handle the lifetime of the binding. However,
the API I implemented unbinds the dma-buf when the netlink socket is
destroyed. I do this so that even if the user process crashes or
forgets to unbind, the dma-buf will still be unbound once the netlink
socket is closed on the process exit. Details in this patch:
https://lore.kernel.org/netdev/znulidzuvvyfy...@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f
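
One common kernel pattern for this kind of cleanup (not necessarily what the
patch itself uses) is a NETLINK_URELEASE notifier that fires when the owning
netlink socket is released; a rough sketch, where
netdev_unbind_dmabufs_by_portid() is a hypothetical helper:

#include <linux/netlink.h>
#include <linux/notifier.h>
#include <net/sock.h>

/* Hypothetical cleanup helper assumed to exist for this sketch. */
void netdev_unbind_dmabufs_by_portid(struct net *net, u32 portid);

static int devmem_netlink_notify(struct notifier_block *nb,
                                 unsigned long state, void *_notify)
{
        struct netlink_notify *notify = _notify;

        /* The owning netlink socket was released: tear down its bindings. */
        if (state == NETLINK_URELEASE)
                netdev_unbind_dmabufs_by_portid(notify->net, notify->portid);

        return NOTIFY_DONE;
}

static struct notifier_block devmem_netlink_nb = {
        .notifier_call = devmem_netlink_notify,
};

/* Registered once at init time with netlink_register_notifier(&devmem_netlink_nb). */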

On Thu, Aug 10, 2023 at 9:07 AM Jason Gunthorpe  wrote:
>
> On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:
> > Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > > Changes in RFC v2:
> > > --
> > >
> > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > > that attempts to resolve this by implementing scatterlist support in the
> > > networking stack, such that we can import the dma-buf scatterlist
> > > directly.
> >
> > Impressive work, I didn't think this would be possible that "easily".
> >
> > Please note that we have considered replacing scatterlists with simple
> > arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
> > access the struct page inside the scatterlist.
> >
> > It might be a good idea to push for that first before this here is finally
> > implemented.
> >
> > GPU drivers already convert the scatterlist used to arrays of DMA-addresses
> > as soon as they get them. This leaves RDMA and V4L as the other two main
> > users which would need to be converted.
>
> Oh that would be a nightmare for RDMA.
>
> We need a standards-based way to have scalable lists of DMA addresses :(
>
> > > 2. Netlink API (Patch 1 & 2).
> >
> > How does netlink manage the lifetime of objects?
>
> And access control..
>

Someone will correct me if I'm wrong but I'm not sure netlink itself
will do (sufficient) access control. However I meant for the 

Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-10 Thread Jason Gunthorpe
On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote:
> Am 10.08.23 um 03:57 schrieb Mina Almasry:
> > Changes in RFC v2:
> > --
> > 
> > The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> > that attempts to resolve this by implementing scatterlist support in the
> > networking stack, such that we can import the dma-buf scatterlist
> > directly.
> 
> Impressive work, I didn't think this would be possible that "easily".
> 
> Please note that we have considered replacing scatterlists with simple
> arrays of DMA-addresses in the DMA-buf framework to avoid people trying to
> access the struct page inside the scatterlist.
> 
> It might be a good idea to push for that first before this here is finally
> implemented.
> 
> GPU drivers already convert the scatterlist used to arrays of DMA-addresses
> as soon as they get them. This leaves RDMA and V4L as the other two main
> users which would need to be converted.

Oh that would be a nightmare for RDMA.

We need a standards-based way to have scalable lists of DMA addresses :(

> > 2. Netlink API (Patch 1 & 2).
> 
> How does netlink manage the lifetime of objects?

And access control..

Jason


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-10 Thread Christian König

Am 10.08.23 um 03:57 schrieb Mina Almasry:

Changes in RFC v2:
--

The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly.


Impressive work, I didn't think this would be possible that "easily".

Please note that we have considered replacing scatterlists with simple 
arrays of DMA-addresses in the DMA-buf framework to avoid people trying 
to access the struct page inside the scatterlist.


It might be a good idea to push for that first before this here is 
finally implemented.


GPU drivers already convert the scatterlist used to arrays of 
DMA-addresses as soon as they get them. This leaves RDMA and V4L as the 
other two main users which would need to be converted.



  This is the approach proposed at a high level here[2].

Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
implementing the TX path with scatterlist approach, but leaving
out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
perf testing yet. I'm not sure there are regressions, but I removed
perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.

Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:

1. Feedback on the scatterlist-based approach in general.


As far as I can see this sounds like the right thing to do in general.

The question is rather whether we should stick with scatterlists, use arrays of
DMA-addresses, or maybe even come up with a completely new structure.



2. Netlink API (Patch 1 & 2).


How does netlink manage the lifetime of objects?


3. Approach to handle all the drivers that expect to receive pages from
the page pool (Patch 6).


Can't say anything about that. I know TCP/IP inside out, but I'm a GPU driver
author, not a network driver author.


Regards,
Christian.



[1] 
https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb...@gmail.com/T/
[2] 
https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=h...@mail.gmail.com/

--

* TL;DR:

Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.

* Problem:

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such transfers.
Some examples include:
- ML accelerators transferring large amounts of training data from storage into
   GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
   TPU compute time; improving data transfer throughput & efficiency can help
   improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different hosts,
   exchange data among themselves.

- Distributed raw block storage applications transfer large amounts of data with
   remote SSDs; much of this data does not require host processing.

Today, the majority of Device-to-Device data transfers over the network are
implemented as the following low level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers, and can
put significant strains on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel’s lack of semantics to express device to network transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to the NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
   of the PCIe tree, compared to the traditional path which sends data through the
   root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other 

[RFC PATCH v2 00/11] Device Memory TCP

2023-08-09 Thread Mina Almasry
Changes in RFC v2:
--

The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].

Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
   page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
   implementing the TX path with scatterlist approach, but leaving
   out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
   perf testing yet. I'm not sure there are regressions, but I removed
   perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.

Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:

1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
   the page pool (Patch 6).

[1] 
https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb...@gmail.com/T/
[2] 
https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=h...@mail.gmail.com/

--

* TL;DR:

Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.

* Problem:

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such transfers.
Some examples include:
- ML accelerators transferring large amounts of training data from storage into
  GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
  TPU compute time; improving data transfer throughput & efficiency can help
  improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data among themselves.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs; much of this data does not require host processing.

Today, the majority of Device-to-Device data transfers over the network are
implemented as the following low level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers, and can
put significant strains on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel’s lack of semantics to express device to network transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to the NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
 network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
  of the PCIe tree, compared to the traditional path which sends data through the
  root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.
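
Conceptually, the bind request carries three pieces of information; the sketch
below is illustrative only, and the actual netlink family, command, and
attribute names are defined in patches 1 & 2:

/*
 * Illustrative only: the real request is expressed as netlink attributes.
 * Conceptually the bind operation needs:
 */
struct devmem_bind_request_sketch {
        int ifindex;        /* which netdev to bind on */
        int rx_queue_index; /* RX queue whose page pool the dma-buf will back */
        int dmabuf_fd;      /* dma-buf file descriptor to import */
};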

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs, drivers,
and page pool) operates on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.

** part 3: page pool support

We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
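
Conceptually, the provider is a small ops table the page pool calls into
instead of the page allocator; the sketch below is illustrative only, and the
real structure and callback signatures live in the pp-providers branch above:

#include <linux/types.h>

struct page_pool;       /* opaque here; defined by the page pool core */

/*
 * Illustrative only: the pool delegates allocation and freeing to the
 * provider, which for devmem TCP is backed by the imported dma-buf.
 */
struct pp_memory_provider_ops_sketch {
        int     (*init)(struct page_pool *pool);                 /* bind-time setup */
        void    (*destroy)(struct page_pool *pool);              /* teardown */
        void    *(*alloc)(struct page_pool *pool, gfp_t gfp);    /* hand out one chunk */
        void    (*release)(struct page_pool *pool, void *chunk); /* take a chunk back */
};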

** part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.

** Part 5: recvmsg() APIs

We define user APIs for the user to send and receive device memory.
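
To make the receive side concrete: payload stays in the bound dma-buf, and
recvmsg() hands back per-fragment descriptors via control messages instead of
copying data. A rough userspace sketch with hypothetical descriptor and cmsg
names (the real constants and layout are defined in the later patches):

#include <linux/types.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical fragment descriptor and cmsg type, for illustration only. */
struct devmem_frag_sketch {
        __u64 frag_offset;      /* payload offset inside the bound dma-buf */
        __u32 frag_size;        /* payload length */
        __u32 frag_token;       /* returned later to release the fragment */
};
#define SCM_DEVMEM_FRAG_SKETCH  0x42    /* placeholder cmsg type */

static void recv_devmem_frags(int tcp_sock)
{
        char ctrl[4096];
        struct msghdr msg = {
                .msg_control = ctrl,
                .msg_controllen = sizeof(ctrl),
        };
        struct cmsghdr *cm;

        /* Headers are consumed by the stack as usual; the payload never
         * leaves device memory, so only descriptors come back via cmsgs. */
        if (recvmsg(tcp_sock, &msg, 0) < 0)
                return;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct devmem_frag_sketch *f = (void *)CMSG_DATA(cm);

                if (cm->cmsg_type != SCM_DEVMEM_FRAG_SKETCH)
                        continue;
                printf("payload at dma-buf offset %llu, %u bytes\n",
                       (unsigned long long)f->frag_offset, f->frag_size);
        }
}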

Not included with this RFC is the GVE devmem TCP support, just to
simplify the review. Code available here if desired: