Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-14 Thread Pavel Begunkov

On 11/11/23 17:19, David Ahern wrote:

On 11/10/23 7:26 AM, Pavel Begunkov wrote:

On 11/7/23 23:03, Mina Almasry wrote:

On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:


On 11/7/23 3:10 PM, Mina Almasry wrote:

On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:


On 11/5/23 7:44 PM, Mina Almasry wrote:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eeeda849115c..1c351c138a5b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
   };

   #ifdef CONFIG_DMA_SHARED_BUFFER
+struct page_pool_iov *
+netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
+void netdev_free_devmem(struct page_pool_iov *ppiov);


netdev_{alloc,free}_dmabuf?



Can do.


I say that because a dmabuf can be host memory, at least I am not
aware
of a restriction that a dmabuf is device memory.



In my limited experience dma-buf is generally device memory, and
that's really its use case. CONFIG_UDMABUF is a driver that mocks
dma-buf with a memfd which I think is used for testing. But I can do
the rename, it's more clear anyway, I think.


config UDMABUF
  bool "userspace dmabuf misc driver"
  default n
  depends on DMA_SHARED_BUFFER
  depends on MEMFD_CREATE || COMPILE_TEST
  help
    A driver to let userspace turn memfd regions into dma-bufs.
    Qemu can use this to create host dmabufs for guest
framebuffers.


Qemu is just a userspace process; it is no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set.


I agree actually, and I was about to make that comment to David Wei's
series once I have the time.

David, your io_uring RX zerocopy proposal actually works with devmem
TCP, if you're inclined to do that instead, what you'd do roughly is
(I think):

That would be a Frankenstein's monster api with no good reason for it.


It brings a consistent API from a networking perspective.

io_uring should not need to be in the page pool and memory management
business. Have you or David coded up the re-use of the socket APIs with
dmabuf to see how much smaller it makes the io_uring change - or even
walked through from a theoretical perspective?


Yes, we did the mental exercise, which is why we're converting to pp.
I don't see many opportunities for reuse for the main data path,
potentially apart from using the iov format instead of pages.

If the goal is to minimise the amount of code, it can mimic the tcp
devmem api with netlink, ioctl-ish buffer return, but that'd be a
pretty bad api for io_uring, overly complicated and limiting
optimisation options. If not, then we have to do some buffer
management in io_uring, and I don't see anything wrong with that. It
shouldn't be a burden for networking if all that extra code is
contained in io_uring and only exposed via pp ops and following
the rules.

--
Pavel Begunkov


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-11 Thread David Ahern
On 11/10/23 7:26 AM, Pavel Begunkov wrote:
> On 11/7/23 23:03, Mina Almasry wrote:
>> On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>>>
>>> On 11/7/23 3:10 PM, Mina Almasry wrote:
 On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index eeeda849115c..1c351c138a5b 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>   };
>>
>>   #ifdef CONFIG_DMA_SHARED_BUFFER
>> +struct page_pool_iov *
>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>
> netdev_{alloc,free}_dmabuf?
>

 Can do.

> I say that because a dmabuf can be host memory, at least I am not
> aware
> of a restriction that a dmabuf is device memory.
>

 In my limited experience dma-buf is generally device memory, and
 that's really its use case. CONFIG_UDMABUF is a driver that mocks
 dma-buf with a memfd which I think is used for testing. But I can do
 the rename, it's more clear anyway, I think.
>>>
>>> config UDMABUF
>>>  bool "userspace dmabuf misc driver"
>>>  default n
>>>  depends on DMA_SHARED_BUFFER
>>>  depends on MEMFD_CREATE || COMPILE_TEST
>>>  help
>>>    A driver to let userspace turn memfd regions into dma-bufs.
>>>    Qemu can use this to create host dmabufs for guest
>>> framebuffers.
>>>
>>>
>>> Qemu is just a userspace process; it is no way a special one.
>>>
>>> Treating host memory as a dmabuf should radically simplify the io_uring
>>> extension of this set.
>>
>> I agree actually, and I was about to make that comment to David Wei's
>> series once I have the time.
>>
>> David, your io_uring RX zerocopy proposal actually works with devmem
>> TCP, if you're inclined to do that instead, what you'd do roughly is
>> (I think):
> That would be a Frankenstein's monster api with no good reason for it.

It brings a consistent API from a networking perspective.

io_uring should not need to be in the page pool and memory management
business. Have you or David coded up the re-use of the socket APIs with
dmabuf to see how much smaller it makes the io_uring change - or even
walked through from a theoretical perspective?




Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-10 Thread Pavel Begunkov

On 11/7/23 23:03, Mina Almasry wrote:

On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:


On 11/7/23 3:10 PM, Mina Almasry wrote:

On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:


On 11/5/23 7:44 PM, Mina Almasry wrote:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eeeda849115c..1c351c138a5b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
  };

  #ifdef CONFIG_DMA_SHARED_BUFFER
+struct page_pool_iov *
+netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
+void netdev_free_devmem(struct page_pool_iov *ppiov);


netdev_{alloc,free}_dmabuf?



Can do.


I say that because a dmabuf can be host memory, at least I am not aware
of a restriction that a dmabuf is device memory.



In my limited experience dma-buf is generally device memory, and
that's really its use case. CONFIG_UDMABUF is a driver that mocks
dma-buf with a memfd which I think is used for testing. But I can do
the rename, it's more clear anyway, I think.


config UDMABUF
 bool "userspace dmabuf misc driver"
 default n
 depends on DMA_SHARED_BUFFER
 depends on MEMFD_CREATE || COMPILE_TEST
 help
   A driver to let userspace turn memfd regions into dma-bufs.
   Qemu can use this to create host dmabufs for guest framebuffers.


Qemu is just a userspace process; it is no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set.


I agree actually, and I was about to make that comment to David Wei's
series once I have the time.

David, your io_uring RX zerocopy proposal actually works with devmem
TCP, if you're inclined to do that instead, what you'd do roughly is
(I think):

That would be a Frankenstein's monster api with no good reason for it.
You bind memory via netlink because you don't have a proper context to
work with otherwise, io_uring serves as the context with a separate and
precise abstraction around queues. Same with dmabufs, it totally makes
sense for device memory, but wrapping host memory into a file just to
immediately unwrap it back with no particular benefits from doing so
doesn't seem like a good uapi. Currently, the difference will be
hidden by io_uring.

And we'd still need to have a hook in pp's get page to grab buffers from
the buffer ring instead of refilling via SO_DEVMEM_DONTNEED and a
callback for when skbs are dropped. It's just instead of a new pp ops
it'll be a branch in the devmem path. io_uring might want to use the
added iov format in the future for device memory or even before that,
io_uring doesn't really care whether it's pages or not.

It's also my big concern from how many optimisations it'll fence us off.
With the current io_uring RFC I can get rid of all buffer atomic
refcounting and replace it with a single percpu counting per skb.
Hopefully, that will be doable after we place it on top of pp providers.



- Allocate a memfd,
- Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
- Bind the dma-buf to the NIC using the netlink API in this RFC.
- Your io_uring extensions and io_uring uapi should work as-is almost
on top of this series, I think.

If you do this the incoming packets should land into your memfd, which
may or may not work for you. In the future if you feel inclined to use
device memory, this approach that I'm describing here would be more
extensible to device memory, because you'd already be using dma-bufs
for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
with another.


That the io_uring set needs to dive into
page_pools is just wrong - complicating the design and code and pushing
io_uring into a realm it does not need to be involved in.


I disagree. How does it complicate it? io_uring will be just a yet another
provider implementing the callbacks of the API created for such use cases
and not changing common pp/net bits. The rest of code is in io_uring
implementing interaction with userspace and other usability features, but
there will be anyway some amount of code if we want to have a convenient
and performant api via io_uring.




Most (all?) of this patch set can work with any memory; only device
memory is unreadable.


--
Pavel Begunkov


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-09 Thread Paolo Abeni
On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
[...]
> +void netdev_free_devmem(struct page_pool_iov *ppiov)
> +{
> + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> +
> + refcount_set(>refcount, 1);
> +
> + if (gen_pool_has_addr(binding->chunk_pool,
> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
> + gen_pool_free(binding->chunk_pool,
> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE);

Minor nit: what about caching the dma_addr value to make the above more
readable?

Cheers,

Paolo



Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-08 Thread Mina Almasry
> > On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin  wrote:
> >>
> >> On 2023/11/6 10:44, Mina Almasry wrote:
> >>> +
> >>> +void netdev_free_devmem(struct page_pool_iov *ppiov)
> >>> +{
> >>> + struct netdev_dmabuf_binding *binding = 
> >>> page_pool_iov_binding(ppiov);
> >>> +
> >>> + refcount_set(>refcount, 1);
> >>> +
> >>> + if (gen_pool_has_addr(binding->chunk_pool,
> >>> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
> >>
> >> When gen_pool_has_addr() returns false, does it mean something has gone
> >> really wrong here?
> >>
> >
> > Yes, good eye. gen_pool_has_addr() should never return false, but then
> > again, gen_pool_free()  BUG_ON()s if it doesn't find the address,
> > which is an extremely severe reaction to what can be a minor bug in
> > the accounting. I prefer to leak rather than crash the machine. It's a
> > bit of defensive programming that is normally frowned upon, but I feel
> > like in this case it's maybe warranted due to the very severe reaction
> > (BUG_ON).
>
> I would argue that why is the above defensive programming not done in the
> gen_pool core:)
>

I think gen_pool is not really not that new, and suggesting removing
the BUG_ONs must have been proposed before and rejected. I'll try to
do some research and maybe suggest downgrading the BUG_ON to WARN_ON,
but my guess is there is some reason the maintainer wants it to be a
BUG_ON.

On Wed, Nov 8, 2023 at 5:00 PM David Wei  wrote:
>
> On 2023-11-07 14:55, David Ahern wrote:
> > On 11/7/23 3:10 PM, Mina Almasry wrote:
> >> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
> >>>
> >>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>  diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>  index eeeda849115c..1c351c138a5b 100644
>  --- a/include/linux/netdevice.h
>  +++ b/include/linux/netdevice.h
>  @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>   };
> 
>   #ifdef CONFIG_DMA_SHARED_BUFFER
>  +struct page_pool_iov *
>  +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>  +void netdev_free_devmem(struct page_pool_iov *ppiov);
> >>>
> >>> netdev_{alloc,free}_dmabuf?
> >>>
> >>
> >> Can do.
> >>
> >>> I say that because a dmabuf can be host memory, at least I am not aware
> >>> of a restriction that a dmabuf is device memory.
> >>>
> >>
> >> In my limited experience dma-buf is generally device memory, and
> >> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> >> dma-buf with a memfd which I think is used for testing. But I can do
> >> the rename, it's more clear anyway, I think.
> >
> > config UDMABUF
> > bool "userspace dmabuf misc driver"
> > default n
> > depends on DMA_SHARED_BUFFER
> > depends on MEMFD_CREATE || COMPILE_TEST
> > help
> >   A driver to let userspace turn memfd regions into dma-bufs.
> >   Qemu can use this to create host dmabufs for guest framebuffers.
> >
> >
> > Qemu is just a userspace process; it is no way a special one.
> >
> > Treating host memory as a dmabuf should radically simplify the io_uring
> > extension of this set. That the io_uring set needs to dive into
> > page_pools is just wrong - complicating the design and code and pushing
> > io_uring into a realm it does not need to be involved in.
>
> I think our io_uring proposal will already be vastly simplified once we
> rebase onto Kuba's page pool memory provider API. Using udmabuf means
> depending on a driver designed for testing, vs io_uring's registered
> buffers API that's been tried and tested.
>

FWIW I also get an impression that udmabuf is mostly targeting
testing, but I'm not aware of any deficiency that makes it concretely
unsuitable for you. You be the judge.

The only quirk of udmabuf I'm aware of is that it seems to cap the max
dma-buf size to 16000 pages. Not sure if that's due to a genuine
technical limitation or just convenience.

> I don't have an intuitive understanding of the trade offs yet, and would
> need to try out udmabuf and compare vs say using our own page pool
> memory provider.
>


On Wed, Nov 8, 2023 at 5:15 PM David Wei  wrote:
> How would TCP devmem change if we no longer assume that dmabuf is device
> memory?

It wouldn't. The code already never assumes that dmabuf is device
memory. Any dma-buf should work, as far as I can tell. I'm also quite
confident udmabuf works, I use it for testing.

(Jason Gunthrope is much more of an expert and may chime in to say
'some dma-buf will not work'. My primitive understanding is that we're
using dma-bufs without any quirks and any dma-buf should work. I of
course haven't tested all dma-bufs :D)

> Pavel will know more on the perf side, but I wouldn't want to
> put any if/else on the hot path if we can avoid it. I could be wrong,
> but right now in my mind using different memory providers solves this
> neatly and the driver/networking stack doesn't need to care.
>
> Mina, I believe you said at NetDev conf 

Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-08 Thread David Wei
On 2023-11-07 15:03, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>>
>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:

 On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

 netdev_{alloc,free}_dmabuf?

>>>
>>> Can do.
>>>
 I say that because a dmabuf can be host memory, at least I am not aware
 of a restriction that a dmabuf is device memory.

>>>
>>> In my limited experience dma-buf is generally device memory, and
>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>> dma-buf with a memfd which I think is used for testing. But I can do
>>> the rename, it's more clear anyway, I think.
>>
>> config UDMABUF
>> bool "userspace dmabuf misc driver"
>> default n
>> depends on DMA_SHARED_BUFFER
>> depends on MEMFD_CREATE || COMPILE_TEST
>> help
>>   A driver to let userspace turn memfd regions into dma-bufs.
>>   Qemu can use this to create host dmabufs for guest framebuffers.
>>
>>
>> Qemu is just a userspace process; it is no way a special one.
>>
>> Treating host memory as a dmabuf should radically simplify the io_uring
>> extension of this set.
> 
> I agree actually, and I was about to make that comment to David Wei's
> series once I have the time.
> 
> David, your io_uring RX zerocopy proposal actually works with devmem
> TCP, if you're inclined to do that instead, what you'd do roughly is
> (I think):
> 
> - Allocate a memfd,
> - Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
> - Bind the dma-buf to the NIC using the netlink API in this RFC.
> - Your io_uring extensions and io_uring uapi should work as-is almost
> on top of this series, I think.
> 
> If you do this the incoming packets should land into your memfd, which
> may or may not work for you. In the future if you feel inclined to use
> device memory, this approach that I'm describing here would be more
> extensible to device memory, because you'd already be using dma-bufs
> for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
> with another.
> 

How would TCP devmem change if we no longer assume that dmabuf is device
memory? Pavel will know more on the perf side, but I wouldn't want to
put any if/else on the hot path if we can avoid it. I could be wrong,
but right now in my mind using different memory providers solves this
neatly and the driver/networking stack doesn't need to care.

Mina, I believe you said at NetDev conf that you already had an udmabuf
implementation for testing. I would like to see this (you can send
privately) to see how TCP devmem would handle both user memory and
device memory.

>> That the io_uring set needs to dive into
>> page_pools is just wrong - complicating the design and code and pushing
>> io_uring into a realm it does not need to be involved in.
>>
>> Most (all?) of this patch set can work with any memory; only device
>> memory is unreadable.
>>
>>
> 
> 


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-08 Thread David Wei
On 2023-11-07 14:55, David Ahern wrote:
> On 11/7/23 3:10 PM, Mina Almasry wrote:
>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>>
>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index eeeda849115c..1c351c138a5b 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
  };

  #ifdef CONFIG_DMA_SHARED_BUFFER
 +struct page_pool_iov *
 +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
 +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>
>>> netdev_{alloc,free}_dmabuf?
>>>
>>
>> Can do.
>>
>>> I say that because a dmabuf can be host memory, at least I am not aware
>>> of a restriction that a dmabuf is device memory.
>>>
>>
>> In my limited experience dma-buf is generally device memory, and
>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>> dma-buf with a memfd which I think is used for testing. But I can do
>> the rename, it's more clear anyway, I think.
> 
> config UDMABUF
> bool "userspace dmabuf misc driver"
> default n
> depends on DMA_SHARED_BUFFER
> depends on MEMFD_CREATE || COMPILE_TEST
> help
>   A driver to let userspace turn memfd regions into dma-bufs.
>   Qemu can use this to create host dmabufs for guest framebuffers.
> 
> 
> Qemu is just a userspace process; it is no way a special one.
> 
> Treating host memory as a dmabuf should radically simplify the io_uring
> extension of this set. That the io_uring set needs to dive into
> page_pools is just wrong - complicating the design and code and pushing
> io_uring into a realm it does not need to be involved in.

I think our io_uring proposal will already be vastly simplified once we
rebase onto Kuba's page pool memory provider API. Using udmabuf means
depending on a driver designed for testing, vs io_uring's registered
buffers API that's been tried and tested.

I don't have an intuitive understanding of the trade offs yet, and would
need to try out udmabuf and compare vs say using our own page pool
memory provider.

> 
> Most (all?) of this patch set can work with any memory; only device
> memory is unreadable.
> 
> 


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Yunsheng Lin
On 2023/11/8 6:10, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>
>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index eeeda849115c..1c351c138a5b 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>  };
>>>
>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>> +struct page_pool_iov *
>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>
>> netdev_{alloc,free}_dmabuf?
>>
> 
> Can do.
> 
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>>
> 
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.
> 
> On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin  wrote:
>>
>> On 2023/11/6 10:44, Mina Almasry wrote:
>>> +
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov)
>>> +{
>>> + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
>>> +
>>> + refcount_set(>refcount, 1);
>>> +
>>> + if (gen_pool_has_addr(binding->chunk_pool,
>>> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
>>
>> When gen_pool_has_addr() returns false, does it mean something has gone
>> really wrong here?
>>
> 
> Yes, good eye. gen_pool_has_addr() should never return false, but then
> again, gen_pool_free()  BUG_ON()s if it doesn't find the address,
> which is an extremely severe reaction to what can be a minor bug in
> the accounting. I prefer to leak rather than crash the machine. It's a
> bit of defensive programming that is normally frowned upon, but I feel
> like in this case it's maybe warranted due to the very severe reaction
> (BUG_ON).

I would argue that why is the above defensive programming not done in the
gen_pool core:)

> 


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Mina Almasry
On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>
> On 11/7/23 3:10 PM, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
> >>
> >> On 11/5/23 7:44 PM, Mina Almasry wrote:
> >>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>> index eeeda849115c..1c351c138a5b 100644
> >>> --- a/include/linux/netdevice.h
> >>> +++ b/include/linux/netdevice.h
> >>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
> >>>  };
> >>>
> >>>  #ifdef CONFIG_DMA_SHARED_BUFFER
> >>> +struct page_pool_iov *
> >>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> >>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
> >>
> >> netdev_{alloc,free}_dmabuf?
> >>
> >
> > Can do.
> >
> >> I say that because a dmabuf can be host memory, at least I am not aware
> >> of a restriction that a dmabuf is device memory.
> >>
> >
> > In my limited experience dma-buf is generally device memory, and
> > that's really its use case. CONFIG_UDMABUF is a driver that mocks
> > dma-buf with a memfd which I think is used for testing. But I can do
> > the rename, it's more clear anyway, I think.
>
> config UDMABUF
> bool "userspace dmabuf misc driver"
> default n
> depends on DMA_SHARED_BUFFER
> depends on MEMFD_CREATE || COMPILE_TEST
> help
>   A driver to let userspace turn memfd regions into dma-bufs.
>   Qemu can use this to create host dmabufs for guest framebuffers.
>
>
> Qemu is just a userspace process; it is no way a special one.
>
> Treating host memory as a dmabuf should radically simplify the io_uring
> extension of this set.

I agree actually, and I was about to make that comment to David Wei's
series once I have the time.

David, your io_uring RX zerocopy proposal actually works with devmem
TCP, if you're inclined to do that instead, what you'd do roughly is
(I think):

- Allocate a memfd,
- Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
- Bind the dma-buf to the NIC using the netlink API in this RFC.
- Your io_uring extensions and io_uring uapi should work as-is almost
on top of this series, I think.

If you do this the incoming packets should land into your memfd, which
may or may not work for you. In the future if you feel inclined to use
device memory, this approach that I'm describing here would be more
extensible to device memory, because you'd already be using dma-bufs
for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
with another.

> That the io_uring set needs to dive into
> page_pools is just wrong - complicating the design and code and pushing
> io_uring into a realm it does not need to be involved in.
>
> Most (all?) of this patch set can work with any memory; only device
> memory is unreadable.
>
>


-- 
Thanks,
Mina


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread David Ahern
On 11/7/23 3:10 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>
>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index eeeda849115c..1c351c138a5b 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>  };
>>>
>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>> +struct page_pool_iov *
>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>
>> netdev_{alloc,free}_dmabuf?
>>
> 
> Can do.
> 
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>>
> 
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.

config UDMABUF
bool "userspace dmabuf misc driver"
default n
depends on DMA_SHARED_BUFFER
depends on MEMFD_CREATE || COMPILE_TEST
help
  A driver to let userspace turn memfd regions into dma-bufs.
  Qemu can use this to create host dmabufs for guest framebuffers.


Qemu is just a userspace process; it is no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set. That the io_uring set needs to dive into
page_pools is just wrong - complicating the design and code and pushing
io_uring into a realm it does not need to be involved in.

Most (all?) of this patch set can work with any memory; only device
memory is unreadable.




Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index eeeda849115c..1c351c138a5b 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
> >  };
> >
> >  #ifdef CONFIG_DMA_SHARED_BUFFER
> > +struct page_pool_iov *
> > +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> > +void netdev_free_devmem(struct page_pool_iov *ppiov);
>
> netdev_{alloc,free}_dmabuf?
>

Can do.

> I say that because a dmabuf can be host memory, at least I am not aware
> of a restriction that a dmabuf is device memory.
>

In my limited experience dma-buf is generally device memory, and
that's really its use case. CONFIG_UDMABUF is a driver that mocks
dma-buf with a memfd which I think is used for testing. But I can do
the rename, it's more clear anyway, I think.

On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin  wrote:
>
> On 2023/11/6 10:44, Mina Almasry wrote:
> > +
> > +void netdev_free_devmem(struct page_pool_iov *ppiov)
> > +{
> > + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> > +
> > + refcount_set(>refcount, 1);
> > +
> > + if (gen_pool_has_addr(binding->chunk_pool,
> > +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
>
> When gen_pool_has_addr() returns false, does it mean something has gone
> really wrong here?
>

Yes, good eye. gen_pool_has_addr() should never return false, but then
again, gen_pool_free()  BUG_ON()s if it doesn't find the address,
which is an extremely severe reaction to what can be a minor bug in
the accounting. I prefer to leak rather than crash the machine. It's a
bit of defensive programming that is normally frowned upon, but I feel
like in this case it's maybe warranted due to the very severe reaction
(BUG_ON).

-- 
Thanks,
Mina


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Yunsheng Lin
On 2023/11/6 10:44, Mina Almasry wrote:
> +
> +void netdev_free_devmem(struct page_pool_iov *ppiov)
> +{
> + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> +
> + refcount_set(>refcount, 1);
> +
> + if (gen_pool_has_addr(binding->chunk_pool,
> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))

When gen_pool_has_addr() returns false, does it mean something has gone
really wrong here?

> + gen_pool_free(binding->chunk_pool,
> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE);
> +
> + netdev_devmem_binding_put(binding);
> +}
> +
>  void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
>  {
>   struct netdev_rx_queue *rxq;
> 


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>  
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

netdev_{alloc,free}_dmabuf?

I say that because a dmabuf can be host memory, at least I am not aware
of a restriction that a dmabuf is device memory.