Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-09-07 Thread David Wei
On 19/08/2023 13:24, Mina Almasry wrote:
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
>  wrote:
>>
>>
>>
>> On 19/08/2023 16.08, Willem de Bruijn wrote:
>>> On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
>>>  wrote:



 On 10/08/2023 03.57, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>
> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
>

 I don't like of this approach.
 This is adding code to the PP (page_pool) fast-path in multiple places.

 I've not had time to run my usual benchmarks, which are here:

 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

> 
> Thank you for linking that, I'll try to run these against the next submission.
> 
 But I'm sure it will affect performance.

 Regardless of performance, this approach is using ptr-LSB-bits, to hide
 that page-pointer are not really struct-pages, feels like force feeding
 a solution just to use the page_pool APIs.


> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
>
> The memory providers producing page_pool_iov can set the LSB on the
> struct page* returned to the page pool.
>
> Note that instead of overloading the LSB of page pointers, we can
> instead define a new union between struct page & struct page_pool_iov and
> compact it in a new type. However, we'd need to implement the code churn
> to modify the page_pool & drivers to use this new type. For this POC
> that is not implemented (feedback welcome).
>

 I've said before, that I prefer multiplexing on page->pp_magic.
 For your page_pool_iov the layout would have to match the offset of
 pp_magic, to do this. (And if insisting on using PP infra the refcnt
 would also need to align).
>>>
>>> Perhaps I misunderstand, but this suggests continuing to using
>>> struct page to demultiplex memory type?
>>>
>>
>> (Perhaps we are misunderstanding each-other and my use of the words
>> multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
>> native language.)
>>
>> I do see the problem of depending on having a struct page, as the
>> page_pool_iov isn't related to struct page.  Having "page" in the name
>> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
>> as we all know).
>>
>> To support more allocator types, perhaps skb->pp_recycle bit need to
>> grow another bit (and be renamed skb->recycle), so we can tell allocator
>> types apart, those that are page based and those whom are not.
>>
>>
>>> I think the feedback has been strong to not multiplex yet another
>>> memory type into that struct, that is not a real page. Which is why
>>> we went into this direction. This latest series limits the impact largely
>>> to networking structures and code.
>>>
>>
>> Some what related what I'm objecting to: the "page_pool_iov" is not a
>> real page, but this getting recycled into something called "page_pool",
>> which funny enough deals with struct-pages internally and depend on the
>> struct-page-refcnt.
>>
>> Given the approach changed way from using struct page, then I also don't
>> see the connection with the page_pool. Sorry.
>>
>>> One way or another, there will be a branch and multiplexing. Whether
>>> that is in struct page, the page pool or a new netdev mem type as you
>>> propose.
>>>
>>
>> I'm asking to have this branch/multiplexing done a the call sites.
>>
>> (IMHO not changing the drivers is a pipe-dream.)
>>
> 
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
> 
> driver ---> new_layer (does multiplexing) ---> page_pool ---> mm layer
> \-->
> devmem_pool --> dma-buf layer
> 
> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
> 
> driver --> page_pool ---> memory_provider (does multiplexing) --->
> basic_provider ---> mm layer
> 
> \> devmem_provider --> dma-buf
> layer
> 
> That is the approach in this RFC.
> 
> I think proper naming that makes sense can be figured out, and is not
> a huge issue. I think in both cases we can minimize the changes to the
> drivers, maybe. In the first approach the driver will need to use the
> APIs created by the new layer. In the second approach, the driver
> continues to use page_pool APIs.
> 
> I think we need to converge on a path between 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-22 Thread Mina Almasry
On Tue, Aug 22, 2023 at 5:24 AM Jesper Dangaard Brouer
 wrote:
>
>
>
> On 22/08/2023 08.05, Mina Almasry wrote:
> > On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
> >  wrote:
> >>
> >> On 10/08/2023 03.57, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>
> >> I don't like of this approach.
> >> This is adding code to the PP (page_pool) fast-path in multiple places.
> >>
> >> I've not had time to run my usual benchmarks, which are here:
> >>
> >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> >>
> >
> > I ported over this benchmark to my tree and ran it, my results:
> >
>
> What CPU is this and GHz?  (I guess 2.6 GHz based on results).
>
> (It looks like this CPU is more efficient, instructions per cycles, than
> my E5-1650 v4 @ 3.60GHz).
>

cat /proc/cpuinfo
...
vendor_id   : GenuineIntel
cpu family  : 6
model   : 143
model name  : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping: 8
microcode   : 0x
cpu MHz : 2699.998
```

This is a vCPU on the Google Cloud A3 VMs.

> > net-next @ b44693495af8
> > https://pastebin.com/raw/JuU7UQXe
> >
> > + Jakub's memory-provider APIs:
> > https://pastebin.com/raw/StMBhetn
> >
> > + devmem TCP changes:
> > https://pastebin.com/raw/mY1L6U4r
> >
>
> Only a single cycle slowdown for "page_pool01_fast_path".
>  From 10 cycles to 11 cycles.
>
> > + intentional regression just to make sure the benchmark is working:
> > https://pastebin.com/raw/wqWhcJdG
> >
> > I don't seem to be able to detect a regression with this series as-is,
> > but I'm not that familiar with the test and may be doing something
> > wrong or misinterpreting the results. Does this look ok to you?
> >
>
> The performance results are better than I expected.  The small
> regression from 10 cycles to 11 cycles is actually 10%, but I expect
> with some likely/unlikely instrumentation we can "likely" remove this again.
>

So the patch is already optimized carefully (I hope) to put all the
devmem processing in the default unlikely path. Willem showed me that:

if (page_pool_iov())
   return handle_page_pool_iov();

return handle_page();

The handle_page() will be 'likely' by default, which removes the need
for explicit likely/unlikely. I'm not sure we can get better perf with
explicit likely/unlikey, but I can try.

> So, this change actually looks acceptable from a performance PoV.
> I still think this page_pool_iov is very invasive to page_pool, but
> maybe it is better to hide this "uglyness" inside page_pool.
>
> The test primarily tests fast-path, and you also add "if" statements to
> all the DMA operations, which is not part of this benchmark.  Perhaps we
> can add unlikely statements, or inspect (objdump) the ASM to check code
> priorities the original page based "provider".
>
> >> But I'm sure it will affect performance.
> >>
>
> Guess, I was wrong ;-)
>
> --Jesper
>
>
> >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> >> that page-pointer are not really struct-pages, feels like force feeding
> >> a solution just to use the page_pool APIs.
> >>
> >>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> The memory providers producing page_pool_iov can set the LSB on the
> >>> struct page* returned to the page pool.
> >>>
> >>> Note that instead of overloading the LSB of page pointers, we can
> >>> instead define a new union between struct page & struct page_pool_iov and
> >>> compact it in a new type. However, we'd need to implement the code churn
> >>> to modify the page_pool & drivers to use this new type. For this POC
> >>> that is not implemented (feedback welcome).
> >>>
> >>
> >> I've said before, that I prefer multiplexing on page->pp_magic.
> >> For your page_pool_iov the layout would have to match the offset of
> >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> >> would also need to align).
> >>
> >> On the allocation side, all drivers already use a driver helper
> >> page_pool_dev_alloc_pages() or we could add another (better named)
> >> helper to multiplex between other types of allocators, e.g. a devmem
> >> allocator.
> >>
> >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> >> could multiplex on pp_magic and call another API.  The API could be an
> >> extension to PP helpers, but it could also be a devmap allocator helper.
> >>
> >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> >> solution.  I really think netstack need to support different allocator
> >> types. The page pool have been leading the way, yes, but perhaps it is
> >> time to add an API layer that e.g. 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-22 Thread Jesper Dangaard Brouer




On 22/08/2023 08.05, Mina Almasry wrote:

On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
 wrote:


On 10/08/2023 03.57, Mina Almasry wrote:

Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page * into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.



I don't like of this approach.
This is adding code to the PP (page_pool) fast-path in multiple places.

I've not had time to run my usual benchmarks, which are here:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c



I ported over this benchmark to my tree and ran it, my results:



What CPU is this and GHz?  (I guess 2.6 GHz based on results).

(It looks like this CPU is more efficient, instructions per cycles, than 
my E5-1650 v4 @ 3.60GHz).



net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ Jakub's memory-provider APIs:
https://pastebin.com/raw/StMBhetn

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r



Only a single cycle slowdown for "page_pool01_fast_path".
From 10 cycles to 11 cycles.


+ intentional regression just to make sure the benchmark is working:
https://pastebin.com/raw/wqWhcJdG

I don't seem to be able to detect a regression with this series as-is,
but I'm not that familiar with the test and may be doing something
wrong or misinterpreting the results. Does this look ok to you?



The performance results are better than I expected.  The small
regression from 10 cycles to 11 cycles is actually 10%, but I expect
with some likely/unlikely instrumentation we can "likely" remove this again.

So, this change actually looks acceptable from a performance PoV.
I still think this page_pool_iov is very invasive to page_pool, but
maybe it is better to hide this "uglyness" inside page_pool.

The test primarily tests fast-path, and you also add "if" statements to
all the DMA operations, which is not part of this benchmark.  Perhaps we 
can add unlikely statements, or inspect (objdump) the ASM to check code 
priorities the original page based "provider".



But I'm sure it will affect performance.



Guess, I was wrong ;-)

--Jesper



Regardless of performance, this approach is using ptr-LSB-bits, to hide
that page-pointer are not really struct-pages, feels like force feeding
a solution just to use the page_pool APIs.



In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.

The memory providers producing page_pool_iov can set the LSB on the
struct page* returned to the page pool.

Note that instead of overloading the LSB of page pointers, we can
instead define a new union between struct page & struct page_pool_iov and
compact it in a new type. However, we'd need to implement the code churn
to modify the page_pool & drivers to use this new type. For this POC
that is not implemented (feedback welcome).



I've said before, that I prefer multiplexing on page->pp_magic.
For your page_pool_iov the layout would have to match the offset of
pp_magic, to do this. (And if insisting on using PP infra the refcnt
would also need to align).

On the allocation side, all drivers already use a driver helper
page_pool_dev_alloc_pages() or we could add another (better named)
helper to multiplex between other types of allocators, e.g. a devmem
allocator.

On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
could multiplex on pp_magic and call another API.  The API could be an
extension to PP helpers, but it could also be a devmap allocator helper.

IMHO forcing/piggy-bagging everything into page_pool is not the right
solution.  I really think netstack need to support different allocator
types. The page pool have been leading the way, yes, but perhaps it is
time to add an API layer that e.g. could be named netmem, that gives us
the multiplexing between allocators.  In that process some of page_pool
APIs would be lifted out as common blocks and others remain.

--Jesper


I have a sample implementation of adding a new page_pool_token type
in the page_pool to give a general idea here:
https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d

Full branch here:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens

(In the branches above, page_pool_iov is called devmem_slice).

Could also add static_branch to speed up the checks in page_pool_iov
memory providers are being used.

Signed-off-by: Mina Almasry 
---
   include/net/page_pool.h | 74 ++-
   net/core/page_pool.c| 85 -
   2 files changed, 131 insertions(+), 28 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 537eb36115ed..f08ca230d68e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -282,6 +282,64 @@ static inline struct page_pool_iov 
*page_to_page_pool_iov(struct page 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-22 Thread Mina Almasry
On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
 wrote:
>
>
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>

I ported over this benchmark to my tree and ran it, my results:

net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ Jakub's memory-provider APIs:
https://pastebin.com/raw/StMBhetn

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r

+ intentional regression just to make sure the benchmark is working:
https://pastebin.com/raw/wqWhcJdG

I don't seem to be able to detect a regression with this series as-is,
but I'm not that familiar with the test and may be doing something
wrong or misinterpreting the results. Does this look ok to you?

> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).
>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API.  The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution.  I really think netstack need to support different allocator
> types. The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators.  In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry 
> > ---
> >   include/net/page_pool.h | 74 ++-
> >   net/core/page_pool.c| 85 -
> >   2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov 
> > *page_to_page_pool_iov(struct page *page)
> >   return NULL;
> >   }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > + if (page_is_page_pool_iov(page))
> > + return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > + return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +unsigned int count)
> > +{
> > + if (page_is_page_pool_iov(page))
> > + return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > +   count);
> > +
> > + return page_ref_add(page, count);
> > +}
> > +
> > +static inline void page_pool_page_put_many(struct page *page,
> > +unsigned int 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-21 Thread Willem de Bruijn
On Mon, Aug 21, 2023 at 5:31 PM Jakub Kicinski  wrote:
>
> On Sat, 19 Aug 2023 12:12:16 -0400 Willem de Bruijn wrote:
> > :-) For the record, there is a prior version that added a separate type.
> >
> > I did not like the churn it brought and asked for this.
>
> It does end up looking cleaner that I personally expected, FWIW.
>
> > > Use of the LSB (or bits depending on alignment expectations) is a common
> > > trick and already done in quite a few places in the networking stack.
> > > This trick is essential to any realistic change here to incorporate gpu
> > > memory; way too much code will have unnecessary churn without it.
>
> We'll end up needing the LSB trick either way, right? The only question
> is whether the "if" is part of page pool or the caller of page pool.

Indeed. Adding layering does not remove this.

> Having seen zctap I'm afraid if we push this out of pp every provider
> will end up re-implementing page pool's recycling/caching functionality
> :(
>
> Maybe we need to "fork" the API? The device memory "ifs" are only needed
> for data pages. Which means that we can retain a faster, "if-less" API
> for headers and XDP. Or is that too much duplication?

I don't think that would be faster. Just a different structuring of
the code. We still need to take one of multiple paths for, say, page
allocation (page_pool_alloc_pages).

If having a struct page_pool and struct mem_pool, there would still be
a type-based branch. But now either in every caller (yech), or in some
thin shim layer. Why not just have it behind the existing API? That is
what your memory provider does. The only difference now is that one of
the providers really does not deal with pages.

I think this multiplexing does not have to introduce performance
regressions with well placed static_branch. Benchmarks and
side-by-side comparison of assembly will have to verify that.

Indirect function calls need to be avoided, of course, in favor of a
type based switch. And unless the non_default_providers static branch
is enabled, the default path is taken unconditionally.

If it no longer is a pure page pool, maybe it can be renamed. I
personally don't care.


Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-21 Thread Jakub Kicinski
On Sat, 19 Aug 2023 12:12:16 -0400 Willem de Bruijn wrote:
> :-) For the record, there is a prior version that added a separate type.
> 
> I did not like the churn it brought and asked for this.

It does end up looking cleaner that I personally expected, FWIW.

> > Use of the LSB (or bits depending on alignment expectations) is a common
> > trick and already done in quite a few places in the networking stack.
> > This trick is essential to any realistic change here to incorporate gpu
> > memory; way too much code will have unnecessary churn without it.

We'll end up needing the LSB trick either way, right? The only question
is whether the "if" is part of page pool or the caller of page pool.

Having seen zctap I'm afraid if we push this out of pp every provider
will end up re-implementing page pool's recycling/caching functionality
:(

Maybe we need to "fork" the API? The device memory "ifs" are only needed
for data pages. Which means that we can retain a faster, "if-less" API
for headers and XDP. Or is that too much duplication?


Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Willem de Bruijn
On Sat, Aug 19, 2023 at 11:49 AM David Ahern  wrote:
>
> On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> >
> > I do see the problem of depending on having a struct page, as the
> > page_pool_iov isn't related to struct page.  Having "page" in the name
> > of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> > as we all know).
> >
> > To support more allocator types, perhaps skb->pp_recycle bit need to
> > grow another bit (and be renamed skb->recycle), so we can tell allocator
> > types apart, those that are page based and those whom are not.
> >
> >
> >> I think the feedback has been strong to not multiplex yet another
> >> memory type into that struct, that is not a real page. Which is why
> >> we went into this direction. This latest series limits the impact largely
> >> to networking structures and code.
> >>
> >
> > Some what related what I'm objecting to: the "page_pool_iov" is not a
> > real page, but this getting recycled into something called "page_pool",
> > which funny enough deals with struct-pages internally and depend on the
> > struct-page-refcnt.
> >
> > Given the approach changed way from using struct page, then I also don't
> > see the connection with the page_pool. Sorry.
>
> I do not care for the page_pool_iov name either; I presumed it was least
> change to prove an idea and the name and details would evolve.
>
> How about something like buffer_pool or netdev_buf_pool that can operate
> with either pages, dma addresses, or something else in the future?

Sounds good. I suggested this name, but I see how using page in the
name is not very clear.

> >
> >> As for the LSB trick: that avoided adding a lot of boilerplate churn
> >> with new type and helper functions.
> >>
> >
> > Says the lazy programmer :-P ... sorry could not resist ;-)

:-) For the record, there is a prior version that added a separate type.

I did not like the churn it brought and asked for this.

>
> Use of the LSB (or bits depending on alignment expectations) is a common
> trick and already done in quite a few places in the networking stack.
> This trick is essential to any realistic change here to incorporate gpu
> memory; way too much code will have unnecessary churn without it.
>
> I do prefer my earlier suggestion though where the skb_frag_t has a
> union of relevant types though. Instead of `struct page *page` it could
> be `void *addr` with the helpers indicating page, iov, or other.

Okay. I think that is how we did it previously.


Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Willem de Bruijn
On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
 wrote:
>
>
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>
> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).

Perhaps I misunderstand, but this suggests continuing to using
struct page to demultiplex memory type?

I think the feedback has been strong to not multiplex yet another
memory type into that struct, that is not a real page. Which is why
we went into this direction. This latest series limits the impact largely
to networking structures and code.

One way or another, there will be a branch and multiplexing. Whether
that is in struct page, the page pool or a new netdev mem type as you
propose.

Any regression in page pool can be avoided in the common case that
does not use device mem by placing that behind a static_branch. Would
that address your performance concerns?

>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API.  The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution.  I really think netstack need to support different allocator
> types.

To me this is lifting page_pool into such a netstack alloctator pool.

Not sure adding another explicit layer of indirection would be cleaner
or faster (potentially more indirect calls).

As for the LSB trick: that avoided adding a lot of boilerplate churn
with new type and helper functions.



> The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators.  In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry 
> > ---
> >   include/net/page_pool.h | 74 ++-
> >   net/core/page_pool.c| 85 -
> >   2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov 
> > *page_to_page_pool_iov(struct page *page)
> >   return NULL;
> >   }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > + if (page_is_page_pool_iov(page))
> > + return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > + return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Willem de Bruijn
> > Any regression in page pool can be avoided in the common case that
> > does not use device mem by placing that behind a static_branch. Would
> > that address your performance concerns?
> >
>
> No. This will not help.
>
> The problem is that every where in the page_pool code it is getting
> polluted with:
>
>if (page_is_page_pool_iov(page))
>  call-some-iov-func-instead()
>
> Like: the very central piece of getting the refcnt:
>
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +   if (page_is_page_pool_iov(page))
> +   return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +   return page_ref_count(page);
> +}
>
>
> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> around 14 cycles (tsc). Thus, any extra code in this code patch will
> change the fast-path.

With static_branch disabled, it would only insert a NOP?


Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Mina Almasry
On Sat, Aug 19, 2023 at 1:24 PM Mina Almasry  wrote:
>
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
>  wrote:
> >
> >
> >
> > On 19/08/2023 16.08, Willem de Bruijn wrote:
> > > On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> > >  wrote:
> > >>
> > >>
> > >>
> > >> On 10/08/2023 03.57, Mina Almasry wrote:
> > >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> > >>>
> > >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > >>> handling on those helpers. Modify callers of these mm APIs with calls to
> > >>> these helpers instead.
> > >>>
> > >>
> > >> I don't like of this approach.
> > >> This is adding code to the PP (page_pool) fast-path in multiple places.
> > >>
> > >> I've not had time to run my usual benchmarks, which are here:
> > >>
> > >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > >>
>
> Thank you for linking that, I'll try to run these against the next submission.
>
> > >> But I'm sure it will affect performance.
> > >>
> > >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> > >> that page-pointer are not really struct-pages, feels like force feeding
> > >> a solution just to use the page_pool APIs.
> > >>
> > >>
> > >>> In areas where struct page* is dereferenced, add a check for special
> > >>> handling of page_pool_iov.
> > >>>
> > >>> The memory providers producing page_pool_iov can set the LSB on the
> > >>> struct page* returned to the page pool.
> > >>>
> > >>> Note that instead of overloading the LSB of page pointers, we can
> > >>> instead define a new union between struct page & struct page_pool_iov 
> > >>> and
> > >>> compact it in a new type. However, we'd need to implement the code churn
> > >>> to modify the page_pool & drivers to use this new type. For this POC
> > >>> that is not implemented (feedback welcome).
> > >>>
> > >>
> > >> I've said before, that I prefer multiplexing on page->pp_magic.
> > >> For your page_pool_iov the layout would have to match the offset of
> > >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> > >> would also need to align).
> > >
> > > Perhaps I misunderstand, but this suggests continuing to using
> > > struct page to demultiplex memory type?
> > >
> >
> > (Perhaps we are misunderstanding each-other and my use of the words
> > multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
> > native language.)
> >
> > I do see the problem of depending on having a struct page, as the
> > page_pool_iov isn't related to struct page.  Having "page" in the name
> > of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> > as we all know).
> >
> > To support more allocator types, perhaps skb->pp_recycle bit need to
> > grow another bit (and be renamed skb->recycle), so we can tell allocator
> > types apart, those that are page based and those whom are not.
> >
> >
> > > I think the feedback has been strong to not multiplex yet another
> > > memory type into that struct, that is not a real page. Which is why
> > > we went into this direction. This latest series limits the impact largely
> > > to networking structures and code.
> > >
> >
> > Some what related what I'm objecting to: the "page_pool_iov" is not a
> > real page, but this getting recycled into something called "page_pool",
> > which funny enough deals with struct-pages internally and depend on the
> > struct-page-refcnt.
> >
> > Given the approach changed way from using struct page, then I also don't
> > see the connection with the page_pool. Sorry.
> >
> > > One way or another, there will be a branch and multiplexing. Whether
> > > that is in struct page, the page pool or a new netdev mem type as you
> > > propose.
> > >
> >
> > I'm asking to have this branch/multiplexing done a the call sites.
> >
> > (IMHO not changing the drivers is a pipe-dream.)
> >
>
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
>
> driver ---> new_layer (does multiplexing) ---> page_pool ---> mm layer
> \-->
> devmem_pool --> dma-buf layer
>

Gmail mangled this :/ Let me try again. This should read:

driver -> new_layer -> page_pool --> mm layer
  \->devmem_pool -> dma-buf

> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
>
> driver --> page_pool ---> memory_provider (does multiplexing) --->
> basic_provider ---> mm layer
>
> \> devmem_provider --> dma-buf
> layer
>

This should read:
driver -> pp -> memory provider -> basic provider -> mm
   

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Mina Almasry
On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
 wrote:
>
>
>
> On 19/08/2023 16.08, Willem de Bruijn wrote:
> > On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> >  wrote:
> >>
> >>
> >>
> >> On 10/08/2023 03.57, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>
> >> I don't like of this approach.
> >> This is adding code to the PP (page_pool) fast-path in multiple places.
> >>
> >> I've not had time to run my usual benchmarks, which are here:
> >>
> >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> >>

Thank you for linking that, I'll try to run these against the next submission.

> >> But I'm sure it will affect performance.
> >>
> >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> >> that page-pointer are not really struct-pages, feels like force feeding
> >> a solution just to use the page_pool APIs.
> >>
> >>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> The memory providers producing page_pool_iov can set the LSB on the
> >>> struct page* returned to the page pool.
> >>>
> >>> Note that instead of overloading the LSB of page pointers, we can
> >>> instead define a new union between struct page & struct page_pool_iov and
> >>> compact it in a new type. However, we'd need to implement the code churn
> >>> to modify the page_pool & drivers to use this new type. For this POC
> >>> that is not implemented (feedback welcome).
> >>>
> >>
> >> I've said before, that I prefer multiplexing on page->pp_magic.
> >> For your page_pool_iov the layout would have to match the offset of
> >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> >> would also need to align).
> >
> > Perhaps I misunderstand, but this suggests continuing to using
> > struct page to demultiplex memory type?
> >
>
> (Perhaps we are misunderstanding each-other and my use of the words
> multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
> native language.)
>
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> as we all know).
>
> To support more allocator types, perhaps skb->pp_recycle bit need to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those whom are not.
>
>
> > I think the feedback has been strong to not multiplex yet another
> > memory type into that struct, that is not a real page. Which is why
> > we went into this direction. This latest series limits the impact largely
> > to networking structures and code.
> >
>
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
>
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.
>
> > One way or another, there will be a branch and multiplexing. Whether
> > that is in struct page, the page pool or a new netdev mem type as you
> > propose.
> >
>
> I'm asking to have this branch/multiplexing done a the call sites.
>
> (IMHO not changing the drivers is a pipe-dream.)
>

I think I understand what Jesper is saying. I think Jesper wants the
page_pool to remain unchanged, and another layer on top of it to do
the multiplexing, i.e.:

driver ---> new_layer (does multiplexing) ---> page_pool ---> mm layer
\-->
devmem_pool --> dma-buf layer

But, I think, Jakub wants the page_pool to be the front end, and for
the multiplexing to happen in the page pool (I think, Jakub did not
write this in an email, but this is how I interpret his comments from
[1], and his memory provider RFC). So I think Jakub wants:

driver --> page_pool ---> memory_provider (does multiplexing) --->
basic_provider ---> mm layer

\> devmem_provider --> dma-buf
layer

That is the approach in this RFC.

I think proper naming that makes sense can be figured out, and is not
a huge issue. I think in both cases we can minimize the changes to the
drivers, maybe. In the first approach the driver will need to use the
APIs created by the new layer. In the second approach, the driver
continues to use page_pool APIs.

I think we need to converge on a path between the 2 approaches (or
maybe 3rd approach to do). For me the pros/cons of each approach
(please add):

multiplexing in new_layer:
- Pro: maybe 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread David Ahern
On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> 
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> as we all know).
> 
> To support more allocator types, perhaps skb->pp_recycle bit need to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those whom are not.
> 
> 
>> I think the feedback has been strong to not multiplex yet another
>> memory type into that struct, that is not a real page. Which is why
>> we went into this direction. This latest series limits the impact largely
>> to networking structures and code.
>>
> 
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
> 
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.

I do not care for the page_pool_iov name either; I presumed it was least
change to prove an idea and the name and details would evolve.

How about something like buffer_pool or netdev_buf_pool that can operate
with either pages, dma addresses, or something else in the future?

> 
>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>> with new type and helper functions.
>>
> 
> Says the lazy programmer :-P ... sorry could not resist ;-)

Use of the LSB (or bits depending on alignment expectations) is a common
trick and already done in quite a few places in the networking stack.
This trick is essential to any realistic change here to incorporate gpu
memory; way too much code will have unnecessary churn without it.

I do prefer my earlier suggestion though where the skb_frag_t has a
union of relevant types though. Instead of `struct page *page` it could
be `void *addr` with the helpers indicating page, iov, or other.



Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Jesper Dangaard Brouer




On 19/08/2023 16.08, Willem de Bruijn wrote:

On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
 wrote:




On 10/08/2023 03.57, Mina Almasry wrote:

Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page * into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.



I don't like of this approach.
This is adding code to the PP (page_pool) fast-path in multiple places.

I've not had time to run my usual benchmarks, which are here:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

But I'm sure it will affect performance.

Regardless of performance, this approach is using ptr-LSB-bits, to hide
that page-pointer are not really struct-pages, feels like force feeding
a solution just to use the page_pool APIs.



In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.

The memory providers producing page_pool_iov can set the LSB on the
struct page* returned to the page pool.

Note that instead of overloading the LSB of page pointers, we can
instead define a new union between struct page & struct page_pool_iov and
compact it in a new type. However, we'd need to implement the code churn
to modify the page_pool & drivers to use this new type. For this POC
that is not implemented (feedback welcome).



I've said before, that I prefer multiplexing on page->pp_magic.
For your page_pool_iov the layout would have to match the offset of
pp_magic, to do this. (And if insisting on using PP infra the refcnt
would also need to align).


Perhaps I misunderstand, but this suggests continuing to using
struct page to demultiplex memory type?



(Perhaps we are misunderstanding each-other and my use of the words 
multiplexing and demultiplex are wrong, I'm sorry, as English isn't my 
native language.)


I do see the problem of depending on having a struct page, as the 
page_pool_iov isn't related to struct page.  Having "page" in the name 
of "page_pool_iov" is also confusing (hardest problem is CS is naming, 
as we all know).


To support more allocator types, perhaps skb->pp_recycle bit need to 
grow another bit (and be renamed skb->recycle), so we can tell allocator 
types apart, those that are page based and those whom are not.




I think the feedback has been strong to not multiplex yet another
memory type into that struct, that is not a real page. Which is why
we went into this direction. This latest series limits the impact largely
to networking structures and code.



Some what related what I'm objecting to: the "page_pool_iov" is not a 
real page, but this getting recycled into something called "page_pool", 
which funny enough deals with struct-pages internally and depend on the 
struct-page-refcnt.


Given the approach changed way from using struct page, then I also don't 
see the connection with the page_pool. Sorry.



One way or another, there will be a branch and multiplexing. Whether
that is in struct page, the page pool or a new netdev mem type as you
propose.



I'm asking to have this branch/multiplexing done a the call sites.

(IMHO not changing the drivers is a pipe-dream.)


Any regression in page pool can be avoided in the common case that
does not use device mem by placing that behind a static_branch. Would
that address your performance concerns?



No. This will not help.

The problem is that every where in the page_pool code it is getting 
polluted with:


  if (page_is_page_pool_iov(page))
call-some-iov-func-instead()

Like: the very central piece of getting the refcnt:

+static inline int page_pool_page_ref_count(struct page *page)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+   return page_ref_count(page);
+}


The fast-path of the PP is used for XDP_DROP scenarios, and is currently 
around 14 cycles (tsc). Thus, any extra code in this code patch will 
change the fast-path.





On the allocation side, all drivers already use a driver helper
page_pool_dev_alloc_pages() or we could add another (better named)
helper to multiplex between other types of allocators, e.g. a devmem
allocator.

On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
could multiplex on pp_magic and call another API.  The API could be an
extension to PP helpers, but it could also be a devmap allocator helper.

IMHO forcing/piggy-bagging everything into page_pool is not the right
solution.  I really think netstack need to support different allocator
types.


To me this is lifting page_pool into such a netstack alloctator pool.



This is should be renamed as it is not longer dealing with pages.


Not sure adding another explicit layer of indirection would be cleaner
or faster (potentially more indirect calls).



It seems we are talking past each-other.  The layer of indirection I'm 
talking about is likely a 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread Jesper Dangaard Brouer




On 10/08/2023 03.57, Mina Almasry wrote:

Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page * into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.



I don't like of this approach.
This is adding code to the PP (page_pool) fast-path in multiple places.

I've not had time to run my usual benchmarks, which are here:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

But I'm sure it will affect performance.

Regardless of performance, this approach is using ptr-LSB-bits, to hide
that page-pointer are not really struct-pages, feels like force feeding
a solution just to use the page_pool APIs.



In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.

The memory providers producing page_pool_iov can set the LSB on the
struct page* returned to the page pool.

Note that instead of overloading the LSB of page pointers, we can
instead define a new union between struct page & struct page_pool_iov and
compact it in a new type. However, we'd need to implement the code churn
to modify the page_pool & drivers to use this new type. For this POC
that is not implemented (feedback welcome).



I've said before, that I prefer multiplexing on page->pp_magic.
For your page_pool_iov the layout would have to match the offset of
pp_magic, to do this. (And if insisting on using PP infra the refcnt
would also need to align).

On the allocation side, all drivers already use a driver helper
page_pool_dev_alloc_pages() or we could add another (better named)
helper to multiplex between other types of allocators, e.g. a devmem
allocator.

On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
could multiplex on pp_magic and call another API.  The API could be an
extension to PP helpers, but it could also be a devmap allocator helper.

IMHO forcing/piggy-bagging everything into page_pool is not the right 
solution.  I really think netstack need to support different allocator 
types. The page pool have been leading the way, yes, but perhaps it is 
time to add an API layer that e.g. could be named netmem, that gives us 
the multiplexing between allocators.  In that process some of page_pool 
APIs would be lifted out as common blocks and others remain.


--Jesper


I have a sample implementation of adding a new page_pool_token type
in the page_pool to give a general idea here:
https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d

Full branch here:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens

(In the branches above, page_pool_iov is called devmem_slice).

Could also add static_branch to speed up the checks in page_pool_iov
memory providers are being used.

Signed-off-by: Mina Almasry 
---
  include/net/page_pool.h | 74 ++-
  net/core/page_pool.c| 85 -
  2 files changed, 131 insertions(+), 28 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 537eb36115ed..f08ca230d68e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -282,6 +282,64 @@ static inline struct page_pool_iov 
*page_to_page_pool_iov(struct page *page)
return NULL;
  }

+static inline int page_pool_page_ref_count(struct page *page)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+   return page_ref_count(page);
+}
+
+static inline void page_pool_page_get_many(struct page *page,
+  unsigned int count)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_get_many(page_to_page_pool_iov(page),
+ count);
+
+   return page_ref_add(page, count);
+}
+
+static inline void page_pool_page_put_many(struct page *page,
+  unsigned int count)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_put_many(page_to_page_pool_iov(page),
+ count);
+
+   if (count > 1)
+   page_ref_sub(page, count - 1);
+
+   put_page(page);
+}
+
+static inline bool page_pool_page_is_pfmemalloc(struct page *page)
+{
+   if (page_is_page_pool_iov(page))
+   return false;
+
+   return page_is_pfmemalloc(page);
+}
+
+static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
+{
+   /* Assume page_pool_iov are on the preferred node without actually
+* checking...
+*
+* This check is only used to check for recycling memory in the page
+* pool's fast paths. Currently the only implementation of page_pool_iov
+* is dmabuf device memory. It's a deliberate decision by the user to
+

[RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-09 Thread Mina Almasry
Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page * into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.

In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.

The memory providers producing page_pool_iov can set the LSB on the
struct page* returned to the page pool.

Note that instead of overloading the LSB of page pointers, we can
instead define a new union between struct page & struct page_pool_iov and
compact it in a new type. However, we'd need to implement the code churn
to modify the page_pool & drivers to use this new type. For this POC
that is not implemented (feedback welcome).

I have a sample implementation of adding a new page_pool_token type
in the page_pool to give a general idea here:
https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d

Full branch here:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens

(In the branches above, page_pool_iov is called devmem_slice).

Could also add static_branch to speed up the checks in page_pool_iov
memory providers are being used.

Signed-off-by: Mina Almasry 
---
 include/net/page_pool.h | 74 ++-
 net/core/page_pool.c| 85 -
 2 files changed, 131 insertions(+), 28 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 537eb36115ed..f08ca230d68e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -282,6 +282,64 @@ static inline struct page_pool_iov 
*page_to_page_pool_iov(struct page *page)
return NULL;
 }

+static inline int page_pool_page_ref_count(struct page *page)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+   return page_ref_count(page);
+}
+
+static inline void page_pool_page_get_many(struct page *page,
+  unsigned int count)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_get_many(page_to_page_pool_iov(page),
+ count);
+
+   return page_ref_add(page, count);
+}
+
+static inline void page_pool_page_put_many(struct page *page,
+  unsigned int count)
+{
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_put_many(page_to_page_pool_iov(page),
+ count);
+
+   if (count > 1)
+   page_ref_sub(page, count - 1);
+
+   put_page(page);
+}
+
+static inline bool page_pool_page_is_pfmemalloc(struct page *page)
+{
+   if (page_is_page_pool_iov(page))
+   return false;
+
+   return page_is_pfmemalloc(page);
+}
+
+static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
+{
+   /* Assume page_pool_iov are on the preferred node without actually
+* checking...
+*
+* This check is only used to check for recycling memory in the page
+* pool's fast paths. Currently the only implementation of page_pool_iov
+* is dmabuf device memory. It's a deliberate decision by the user to
+* bind a certain dmabuf to a certain netdev, and the netdev rx queue
+* would not be able to reallocate memory from another dmabuf that
+* exists on the preferred node, so, this check doesn't make much sense
+* in this case. Assume all page_pool_iovs can be recycled for now.
+*/
+   if (page_is_page_pool_iov(page))
+   return true;
+
+   return page_to_nid(page) == pref_nid;
+}
+
 struct page_pool {
struct page_pool_params p;

@@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, 
long nr)
 {
long ret;

+   if (page_is_page_pool_iov(page))
+   return -EINVAL;
+
/* If nr == pp_frag_count then we have cleared all remaining
 * references to the page. No need to actually overwrite it, instead
 * we can leave this to be overwritten by the calling function.
@@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct 
page_pool *pool,

 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-   dma_addr_t ret = page->dma_addr;
+   dma_addr_t ret;
+
+   if (page_is_page_pool_iov(page))
+   return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
+
+   ret = page->dma_addr;

if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
@@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct 
page *page)

 static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
+   /* page_pool_iovs are mapped and their dma-addr can't be modified. */
+