Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-10 Thread Jakub Kicinski
On Tue, 7 Nov 2023 11:53:22 -0800 Mina Almasry wrote:
> My bad on not including some docs about this. The next version should
> have the commit message beefed up to explain all this, or a docs
> patch.

Yes, please. Would be great to have the user facing interface well
explained under Documentation/


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-10 Thread Jakub Kicinski
On Tue, 7 Nov 2023 14:23:20 -0800 Stanislav Fomichev wrote:
> Can we mark a socket as devmem-only? Do we have any use-case for those
> hybrid setups? Or, let me put it that way: do we expect API callers
> to handle both linear and non-linear cases correctly?
> As a consumer of the previous versions of these apis internally,
> I find all those corner cases confusing :-( Hence trying to understand
> whether we can make it a bit more rigid and properly defined upstream.

FWIW I'd also prefer to allow mixing. "Some NICs" can decide HDS
very flexibly, incl. landing full jumbo frames into the "headers".

There's no sender API today to signal how to mark the data for
selective landing,  but if Mina already has the rx side written 
to allow that...


RE: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-08 Thread David Laight
From: Mina Almasry
> Sent: 06 November 2023 02:44
> 
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
> 
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
> 
...
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1fae276c1353..8fb468ff8115 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>   *   @csum_level: indicates the number of consecutive checksums found in
>   *   the packet minus one that have been verified as
>   *   CHECKSUM_UNNECESSARY (max 3)
> + *   @devmem: indicates that all the fragments in this skb are backed by
> + *   device memory.
>   *   @dst_pending_confirm: need to confirm neighbour
>   *   @decrypted: Decrypted SKB
>   *   @slow_gro: state present at GRO time, slower prepare step required
> @@ -991,7 +993,7 @@ struct sk_buff {
>  #if IS_ENABLED(CONFIG_IP_SCTP)
>   __u8csum_not_inet:1;
>  #endif
> -
> + __u8devmem:1;
>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>   __u16   tc_index;   /* traffic control index */
>  #endif
> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct 
> sk_buff *skb)
>   __skb_zcopy_downgrade_managed(skb);
>  }

Doesn't that bloat struct sk_buff?
I'm not sure there are any spare bits available.
Although CONFIG_NET_SWITCHDEV and CONFIG_NET_SCHED seem to
already add padding.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/07, Eric Dumazet wrote:
> On Tue, Nov 7, 2023 at 10:05 PM Stanislav Fomichev  wrote:
> 
> >
> > I don't understand. We require an elaborate setup to receive devmem cmsgs,
> > why would some random application receive those?
> 
> 
> A TCP socket can receive 'valid TCP packets' from many different sources,
> especially with BPF hooks...
> 
> Think of a bonding setup, packets being mirrored by some switches or
> even from tc.
> 
> Better double check than be sorry.
> 
> We have not added a 5th component in the 4-tuple lookups, being "is
> this socket a devmem one".
> 
> A mix of regular/devmem skb is supported.

Can we mark a socket as devmem-only? Do we have any use-case for those
hybrid setups? Or, let me put it that way: do we expect API callers
to handle both linear and non-linear cases correctly?
As a consumer of the previous versions of these apis internally,
I find all those corner cases confusing :-( Hence trying to understand
whether we can make it a bit more rigid and properly defined upstream.

But going back to that MSG_SOCK_DEVMEM flag. If the application is
supposed to handle both linear and devmem chucks, why do we need
this extra MSG_SOCK_DEVMEM opt-in to signal that it's able to process
it? From Mina's reply, it seemed like MSG_SOCK_DEVMEM is there to
protect random applications that get misrouted devmem skb. I don't
see how returning EFAULT helps in that case.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Eric Dumazet
On Tue, Nov 7, 2023 at 10:05 PM Stanislav Fomichev  wrote:

>
> I don't understand. We require an elaborate setup to receive devmem cmsgs,
> why would some random application receive those?


A TCP socket can receive 'valid TCP packets' from many different sources,
especially with BPF hooks...

Think of a bonding setup, packets being mirrored by some switches or
even from tc.

Better double check than be sorry.

We have not added a 5th component in the 4-tuple lookups, being "is
this socket a devmem one".

A mix of regular/devmem skb is supported.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/07, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 5:06 PM Stanislav Fomichev  wrote:
> [..]
> > > > > And the socket has to know this association; otherwise those tokens
> > > > > are useless since they don't carry anything to identify the dmabuf.
> > > > >
> > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > it somehow implies that I have an option of passing or not passing it
> > > > > for an individual system call.
> > >
> > > You do have the option of passing it or not passing it per system
> > > call. The MSG_SOCK_DEVMEM says the application is willing to receive
> > > devmem cmsgs - that's all. The application doesn't get to decide
> > > whether it's actually going to receive a devmem cmsg or not, because
> > > that's dictated by the type of skb that is present in the receive
> > > queue, and not up to the application. I should explain this in the
> > > commit message...
> >
> > What would be the case of passing it or not passing it? Some fallback to
> > the host memory after flow steering update? Yeah, would be useful to
> > document those constrains. I'd lean on starting stricter and relaxing
> > those conditions if we find the use-cases.
> >
> 
> MSG_SOCK_DEVMEM (or its replacement SOCK_DEVMEM or SO_SOCK_DEVMEM),
> just says that the application is able to receive devmem cmsgs and
> will parse them. The use case for not setting that flag is existing
> applications that are not aware of devmem cmsgs. I don't want those
> applications to think they're receiving data in the linear buffer only
> to find out that the data is in devmem and they ignored the devmem
> cmsg.
> 
> So, what happens:
> 
> - MSG_SOCK_DEVMEM provided and next skb in the queue is devmem:
> application receives cmsgs.
> - MSG_SOCK_DEVMEM provided and next skb in the queue is non-devmem:
> application receives in the linear buffer.
> - MSG_SOCK_DEVMEM not provided and net skb is devmem: application
> receives EFAULT.
> - MSG_SOCK_DEVMEM not provided and next skb is non-devmem: application
> receives in the linear buffer.
> 
> My bad on not including some docs about this. The next version should
> have the commit message beefed up to explain all this, or a docs
> patch.

I don't understand. We require an elaborate setup to receive devmem cmsgs,
why would some random application receive those?


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 5:06 PM Stanislav Fomichev  wrote:
[..]
> > > > And the socket has to know this association; otherwise those tokens
> > > > are useless since they don't carry anything to identify the dmabuf.
> > > >
> > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > it somehow implies that I have an option of passing or not passing it
> > > > for an individual system call.
> >
> > You do have the option of passing it or not passing it per system
> > call. The MSG_SOCK_DEVMEM says the application is willing to receive
> > devmem cmsgs - that's all. The application doesn't get to decide
> > whether it's actually going to receive a devmem cmsg or not, because
> > that's dictated by the type of skb that is present in the receive
> > queue, and not up to the application. I should explain this in the
> > commit message...
>
> What would be the case of passing it or not passing it? Some fallback to
> the host memory after flow steering update? Yeah, would be useful to
> document those constrains. I'd lean on starting stricter and relaxing
> those conditions if we find the use-cases.
>

MSG_SOCK_DEVMEM (or its replacement SOCK_DEVMEM or SO_SOCK_DEVMEM),
just says that the application is able to receive devmem cmsgs and
will parse them. The use case for not setting that flag is existing
applications that are not aware of devmem cmsgs. I don't want those
applications to think they're receiving data in the linear buffer only
to find out that the data is in devmem and they ignored the devmem
cmsg.

So, what happens:

- MSG_SOCK_DEVMEM provided and next skb in the queue is devmem:
application receives cmsgs.
- MSG_SOCK_DEVMEM provided and next skb in the queue is non-devmem:
application receives in the linear buffer.
- MSG_SOCK_DEVMEM not provided and net skb is devmem: application
receives EFAULT.
- MSG_SOCK_DEVMEM not provided and next skb is non-devmem: application
receives in the linear buffer.

My bad on not including some docs about this. The next version should
have the commit message beefed up to explain all this, or a docs
patch.


-- 
Thanks,
Mina


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/07, Willem de Bruijn wrote:
> On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev  wrote:
> >
> > On 11/06, Willem de Bruijn wrote:
> > > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is 
> > > > > > > that
> > > > > > > it somehow implies that I have an option of passing or not 
> > > > > > > passing it
> > > > > > > for an individual system call.
> > > > > > > If we know that we're going to use dmabuf with the socket, maybe 
> > > > > > > we
> > > > > > > should move this flag to the socket() syscall?
> > > > > > >
> > > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > I think it should then be a setsockopt called before any data is
> > > > > > exchanged, with no change of modifying mode later. We generally use
> > > > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > > > passively opened connections, or be overly restrictive: one approach
> > > > > > for all accepted child sockets.
> > > > >
> > > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > > > are plenty of bits we can grab. But setsockopt works as well!
> > > >
> > > > To follow up: if we have this flag on a socket, not on a per-message
> > > > basis, can we also use recvmsg for the recycling part maybe?
> > > >
> > > > while (true) {
> > > > memset(msg, 0, ...);
> > > >
> > > > /* receive the tokens */
> > > > ret = recvmsg(fd, , 0);
> > > >
> > > > /* recycle the tokens from the above recvmsg() */
> > > > ret = recvmsg(fd, , MSG_RECYCLE);
> > > > }
> > > >
> > > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > > > to recycle a range.
> > > >
> > > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > > > Or is it more confusing?
> > >
> > > It would have to be sendmsg, as recvmsg is a copy_to_user operation.
> > >
> > >
> > > I am not aware of any precedent in multiplexing the data stream and a
> > > control operation stream in this manner. It would also require adding
> > > a branch in the sendmsg hot path.
> >
> > Is it too much plumbing to copy_from_user msg_control deep in recvmsg
> > stack where we need it? Mixing in sendmsg is indeed ugly :-(
> 
> I tried exactly the inverse of that when originally adding
> MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications
> on sendmsg calls by writing to sendmsg msg_control on return to user.
> It required significant code churn, which the performance gains did
> not warrant. Doing so also breaks the simple rule that recv is for
> reading and send is for writing.

We're breaking so many rules here, so not sure we should be super
constrained :-D

> > Regarding hot patch: aren't we already doing copy_to_user for the tokens in
> > this hot path, so having one extra condition shouldn't hurt too much?
> 
> We're doing that in the optional cmsg handling of recvmsg, which is
> already a slow path (compared to the data read() itself).
> 
> > > The memory is associated with the socket, freed when the socket is
> > > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> > > state operation, for which setsockopt is the socket interface.
> > >
> > > Is your request purely a dislike, or is there some technical concern
> > > with BPF and setsockopt?
> >
> > It's mostly because I've been bitten too much by custom socket options that
> > are not really on/off/update-value operations:
> >
> > 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
> > 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
> > 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
> > d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > 
> > PAGE_SIZE
> >
> > I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
> > things tend to evolve and change.
> 
> I see. I'm a bit concerned if we start limiting what we can do in
> sockets because of dependencies that BPF processing places on them.
> The use case for BPF [gs]etsockopt is limited to specific control mode
> calls. Would it make sense to just exclude calls like
> SO_DEVMEM_DONTNEED from this interpositioning?

Yup, that's why I'm asking. We already have ->bpf_bypass_getsockopt()
to special-case tcp zerocopy. We might add another bpf_bypass_setsockopt
to special case SO_DEVMEM_DONTNEED. That's why I'm trying to see if
there is a better alternative.

> At a high level what we really want is a high rate metadata path from
> user to kernel. And there are no perfect solutions. From kernel to
> user we use the socket error queue for this. That was never intended
> for high event rate itself, dealing with ICMP errors and the like
> before timestamps and zerocopy notifications were 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Willem de Bruijn
On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev  wrote:
>
> On 11/06, Willem de Bruijn wrote:
> > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > > it somehow implies that I have an option of passing or not passing 
> > > > > > it
> > > > > > for an individual system call.
> > > > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > > > should move this flag to the socket() syscall?
> > > > > >
> > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > > >
> > > > > > ?
> > > > >
> > > > > I think it should then be a setsockopt called before any data is
> > > > > exchanged, with no change of modifying mode later. We generally use
> > > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > > passively opened connections, or be overly restrictive: one approach
> > > > > for all accepted child sockets.
> > > >
> > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > > are plenty of bits we can grab. But setsockopt works as well!
> > >
> > > To follow up: if we have this flag on a socket, not on a per-message
> > > basis, can we also use recvmsg for the recycling part maybe?
> > >
> > > while (true) {
> > > memset(msg, 0, ...);
> > >
> > > /* receive the tokens */
> > > ret = recvmsg(fd, , 0);
> > >
> > > /* recycle the tokens from the above recvmsg() */
> > > ret = recvmsg(fd, , MSG_RECYCLE);
> > > }
> > >
> > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > > to recycle a range.
> > >
> > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > > Or is it more confusing?
> >
> > It would have to be sendmsg, as recvmsg is a copy_to_user operation.
> >
> >
> > I am not aware of any precedent in multiplexing the data stream and a
> > control operation stream in this manner. It would also require adding
> > a branch in the sendmsg hot path.
>
> Is it too much plumbing to copy_from_user msg_control deep in recvmsg
> stack where we need it? Mixing in sendmsg is indeed ugly :-(

I tried exactly the inverse of that when originally adding
MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications
on sendmsg calls by writing to sendmsg msg_control on return to user.
It required significant code churn, which the performance gains did
not warrant. Doing so also breaks the simple rule that recv is for
reading and send is for writing.

> Regarding hot patch: aren't we already doing copy_to_user for the tokens in
> this hot path, so having one extra condition shouldn't hurt too much?

We're doing that in the optional cmsg handling of recvmsg, which is
already a slow path (compared to the data read() itself).

> > The memory is associated with the socket, freed when the socket is
> > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> > state operation, for which setsockopt is the socket interface.
> >
> > Is your request purely a dislike, or is there some technical concern
> > with BPF and setsockopt?
>
> It's mostly because I've been bitten too much by custom socket options that
> are not really on/off/update-value operations:
>
> 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
> 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
> 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
> d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > 
> PAGE_SIZE
>
> I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
> things tend to evolve and change.

I see. I'm a bit concerned if we start limiting what we can do in
sockets because of dependencies that BPF processing places on them.
The use case for BPF [gs]etsockopt is limited to specific control mode
calls. Would it make sense to just exclude calls like
SO_DEVMEM_DONTNEED from this interpositioning?

At a high level what we really want is a high rate metadata path from
user to kernel. And there are no perfect solutions. From kernel to
user we use the socket error queue for this. That was never intended
for high event rate itself, dealing with ICMP errors and the like
before timestamps and zerocopy notifications were added.

If I squint hard enough I can see some prior art in mixing data and
high rate state changes within the same channel in NIC descriptor
queues, where some devices do this, e.g.,  { "insert encryption key",
"send packet" }. But fundamentally I think we should keep the socket
queues for data only.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/06, Willem de Bruijn wrote:
> > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > it somehow implies that I have an option of passing or not passing it
> > > > > for an individual system call.
> > > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > > should move this flag to the socket() syscall?
> > > > >
> > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > >
> > > > > ?
> > > >
> > > > I think it should then be a setsockopt called before any data is
> > > > exchanged, with no change of modifying mode later. We generally use
> > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > passively opened connections, or be overly restrictive: one approach
> > > > for all accepted child sockets.
> > >
> > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > are plenty of bits we can grab. But setsockopt works as well!
> >
> > To follow up: if we have this flag on a socket, not on a per-message
> > basis, can we also use recvmsg for the recycling part maybe?
> >
> > while (true) {
> > memset(msg, 0, ...);
> >
> > /* receive the tokens */
> > ret = recvmsg(fd, , 0);
> >
> > /* recycle the tokens from the above recvmsg() */
> > ret = recvmsg(fd, , MSG_RECYCLE);
> > }
> >
> > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > to recycle a range.
> >
> > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > Or is it more confusing?
> 
> It would have to be sendmsg, as recvmsg is a copy_to_user operation.
>
>
> I am not aware of any precedent in multiplexing the data stream and a
> control operation stream in this manner. It would also require adding
> a branch in the sendmsg hot path.

Is it too much plumbing to copy_from_user msg_control deep in recvmsg
stack where we need it? Mixing in sendmsg is indeed ugly :-(

Regarding hot patch: aren't we already doing copy_to_user for the tokens in
this hot path, so having one extra condition shouldn't hurt too much?

> The memory is associated with the socket, freed when the socket is
> closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> state operation, for which setsockopt is the socket interface.
> 
> Is your request purely a dislike, or is there some technical concern
> with BPF and setsockopt?

It's mostly because I've been bitten too much by custom socket options that
are not really on/off/update-value operations:

29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > 
PAGE_SIZE

I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
things tend to evolve and change.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Willem de Bruijn
> > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > it somehow implies that I have an option of passing or not passing it
> > > > for an individual system call.
> > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > should move this flag to the socket() syscall?
> > > >
> > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > >
> > > > ?
> > >
> > > I think it should then be a setsockopt called before any data is
> > > exchanged, with no change of modifying mode later. We generally use
> > > setsockopts for the mode of a socket. This use of the protocol field
> > > in socket() for setting a mode would be novel. Also, it might miss
> > > passively opened connections, or be overly restrictive: one approach
> > > for all accepted child sockets.
> >
> > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > are plenty of bits we can grab. But setsockopt works as well!
>
> To follow up: if we have this flag on a socket, not on a per-message
> basis, can we also use recvmsg for the recycling part maybe?
>
> while (true) {
> memset(msg, 0, ...);
>
> /* receive the tokens */
> ret = recvmsg(fd, , 0);
>
> /* recycle the tokens from the above recvmsg() */
> ret = recvmsg(fd, , MSG_RECYCLE);
> }
>
> recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> to recycle a range.
>
> Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> Or is it more confusing?

It would have to be sendmsg, as recvmsg is a copy_to_user operation.

I am not aware of any precedent in multiplexing the data stream and a
control operation stream in this manner. It would also require adding
a branch in the sendmsg hot path.

The memory is associated with the socket, freed when the socket is
closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
state operation, for which setsockopt is the socket interface.

Is your request purely a dislike, or is there some technical concern
with BPF and setsockopt?


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 5:20 PM, Mina Almasry wrote:
> The user is free to modify or delete flow steering rules outside of the
> lifetime of the socket. Technically it's possible for the user to
> reconfigure flow steering while the socket is simultaneously receiving,
> and the result will be packets switching from devmem to non-devmem.

generically, from one page pool to another (ie., devmem piece of that
statement is not relevant).


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
>  wrote:
> >
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev  wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry  
> > > wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  
> > > > wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  
> > > > > > wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be 
> > > > > > > > > >> available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in 
> > > > > > > > > >> device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no 
> > > > > > > > > >> mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host 
> > > > > > > > > >> memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags 
> > > > > > > > > >> in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > > > > > >> page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing 
> > > > > > > > > >> the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non 
> > > > > > > > > >> devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > > > > > >> Signed-off-by: Mina Almasry 
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > > > > > >>  include/net/tcp.h  |  5 +--
> > > > > > > > > >>  net/core/datagram.c|  6 
> > > > > > > > > >>  net/core/gro.c |  5 ++-
> > > > > > > > > >>  net/core/skbuff.c  | 77 
> > > > > > > > > >> --
> > > > > > > > > >>  net/ipv4/tcp.c |  6 
> > > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h 
> > > > > > > > > >> b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >>   *  @csum_level: indicates the number of consecutive 
> > > > > > > > > >> checksums found in
> > > > > > > > > >>   *  the packet minus one that have been verified 
> > > > > > > > > >> as
> > > > > > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb 
> > > > > > > > > >> are backed by
> > > > > > > > > >> + *  device memory.
> > > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare 
> > > > > > > > > >> step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >>  __u8csum_not_inet:1;
> > > > > > > > > >>  #endif
> > > > > > > > > >> -
> > > > > > > > > >> +__u8devmem:1;
> > > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || 
> > > > > > > > > >> defined(CONFIG_NET_XGRESS)
> > > > > > > > > >>  __u16   tc_index;   /* traffic 
> > > > > > > > > >> control index */
> > > > > > > > > >>  #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >>  }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by 
> > > > > > > > > >> the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct 
> > > > > > > > > >> sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> +return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag 
> > > > > > > > > > to 'not_readable'?
> > > > > > 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/06, Stanislav Fomichev wrote:
> On 11/06, Willem de Bruijn wrote:
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev  wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry  
> > > wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  
> > > > wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  
> > > > > > wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be 
> > > > > > > > > >> available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in 
> > > > > > > > > >> device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no 
> > > > > > > > > >> mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host 
> > > > > > > > > >> memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags 
> > > > > > > > > >> in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > > > > > >> page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing 
> > > > > > > > > >> the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non 
> > > > > > > > > >> devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > > > > > >> Signed-off-by: Mina Almasry 
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > > > > > >>  include/net/tcp.h  |  5 +--
> > > > > > > > > >>  net/core/datagram.c|  6 
> > > > > > > > > >>  net/core/gro.c |  5 ++-
> > > > > > > > > >>  net/core/skbuff.c  | 77 
> > > > > > > > > >> --
> > > > > > > > > >>  net/ipv4/tcp.c |  6 
> > > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h 
> > > > > > > > > >> b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >>   *  @csum_level: indicates the number of consecutive 
> > > > > > > > > >> checksums found in
> > > > > > > > > >>   *  the packet minus one that have been verified 
> > > > > > > > > >> as
> > > > > > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb 
> > > > > > > > > >> are backed by
> > > > > > > > > >> + *  device memory.
> > > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare 
> > > > > > > > > >> step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >>  __u8csum_not_inet:1;
> > > > > > > > > >>  #endif
> > > > > > > > > >> -
> > > > > > > > > >> +__u8devmem:1;
> > > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || 
> > > > > > > > > >> defined(CONFIG_NET_XGRESS)
> > > > > > > > > >>  __u16   tc_index;   /* traffic 
> > > > > > > > > >> control index */
> > > > > > > > > >>  #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >>  }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by 
> > > > > > > > > >> the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct 
> > > > > > > > > >> sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> +return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag 
> > > > > > > > > > to 'not_readable'?
> > > > > > > > > > It better 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Mina Almasry
On Mon, Nov 6, 2023 at 4:16 PM David Ahern  wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 176eb5834746..cdd4fb129968 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff 
> > *skb, int offset,
> >   return 0;
> >   }
> >
> > + if (skb_frags_not_readable(skb))
> > + goto short_copy;
> > +
> >   /* Copy paged appendix. Hmm... why does this look so complicated? */
> >   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> >   int end;
> > @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct 
> > sock *sk,
> >  {
> >   int frag;
> >
> > + if (skb_frags_not_readable(skb))
> > + return -EFAULT;
>
> This check 
> > +
> >   if (msg && msg->msg_ubuf && msg->sg_from_iter)
> >   return msg->sg_from_iter(sk, skb, from, length);
>
>
> ... should go here. That allows custome sg_from_iter to have access to
> the skb. What matters is not expecting struct page (e.g., refcounting);
> if the custom iter does not do that then all is well. io_uring's iter
> does not look at the pages, so all good.
>
> >
> > diff --git a/net/core/gro.c b/net/core/gro.c
> > index 42d7f6755f32..56046d65386a 100644
> > --- a/net/core/gro.c
> > +++ b/net/core/gro.c
> > @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, 
> > int grow)
> >  {
> >   struct skb_shared_info *pinfo = skb_shinfo(skb);
> >
> > + if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> > + return;
> > +
> >   BUG_ON(skb->end - skb->tail < grow);
> >
> >   memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> > @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
> >  {
> >   int grow = skb_gro_offset(skb) - skb_headlen(skb);
> >
> > - if (grow > 0)
> > + if (grow > 0 && !skb_frags_not_readable(skb))
> >   gro_pull_from_frag0(skb, grow);
> >  }
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 13eca4fd25e1..f01673ed2eff 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct 
> > sk_buff *skb, bool full_pkt)
> >   struct page *p;
> >   u8 *vaddr;
> >
> > + if (skb_frag_is_page_pool_iov(frag)) {
>
> Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?

Seems like a silly choice on my end. I should probably check
skb_frags_not_readable() and not kmap any frags in that case. Will do.

-- 
Thanks,
Mina


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Mina Almasry
On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
 wrote:
>
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev  wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry  wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  
> > > > > wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be 
> > > > > > > > >> available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in 
> > > > > > > > >> device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no 
> > > > > > > > >> mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host 
> > > > > > > > >> memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in 
> > > > > > > > >> this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > > > > >> page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the 
> > > > > > > > >> frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem 
> > > > > > > > >> skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > > > > >> Signed-off-by: Mina Almasry 
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > > > > >>  include/net/tcp.h  |  5 +--
> > > > > > > > >>  net/core/datagram.c|  6 
> > > > > > > > >>  net/core/gro.c |  5 ++-
> > > > > > > > >>  net/core/skbuff.c  | 77 
> > > > > > > > >> --
> > > > > > > > >>  net/ipv4/tcp.c |  6 
> > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >>   *  @csum_level: indicates the number of consecutive 
> > > > > > > > >> checksums found in
> > > > > > > > >>   *  the packet minus one that have been verified as
> > > > > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb 
> > > > > > > > >> are backed by
> > > > > > > > >> + *  device memory.
> > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare 
> > > > > > > > >> step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >>  __u8csum_not_inet:1;
> > > > > > > > >>  #endif
> > > > > > > > >> -
> > > > > > > > >> +__u8devmem:1;
> > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >>  __u16   tc_index;   /* traffic 
> > > > > > > > >> control index */
> > > > > > > > >>  #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >>  }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the 
> > > > > > > > >> host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct 
> > > > > > > > >> sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> +return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > > > > > > 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't 
> > > > > > > > > dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other 
> > > > > > > > > potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 176eb5834746..cdd4fb129968 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, 
> int offset,
>   return 0;
>   }
>  
> + if (skb_frags_not_readable(skb))
> + goto short_copy;
> +
>   /* Copy paged appendix. Hmm... why does this look so complicated? */
>   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>   int end;
> @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct 
> sock *sk,
>  {
>   int frag;
>  
> + if (skb_frags_not_readable(skb))
> + return -EFAULT;

This check 
> +
>   if (msg && msg->msg_ubuf && msg->sg_from_iter)
>   return msg->sg_from_iter(sk, skb, from, length);


... should go here. That allows custome sg_from_iter to have access to
the skb. What matters is not expecting struct page (e.g., refcounting);
if the custom iter does not do that then all is well. io_uring's iter
does not look at the pages, so all good.

>  
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 42d7f6755f32..56046d65386a 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int 
> grow)
>  {
>   struct skb_shared_info *pinfo = skb_shinfo(skb);
>  
> + if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> + return;
> +
>   BUG_ON(skb->end - skb->tail < grow);
>  
>   memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
>  {
>   int grow = skb_gro_offset(skb) - skb_headlen(skb);
>  
> - if (grow > 0)
> + if (grow > 0 && !skb_frags_not_readable(skb))
>   gro_pull_from_frag0(skb, grow);
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 13eca4fd25e1..f01673ed2eff 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff 
> *skb, bool full_pkt)
>   struct page *p;
>   u8 *vaddr;
>  
> + if (skb_frag_is_page_pool_iov(frag)) {

Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/06, Willem de Bruijn wrote:
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev  wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry  wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  
> > > > > wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be 
> > > > > > > > >> available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in 
> > > > > > > > >> device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no 
> > > > > > > > >> mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host 
> > > > > > > > >> memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in 
> > > > > > > > >> this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > > > > >> page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the 
> > > > > > > > >> frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem 
> > > > > > > > >> skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > > > > >> Signed-off-by: Mina Almasry 
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > > > > >>  include/net/tcp.h  |  5 +--
> > > > > > > > >>  net/core/datagram.c|  6 
> > > > > > > > >>  net/core/gro.c |  5 ++-
> > > > > > > > >>  net/core/skbuff.c  | 77 
> > > > > > > > >> --
> > > > > > > > >>  net/ipv4/tcp.c |  6 
> > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >>   *  @csum_level: indicates the number of consecutive 
> > > > > > > > >> checksums found in
> > > > > > > > >>   *  the packet minus one that have been verified as
> > > > > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb 
> > > > > > > > >> are backed by
> > > > > > > > >> + *  device memory.
> > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare 
> > > > > > > > >> step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >>  __u8csum_not_inet:1;
> > > > > > > > >>  #endif
> > > > > > > > >> -
> > > > > > > > >> +__u8devmem:1;
> > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >>  __u16   tc_index;   /* traffic 
> > > > > > > > >> control index */
> > > > > > > > >>  #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >>  }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the 
> > > > > > > > >> host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct 
> > > > > > > > >> sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> +return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > > > > > > 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't 
> > > > > > > > > dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other 
> > > > > > > > > potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > Also, the flag on the skb is 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Willem de Bruijn
On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev  wrote:
>
> On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry  wrote:
> >
> > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  
> > > > wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  
> > > > > > wrote:
> > > > > > >
> > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > >> For device memory TCP, we expect the skb headers to be 
> > > > > > > >> available in host
> > > > > > > >> memory for access, and we expect the skb frags to be in device 
> > > > > > > >> memory
> > > > > > > >> and unaccessible to the host. We expect there to be no mixing 
> > > > > > > >> and
> > > > > > > >> matching of device memory frags (unaccessible) with host 
> > > > > > > >> memory frags
> > > > > > > >> (accessible) in the same skb.
> > > > > > > >>
> > > > > > > >> Add a skb->devmem flag which indicates whether the frags in 
> > > > > > > >> this skb
> > > > > > > >> are device memory frags or not.
> > > > > > > >>
> > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > > > >> page_pool_iovs,
> > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > >>
> > > > > > > >> Add checks through the network stack to avoid accessing the 
> > > > > > > >> frags of
> > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem 
> > > > > > > >> skbs.
> > > > > > > >>
> > > > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > > > >> Signed-off-by: Mina Almasry 
> > > > > > > >>
> > > > > > > >> ---
> > > > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > > > >>  include/net/tcp.h  |  5 +--
> > > > > > > >>  net/core/datagram.c|  6 
> > > > > > > >>  net/core/gro.c |  5 ++-
> > > > > > > >>  net/core/skbuff.c  | 77 
> > > > > > > >> --
> > > > > > > >>  net/ipv4/tcp.c |  6 
> > > > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > >>
> > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > >>   *  @csum_level: indicates the number of consecutive 
> > > > > > > >> checksums found in
> > > > > > > >>   *  the packet minus one that have been verified as
> > > > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are 
> > > > > > > >> backed by
> > > > > > > >> + *  device memory.
> > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step 
> > > > > > > >> required
> > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > >>  __u8csum_not_inet:1;
> > > > > > > >>  #endif
> > > > > > > >> -
> > > > > > > >> +__u8devmem:1;
> > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > >>  __u16   tc_index;   /* traffic 
> > > > > > > >> control index */
> > > > > > > >>  #endif
> > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > > > >>  }
> > > > > > > >>
> > > > > > > >> +/* Return true if frags in this skb are not readable by the 
> > > > > > > >> host. */
> > > > > > > >> +static inline bool skb_frags_not_readable(const struct 
> > > > > > > >> sk_buff *skb)
> > > > > > > >> +{
> > > > > > > >> +return skb->devmem;
> > > > > > > >
> > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > > > > > 'not_readable'?
> > > > > > > > It better communicates the fact that the stack shouldn't 
> > > > > > > > dereference the
> > > > > > > > frags (because it has 'devmem' fragments or for some other 
> > > > > > > > potential
> > > > > > > > future reason).
> > > > > > >
> > > > > > > +1.
> > > > > > >
> > > > > > > Also, the flag on the skb is an optimization - a high level 
> > > > > > > signal that
> > > > > > > one or more frags is in unreadable memory. There is no 
> > > > > > > requirement that
> > > > > > > all of the frags are in the same memory type.
> > > > >
> > > > > 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Mina Almasry
On Mon, Nov 6, 2023 at 3:37 PM David Ahern  wrote:
>
> On 11/6/23 3:18 PM, Mina Almasry wrote:
> >> @@ -991,7 +993,7 @@ struct sk_buff {
> >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> >>  __u8csum_not_inet:1;
> >>  #endif
> >> -
> >> +__u8devmem:1;
> >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >>  __u16   tc_index;   /* traffic control index 
> >> */
> >>  #endif
> >> @@ -1766,6 +1768,12 @@ static inline void 
> >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >>  __skb_zcopy_downgrade_managed(skb);
> >>  }
> >>
> >> +/* Return true if frags in this skb are not readable by the host. */
> >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >> +{
> >> +return skb->devmem;
> >
> > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > 'not_readable'?
> > It better communicates the fact that the stack shouldn't dereference the
> > frags (because it has 'devmem' fragments or for some other potential
> > future reason).
> 
>  +1.
> 
>  Also, the flag on the skb is an optimization - a high level signal that
>  one or more frags is in unreadable memory. There is no requirement that
>  all of the frags are in the same memory type.
> >>
> >> David: maybe there should be such a requirement (that they all are
> >> unreadable)? Might be easier to support initially; we can relax later
> >> on.
> >>
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > (all devmem or all non-devmem).
>
> What requires that restriction? In all of the uses of skb->devmem and
> skb_frags_not_readable() what matters is if any frag is not readable,
> then frag list walk or collapse is avoided.
>
>

Currently only tcp_recvmsg_devmem(), I think. tcp_recvmsg_locked()
delegates to tcp_recvmsg_devmem() if skb->devmem, and
tcp_recvmsg_devmem() net_err's if it finds a non-iov frag in the skb.
This is done for some simplicity, because iov's are given to the user
via cmsg, but pages are copied into the linear buffer. I think it
would be confusing for the user if we simultaneously copied some data
to the linear buffer and gave them a devmem cmsgs in the same
recvmsg() call.

So, my simplicity is:

1. in a single skb, all frags must be devmem or non-devmem, no mixing.
2. In a single recvmsg() call, we only process devmem or non-devmem
skbs, no mixing.

-- 
Thanks,
Mina


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry  wrote:
>
> On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  
> > > > > wrote:
> > > > > >
> > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > >> For device memory TCP, we expect the skb headers to be available 
> > > > > > >> in host
> > > > > > >> memory for access, and we expect the skb frags to be in device 
> > > > > > >> memory
> > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > >> matching of device memory frags (unaccessible) with host memory 
> > > > > > >> frags
> > > > > > >> (accessible) in the same skb.
> > > > > > >>
> > > > > > >> Add a skb->devmem flag which indicates whether the frags in this 
> > > > > > >> skb
> > > > > > >> are device memory frags or not.
> > > > > > >>
> > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > > >> page_pool_iovs,
> > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > >>
> > > > > > >> Add checks through the network stack to avoid accessing the 
> > > > > > >> frags of
> > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem 
> > > > > > >> skbs.
> > > > > > >>
> > > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > > >> Signed-off-by: Mina Almasry 
> > > > > > >>
> > > > > > >> ---
> > > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > > >>  include/net/tcp.h  |  5 +--
> > > > > > >>  net/core/datagram.c|  6 
> > > > > > >>  net/core/gro.c |  5 ++-
> > > > > > >>  net/core/skbuff.c  | 77 
> > > > > > >> --
> > > > > > >>  net/ipv4/tcp.c |  6 
> > > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > >>
> > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > >> --- a/include/linux/skbuff.h
> > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > >>   *  @csum_level: indicates the number of consecutive checksums 
> > > > > > >> found in
> > > > > > >>   *  the packet minus one that have been verified as
> > > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > > >> + *  @devmem: indicates that all the fragments in this skb are 
> > > > > > >> backed by
> > > > > > >> + *  device memory.
> > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step 
> > > > > > >> required
> > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > >>  __u8csum_not_inet:1;
> > > > > > >>  #endif
> > > > > > >> -
> > > > > > >> +__u8devmem:1;
> > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > >>  __u16   tc_index;   /* traffic control 
> > > > > > >> index */
> > > > > > >>  #endif
> > > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > > >>  }
> > > > > > >>
> > > > > > >> +/* Return true if frags in this skb are not readable by the 
> > > > > > >> host. */
> > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff 
> > > > > > >> *skb)
> > > > > > >> +{
> > > > > > >> +return skb->devmem;
> > > > > > >
> > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > > > > 'not_readable'?
> > > > > > > It better communicates the fact that the stack shouldn't 
> > > > > > > dereference the
> > > > > > > frags (because it has 'devmem' fragments or for some other 
> > > > > > > potential
> > > > > > > future reason).
> > > > > >
> > > > > > +1.
> > > > > >
> > > > > > Also, the flag on the skb is an optimization - a high level signal 
> > > > > > that
> > > > > > one or more frags is in unreadable memory. There is no requirement 
> > > > > > that
> > > > > > all of the frags are in the same memory type.
> > > >
> > > > David: maybe there should be such a requirement (that they all are
> > > > unreadable)? Might be easier to support initially; we can relax later
> > > > on.
> > > >
> > >
> > > Currently devmem == not_readable, and the restriction is that all the
> > > frags in the same skb must be either all readable or 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 3:18 PM, Mina Almasry wrote:
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  __u8csum_not_inet:1;
>>  #endif
>> -
>> +__u8devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  __u16   tc_index;   /* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void 
>> skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>  __skb_zcopy_downgrade_managed(skb);
>>  }
>>
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +return skb->devmem;
>
> bikeshedding: should we also rename 'devmem' sk_buff flag to 
> 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

 +1.

 Also, the flag on the skb is an optimization - a high level signal that
 one or more frags is in unreadable memory. There is no requirement that
 all of the frags are in the same memory type.
>>
>> David: maybe there should be such a requirement (that they all are
>> unreadable)? Might be easier to support initially; we can relax later
>> on.
>>
> 
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).

What requires that restriction? In all of the uses of skb->devmem and
skb_frags_not_readable() what matters is if any frag is not readable,
then frag list walk or collapse is avoided.



Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Mina Almasry
On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  wrote:
> > > > >
> > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > On 11/05, Mina Almasry wrote:
> > > > > >> For device memory TCP, we expect the skb headers to be available 
> > > > > >> in host
> > > > > >> memory for access, and we expect the skb frags to be in device 
> > > > > >> memory
> > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > >> matching of device memory frags (unaccessible) with host memory 
> > > > > >> frags
> > > > > >> (accessible) in the same skb.
> > > > > >>
> > > > > >> Add a skb->devmem flag which indicates whether the frags in this 
> > > > > >> skb
> > > > > >> are device memory frags or not.
> > > > > >>
> > > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > > >> page_pool_iovs,
> > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > >>
> > > > > >> Add checks through the network stack to avoid accessing the frags 
> > > > > >> of
> > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > >>
> > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > >> Signed-off-by: Mina Almasry 
> > > > > >>
> > > > > >> ---
> > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > >>  include/net/tcp.h  |  5 +--
> > > > > >>  net/core/datagram.c|  6 
> > > > > >>  net/core/gro.c |  5 ++-
> > > > > >>  net/core/skbuff.c  | 77 
> > > > > >> --
> > > > > >>  net/ipv4/tcp.c |  6 
> > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > >>
> > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > >> --- a/include/linux/skbuff.h
> > > > > >> +++ b/include/linux/skbuff.h
> > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > >>   *  @csum_level: indicates the number of consecutive checksums 
> > > > > >> found in
> > > > > >>   *  the packet minus one that have been verified as
> > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > >> + *  @devmem: indicates that all the fragments in this skb are 
> > > > > >> backed by
> > > > > >> + *  device memory.
> > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > >>   *  @decrypted: Decrypted SKB
> > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step 
> > > > > >> required
> > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > >>  __u8csum_not_inet:1;
> > > > > >>  #endif
> > > > > >> -
> > > > > >> +__u8devmem:1;
> > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > >>  __u16   tc_index;   /* traffic control 
> > > > > >> index */
> > > > > >>  #endif
> > > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > >>  }
> > > > > >>
> > > > > >> +/* Return true if frags in this skb are not readable by the host. 
> > > > > >> */
> > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff 
> > > > > >> *skb)
> > > > > >> +{
> > > > > >> +return skb->devmem;
> > > > > >
> > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > > > 'not_readable'?
> > > > > > It better communicates the fact that the stack shouldn't 
> > > > > > dereference the
> > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > future reason).
> > > > >
> > > > > +1.
> > > > >
> > > > > Also, the flag on the skb is an optimization - a high level signal 
> > > > > that
> > > > > one or more frags is in unreadable memory. There is no requirement 
> > > > > that
> > > > > all of the frags are in the same memory type.
> > >
> > > David: maybe there should be such a requirement (that they all are
> > > unreadable)? Might be easier to support initially; we can relax later
> > > on.
> > >
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > (all devmem or all non-devmem).
> >
> > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > specifically, not generic 'not_readable' frags as the comment says:
> > > >
> > > > + * @devmem: indicates that all the fragments in this skb are 
> > > > backed by
> > 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Kaiyuan Zhang
>
> But there is still always 1 dmabuf to 1 socket association (on rx), right?
>
In practice yes, but my understanding is that such association is only
enforced by NIC features such as flow steering.

So why not have a separate control channel action to say: this socket fd
> is supposed to receive into this dmabuf fd?
>
 Are you proposing also adding the installation of corresponding flow
steering into this action? Or just add checks to make sure the flow
steering won't be removed?

On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev  wrote:

> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev 
> wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern 
> wrote:
> > > > >
> > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > On 11/05, Mina Almasry wrote:
> > > > > >> For device memory TCP, we expect the skb headers to be
> available in host
> > > > > >> memory for access, and we expect the skb frags to be in device
> memory
> > > > > >> and unaccessible to the host. We expect there to be no mixing
> and
> > > > > >> matching of device memory frags (unaccessible) with host memory
> frags
> > > > > >> (accessible) in the same skb.
> > > > > >>
> > > > > >> Add a skb->devmem flag which indicates whether the frags in
> this skb
> > > > > >> are device memory frags or not.
> > > > > >>
> > > > > >> __skb_fill_page_desc() now checks frags added to skbs for
> page_pool_iovs,
> > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > >>
> > > > > >> Add checks through the network stack to avoid accessing the
> frags of
> > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem
> skbs.
> > > > > >>
> > > > > >> Signed-off-by: Willem de Bruijn 
> > > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > > >> Signed-off-by: Mina Almasry 
> > > > > >>
> > > > > >> ---
> > > > > >>  include/linux/skbuff.h | 14 +++-
> > > > > >>  include/net/tcp.h  |  5 +--
> > > > > >>  net/core/datagram.c|  6 
> > > > > >>  net/core/gro.c |  5 ++-
> > > > > >>  net/core/skbuff.c  | 77
> --
> > > > > >>  net/ipv4/tcp.c |  6 
> > > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > >>
> > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > >> --- a/include/linux/skbuff.h
> > > > > >> +++ b/include/linux/skbuff.h
> > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > >>   *  @csum_level: indicates the number of consecutive checksums
> found in
> > > > > >>   *  the packet minus one that have been verified as
> > > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > > >> + *  @devmem: indicates that all the fragments in this skb are
> backed by
> > > > > >> + *  device memory.
> > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > >>   *  @decrypted: Decrypted SKB
> > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step
> required
> > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > >>  __u8csum_not_inet:1;
> > > > > >>  #endif
> > > > > >> -
> > > > > >> +__u8devmem:1;
> > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > >>  __u16   tc_index;   /* traffic control
> index */
> > > > > >>  #endif
> > > > > >> @@ -1766,6 +1768,12 @@ static inline void
> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > > >>  }
> > > > > >>
> > > > > >> +/* Return true if frags in this skb are not readable by the
> host. */
> > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff
> *skb)
> > > > > >> +{
> > > > > >> +return skb->devmem;
> > > > > >
> > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to
> 'not_readable'?
> > > > > > It better communicates the fact that the stack shouldn't
> dereference the
> > > > > > frags (because it has 'devmem' fragments or for some other
> potential
> > > > > > future reason).
> > > > >
> > > > > +1.
> > > > >
> > > > > Also, the flag on the skb is an optimization - a high level signal
> that
> > > > > one or more frags is in unreadable memory. There is no requirement
> that
> > > > > all of the frags are in the same memory type.
> > >
> > > David: maybe there should be such a requirement (that they all are
> > > unreadable)? Might be easier to support initially; we can relax later
> > > on.
> > >
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  wrote:
> > > >
> > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > On 11/05, Mina Almasry wrote:
> > > > >> For device memory TCP, we expect the skb headers to be available in 
> > > > >> host
> > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > >> (accessible) in the same skb.
> > > > >>
> > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > >> are device memory frags or not.
> > > > >>
> > > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > > >> page_pool_iovs,
> > > > >> and marks the skb as skb->devmem accordingly.
> > > > >>
> > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > >>
> > > > >> Signed-off-by: Willem de Bruijn 
> > > > >> Signed-off-by: Kaiyuan Zhang 
> > > > >> Signed-off-by: Mina Almasry 
> > > > >>
> > > > >> ---
> > > > >>  include/linux/skbuff.h | 14 +++-
> > > > >>  include/net/tcp.h  |  5 +--
> > > > >>  net/core/datagram.c|  6 
> > > > >>  net/core/gro.c |  5 ++-
> > > > >>  net/core/skbuff.c  | 77 
> > > > >> --
> > > > >>  net/ipv4/tcp.c |  6 
> > > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > >>  net/packet/af_packet.c |  4 +--
> > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > >>
> > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > >> --- a/include/linux/skbuff.h
> > > > >> +++ b/include/linux/skbuff.h
> > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > >>   *  @csum_level: indicates the number of consecutive checksums 
> > > > >> found in
> > > > >>   *  the packet minus one that have been verified as
> > > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > > >> + *  @devmem: indicates that all the fragments in this skb are 
> > > > >> backed by
> > > > >> + *  device memory.
> > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > >>   *  @decrypted: Decrypted SKB
> > > > >>   *  @slow_gro: state present at GRO time, slower prepare step 
> > > > >> required
> > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > >>  __u8csum_not_inet:1;
> > > > >>  #endif
> > > > >> -
> > > > >> +__u8devmem:1;
> > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > >>  __u16   tc_index;   /* traffic control 
> > > > >> index */
> > > > >>  #endif
> > > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > >>  __skb_zcopy_downgrade_managed(skb);
> > > > >>  }
> > > > >>
> > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > >> +{
> > > > >> +return skb->devmem;
> > > > >
> > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > > 'not_readable'?
> > > > > It better communicates the fact that the stack shouldn't dereference 
> > > > > the
> > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > future reason).
> > > >
> > > > +1.
> > > >
> > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > one or more frags is in unreadable memory. There is no requirement that
> > > > all of the frags are in the same memory type.
> >
> > David: maybe there should be such a requirement (that they all are
> > unreadable)? Might be easier to support initially; we can relax later
> > on.
> >
> 
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).
> 
> > > The flag indicates that the skb contains all devmem dma-buf memory
> > > specifically, not generic 'not_readable' frags as the comment says:
> > >
> > > + * @devmem: indicates that all the fragments in this skb are backed 
> > > by
> > > + * device memory.
> > >
> > > The reason it's not a generic 'not_readable' flag is because handing
> > > off a generic not_readable skb to the userspace is semantically not
> > > what we're doing. recvmsg() is augmented in this patch series to
> > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > specifically to the memory in the 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Mina Almasry
On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev  wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 11:34 AM David Ahern  wrote:
> > >
> > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > On 11/05, Mina Almasry wrote:
> > > >> For device memory TCP, we expect the skb headers to be available in 
> > > >> host
> > > >> memory for access, and we expect the skb frags to be in device memory
> > > >> and unaccessible to the host. We expect there to be no mixing and
> > > >> matching of device memory frags (unaccessible) with host memory frags
> > > >> (accessible) in the same skb.
> > > >>
> > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > >> are device memory frags or not.
> > > >>
> > > >> __skb_fill_page_desc() now checks frags added to skbs for 
> > > >> page_pool_iovs,
> > > >> and marks the skb as skb->devmem accordingly.
> > > >>
> > > >> Add checks through the network stack to avoid accessing the frags of
> > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > >>
> > > >> Signed-off-by: Willem de Bruijn 
> > > >> Signed-off-by: Kaiyuan Zhang 
> > > >> Signed-off-by: Mina Almasry 
> > > >>
> > > >> ---
> > > >>  include/linux/skbuff.h | 14 +++-
> > > >>  include/net/tcp.h  |  5 +--
> > > >>  net/core/datagram.c|  6 
> > > >>  net/core/gro.c |  5 ++-
> > > >>  net/core/skbuff.c  | 77 --
> > > >>  net/ipv4/tcp.c |  6 
> > > >>  net/ipv4/tcp_input.c   | 13 +--
> > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > >>  net/packet/af_packet.c |  4 +--
> > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > >>
> > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > >> index 1fae276c1353..8fb468ff8115 100644
> > > >> --- a/include/linux/skbuff.h
> > > >> +++ b/include/linux/skbuff.h
> > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > >>   *  @csum_level: indicates the number of consecutive checksums found 
> > > >> in
> > > >>   *  the packet minus one that have been verified as
> > > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > > >> + *  @devmem: indicates that all the fragments in this skb are backed 
> > > >> by
> > > >> + *  device memory.
> > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > >>   *  @decrypted: Decrypted SKB
> > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > >>  __u8csum_not_inet:1;
> > > >>  #endif
> > > >> -
> > > >> +__u8devmem:1;
> > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > >>  __u16   tc_index;   /* traffic control index 
> > > >> */
> > > >>  #endif
> > > >> @@ -1766,6 +1768,12 @@ static inline void 
> > > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > >>  __skb_zcopy_downgrade_managed(skb);
> > > >>  }
> > > >>
> > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > >> +{
> > > >> +return skb->devmem;
> > > >
> > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > > 'not_readable'?
> > > > It better communicates the fact that the stack shouldn't dereference the
> > > > frags (because it has 'devmem' fragments or for some other potential
> > > > future reason).
> > >
> > > +1.
> > >
> > > Also, the flag on the skb is an optimization - a high level signal that
> > > one or more frags is in unreadable memory. There is no requirement that
> > > all of the frags are in the same memory type.
>
> David: maybe there should be such a requirement (that they all are
> unreadable)? Might be easier to support initially; we can relax later
> on.
>

Currently devmem == not_readable, and the restriction is that all the
frags in the same skb must be either all readable or all unreadable
(all devmem or all non-devmem).

> > The flag indicates that the skb contains all devmem dma-buf memory
> > specifically, not generic 'not_readable' frags as the comment says:
> >
> > + * @devmem: indicates that all the fragments in this skb are backed by
> > + * device memory.
> >
> > The reason it's not a generic 'not_readable' flag is because handing
> > off a generic not_readable skb to the userspace is semantically not
> > what we're doing. recvmsg() is augmented in this patch series to
> > return a devmem skb to the user via a cmsg_devmem struct which refers
> > specifically to the memory in the dma-buf. recvmsg() in this patch
> > series is not augmented to give any 'not_readable' skb to the
> > userspace.
> >
> > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > correct. If a new type of unreadable skbs are introduced to the stack,
> > I imagine the stack 

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 11:34 AM David Ahern  wrote:
> >
> > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > On 11/05, Mina Almasry wrote:
> > >> For device memory TCP, we expect the skb headers to be available in host
> > >> memory for access, and we expect the skb frags to be in device memory
> > >> and unaccessible to the host. We expect there to be no mixing and
> > >> matching of device memory frags (unaccessible) with host memory frags
> > >> (accessible) in the same skb.
> > >>
> > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > >> are device memory frags or not.
> > >>
> > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > >> and marks the skb as skb->devmem accordingly.
> > >>
> > >> Add checks through the network stack to avoid accessing the frags of
> > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > >>
> > >> Signed-off-by: Willem de Bruijn 
> > >> Signed-off-by: Kaiyuan Zhang 
> > >> Signed-off-by: Mina Almasry 
> > >>
> > >> ---
> > >>  include/linux/skbuff.h | 14 +++-
> > >>  include/net/tcp.h  |  5 +--
> > >>  net/core/datagram.c|  6 
> > >>  net/core/gro.c |  5 ++-
> > >>  net/core/skbuff.c  | 77 --
> > >>  net/ipv4/tcp.c |  6 
> > >>  net/ipv4/tcp_input.c   | 13 +--
> > >>  net/ipv4/tcp_output.c  |  5 ++-
> > >>  net/packet/af_packet.c |  4 +--
> > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > >>
> > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > >> index 1fae276c1353..8fb468ff8115 100644
> > >> --- a/include/linux/skbuff.h
> > >> +++ b/include/linux/skbuff.h
> > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > >>   *  the packet minus one that have been verified as
> > >>   *  CHECKSUM_UNNECESSARY (max 3)
> > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > >> + *  device memory.
> > >>   *  @dst_pending_confirm: need to confirm neighbour
> > >>   *  @decrypted: Decrypted SKB
> > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > >>  __u8csum_not_inet:1;
> > >>  #endif
> > >> -
> > >> +__u8devmem:1;
> > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > >>  __u16   tc_index;   /* traffic control index */
> > >>  #endif
> > >> @@ -1766,6 +1768,12 @@ static inline void 
> > >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > >>  __skb_zcopy_downgrade_managed(skb);
> > >>  }
> > >>
> > >> +/* Return true if frags in this skb are not readable by the host. */
> > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > >> +{
> > >> +return skb->devmem;
> > >
> > > bikeshedding: should we also rename 'devmem' sk_buff flag to 
> > > 'not_readable'?
> > > It better communicates the fact that the stack shouldn't dereference the
> > > frags (because it has 'devmem' fragments or for some other potential
> > > future reason).
> >
> > +1.
> >
> > Also, the flag on the skb is an optimization - a high level signal that
> > one or more frags is in unreadable memory. There is no requirement that
> > all of the frags are in the same memory type.

David: maybe there should be such a requirement (that they all are
unreadable)? Might be easier to support initially; we can relax later
on.

> The flag indicates that the skb contains all devmem dma-buf memory
> specifically, not generic 'not_readable' frags as the comment says:
> 
> + * @devmem: indicates that all the fragments in this skb are backed by
> + * device memory.
> 
> The reason it's not a generic 'not_readable' flag is because handing
> off a generic not_readable skb to the userspace is semantically not
> what we're doing. recvmsg() is augmented in this patch series to
> return a devmem skb to the user via a cmsg_devmem struct which refers
> specifically to the memory in the dma-buf. recvmsg() in this patch
> series is not augmented to give any 'not_readable' skb to the
> userspace.
> 
> IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> correct. If a new type of unreadable skbs are introduced to the stack,
> I imagine the stack would implement:
> 
> 1. new header flag: skb->newmem
> 2.
> 
> static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> {
> return skb->devmem || skb->newmem;
> }
> 
> 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

You copy it to the userspace in a special way because your frags
are page_is_page_pool_iov(). I agree with David, the skb bit is
just and optimization.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/05, Mina Almasry wrote:
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
> 
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
> 
> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> and marks the skb as skb->devmem accordingly.
> 
> Add checks through the network stack to avoid accessing the frags of
> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 

[..]
 
> - snaplen = skb->len;
> + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
>  
>   res = run_filter(skb, sk, snaplen);
>   if (!res)
> @@ -2279,7 +2279,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> net_device *dev,
>   }
>   }
>  
> - snaplen = skb->len;
> + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
>  
>   res = run_filter(skb, sk, snaplen);
>   if (!res)

Not sure it covers 100% of bpf. We might need to double-check bpf_xdp_copy_buf
which is having its own, non-skb shinfo and frags. And in general, xdp
can reference those shinfo frags early... (xdp part happens
before we create an skb with all devmem association)


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Mina Almasry
On Mon, Nov 6, 2023 at 11:34 AM David Ahern  wrote:
>
> On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > On 11/05, Mina Almasry wrote:
> >> For device memory TCP, we expect the skb headers to be available in host
> >> memory for access, and we expect the skb frags to be in device memory
> >> and unaccessible to the host. We expect there to be no mixing and
> >> matching of device memory frags (unaccessible) with host memory frags
> >> (accessible) in the same skb.
> >>
> >> Add a skb->devmem flag which indicates whether the frags in this skb
> >> are device memory frags or not.
> >>
> >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> >> and marks the skb as skb->devmem accordingly.
> >>
> >> Add checks through the network stack to avoid accessing the frags of
> >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> >>
> >> Signed-off-by: Willem de Bruijn 
> >> Signed-off-by: Kaiyuan Zhang 
> >> Signed-off-by: Mina Almasry 
> >>
> >> ---
> >>  include/linux/skbuff.h | 14 +++-
> >>  include/net/tcp.h  |  5 +--
> >>  net/core/datagram.c|  6 
> >>  net/core/gro.c |  5 ++-
> >>  net/core/skbuff.c  | 77 --
> >>  net/ipv4/tcp.c |  6 
> >>  net/ipv4/tcp_input.c   | 13 +--
> >>  net/ipv4/tcp_output.c  |  5 ++-
> >>  net/packet/af_packet.c |  4 +--
> >>  9 files changed, 115 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 1fae276c1353..8fb468ff8115 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> >>   *  @csum_level: indicates the number of consecutive checksums found in
> >>   *  the packet minus one that have been verified as
> >>   *  CHECKSUM_UNNECESSARY (max 3)
> >> + *  @devmem: indicates that all the fragments in this skb are backed by
> >> + *  device memory.
> >>   *  @dst_pending_confirm: need to confirm neighbour
> >>   *  @decrypted: Decrypted SKB
> >>   *  @slow_gro: state present at GRO time, slower prepare step required
> >> @@ -991,7 +993,7 @@ struct sk_buff {
> >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> >>  __u8csum_not_inet:1;
> >>  #endif
> >> -
> >> +__u8devmem:1;
> >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >>  __u16   tc_index;   /* traffic control index */
> >>  #endif
> >> @@ -1766,6 +1768,12 @@ static inline void 
> >> skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >>  __skb_zcopy_downgrade_managed(skb);
> >>  }
> >>
> >> +/* Return true if frags in this skb are not readable by the host. */
> >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >> +{
> >> +return skb->devmem;
> >
> > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > It better communicates the fact that the stack shouldn't dereference the
> > frags (because it has 'devmem' fragments or for some other potential
> > future reason).
>
> +1.
>
> Also, the flag on the skb is an optimization - a high level signal that
> one or more frags is in unreadable memory. There is no requirement that
> all of the frags are in the same memory type.

The flag indicates that the skb contains all devmem dma-buf memory
specifically, not generic 'not_readable' frags as the comment says:

+ * @devmem: indicates that all the fragments in this skb are backed by
+ * device memory.

The reason it's not a generic 'not_readable' flag is because handing
off a generic not_readable skb to the userspace is semantically not
what we're doing. recvmsg() is augmented in this patch series to
return a devmem skb to the user via a cmsg_devmem struct which refers
specifically to the memory in the dma-buf. recvmsg() in this patch
series is not augmented to give any 'not_readable' skb to the
userspace.

IMHO skb->devmem + an skb_frags_not_readable() as implemented is
correct. If a new type of unreadable skbs are introduced to the stack,
I imagine the stack would implement:

1. new header flag: skb->newmem
2.

static inline bool skb_frags_not_readable(const struct skb_buff *skb)
{
return skb->devmem || skb->newmem;
}

3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

-- 
Thanks,
Mina


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
>> For device memory TCP, we expect the skb headers to be available in host
>> memory for access, and we expect the skb frags to be in device memory
>> and unaccessible to the host. We expect there to be no mixing and
>> matching of device memory frags (unaccessible) with host memory frags
>> (accessible) in the same skb.
>>
>> Add a skb->devmem flag which indicates whether the frags in this skb
>> are device memory frags or not.
>>
>> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
>> and marks the skb as skb->devmem accordingly.
>>
>> Add checks through the network stack to avoid accessing the frags of
>> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>>
>> Signed-off-by: Willem de Bruijn 
>> Signed-off-by: Kaiyuan Zhang 
>> Signed-off-by: Mina Almasry 
>>
>> ---
>>  include/linux/skbuff.h | 14 +++-
>>  include/net/tcp.h  |  5 +--
>>  net/core/datagram.c|  6 
>>  net/core/gro.c |  5 ++-
>>  net/core/skbuff.c  | 77 --
>>  net/ipv4/tcp.c |  6 
>>  net/ipv4/tcp_input.c   | 13 +--
>>  net/ipv4/tcp_output.c  |  5 ++-
>>  net/packet/af_packet.c |  4 +--
>>  9 files changed, 115 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 1fae276c1353..8fb468ff8115 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>>   *  @csum_level: indicates the number of consecutive checksums found in
>>   *  the packet minus one that have been verified as
>>   *  CHECKSUM_UNNECESSARY (max 3)
>> + *  @devmem: indicates that all the fragments in this skb are backed by
>> + *  device memory.
>>   *  @dst_pending_confirm: need to confirm neighbour
>>   *  @decrypted: Decrypted SKB
>>   *  @slow_gro: state present at GRO time, slower prepare step required
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  __u8csum_not_inet:1;
>>  #endif
>> -
>> +__u8devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  __u16   tc_index;   /* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct 
>> sk_buff *skb)
>>  __skb_zcopy_downgrade_managed(skb);
>>  }
>>  
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +return skb->devmem;
> 
> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

+1.

Also, the flag on the skb is an optimization - a high level signal that
one or more frags is in unreadable memory. There is no requirement that
all of the frags are in the same memory type.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread Stanislav Fomichev
On 11/05, Mina Almasry wrote:
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
> 
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
> 
> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> and marks the skb as skb->devmem accordingly.
> 
> Add checks through the network stack to avoid accessing the frags of
> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 
> 
> ---
>  include/linux/skbuff.h | 14 +++-
>  include/net/tcp.h  |  5 +--
>  net/core/datagram.c|  6 
>  net/core/gro.c |  5 ++-
>  net/core/skbuff.c  | 77 --
>  net/ipv4/tcp.c |  6 
>  net/ipv4/tcp_input.c   | 13 +--
>  net/ipv4/tcp_output.c  |  5 ++-
>  net/packet/af_packet.c |  4 +--
>  9 files changed, 115 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1fae276c1353..8fb468ff8115 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>   *   @csum_level: indicates the number of consecutive checksums found in
>   *   the packet minus one that have been verified as
>   *   CHECKSUM_UNNECESSARY (max 3)
> + *   @devmem: indicates that all the fragments in this skb are backed by
> + *   device memory.
>   *   @dst_pending_confirm: need to confirm neighbour
>   *   @decrypted: Decrypted SKB
>   *   @slow_gro: state present at GRO time, slower prepare step required
> @@ -991,7 +993,7 @@ struct sk_buff {
>  #if IS_ENABLED(CONFIG_IP_SCTP)
>   __u8csum_not_inet:1;
>  #endif
> -
> + __u8devmem:1;
>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>   __u16   tc_index;   /* traffic control index */
>  #endif
> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct 
> sk_buff *skb)
>   __skb_zcopy_downgrade_managed(skb);
>  }
>  
> +/* Return true if frags in this skb are not readable by the host. */
> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> +{
> + return skb->devmem;

bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
It better communicates the fact that the stack shouldn't dereference the
frags (because it has 'devmem' fragments or for some other potential
future reason).