Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers

2024-06-09 Thread David Wei
On 2024-06-07 17:52, Jason Gunthorpe wrote:
> IMHO it seems to compose poorly if you can only use the io_uring
> lifecycle model with io_uring registered memory, and not with DMABUF
> memory registered through Mina's mechanism.

By this, do you mean that io_uring must be used exclusively in order to use
this feature?

And you'd rather see the two decoupled, so userspace can register with, say,
dmabuf and then pass it to io_uring?

> 
> Jason


Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers

2024-06-09 Thread David Wei
On 2024-06-07 17:27, David Ahern wrote:
> I also do not understand why the ifq cache and overloading xdp functions
> have stuck around; I always thought both were added by Jonathan to
> simplify kernel ports during early POC days.

Setting up an Rx queue for ZC with a different page pool will be done properly
using the new queue API that Mina merged recently. Those custom XDP hooks will
be gone in a non-RFC patchset.


Re: [PATCH net-next v10 01/14] netdev: add netdev_rx_queue_restart()

2024-05-30 Thread David Wei
On 2024-05-30 13:16, Mina Almasry wrote:
[...]
> +err_start_queue:
> + /* Restarting the queue with old_mem should be successful as we haven't
> +  * changed any of the queue configuration, and there is not much we can
> +  * do to recover from a failure here.
> +  *
> +  * WARN if the we fail to recover the old rx queue, and at least free
> +  * old_mem so we don't also leak that.
> +  */
> + if (dev->queue_mgmt_ops->ndo_queue_start(dev, old_mem, rxq_idx)) {
> + WARN(1,
> +  "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
> +  rxq_idx);
> + dev->queue_mgmt_ops->ndo_queue_mem_free(dev, &old_mem);

This should be ->ndo_queue_mem_free(dev, old_mem).
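I.e., the error path would end up looking roughly like this (untested):

	if (dev->queue_mgmt_ops->ndo_queue_start(dev, old_mem, rxq_idx)) {
		WARN(1,
		     "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
		     rxq_idx);
		dev->queue_mgmt_ops->ndo_queue_mem_free(dev, old_mem);
	}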


Re: [PATCH net-next v9 11/14] tcp: RX path for devmem TCP

2024-05-23 Thread David Wei
On 2024-05-10 16:21, Mina Almasry wrote:
> +/* On error, returns the -errno. On success, returns number of bytes sent to the
> + * user. May not consume all of @remaining_len.
> + */
> +static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> +   unsigned int offset, struct msghdr *msg,
> +   int remaining_len)
> +{
> + struct dmabuf_cmsg dmabuf_cmsg = { 0 };
> + struct tcp_xa_pool tcp_xa_pool;
> + unsigned int start;
> + int i, copy, n;
> + int sent = 0;
> + int err = 0;
> +
> + tcp_xa_pool.max = 0;
> + tcp_xa_pool.idx = 0;
> + do {
> + start = skb_headlen(skb);
> +
> + if (skb_frags_readable(skb)) {
> + err = -ENODEV;
> + goto out;
> + }
> +
> + /* Copy header. */
> + copy = start - offset;
> + if (copy > 0) {
> + copy = min(copy, remaining_len);
> +
> + n = copy_to_iter(skb->data + offset, copy,
> +  &msg->msg_iter);
> + if (n != copy) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + offset += copy;
> + remaining_len -= copy;
> +
> + /* First a dmabuf_cmsg for # bytes copied to user
> +  * buffer.
> +  */
> + memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg));
> + dmabuf_cmsg.frag_size = copy;
> + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR,
> +sizeof(dmabuf_cmsg), &dmabuf_cmsg);
> + if (err || msg->msg_flags & MSG_CTRUNC) {
> + msg->msg_flags &= ~MSG_CTRUNC;
> + if (!err)
> + err = -ETOOSMALL;
> + goto out;
> + }
> +
> + sent += copy;
> +
> + if (remaining_len == 0)
> + goto out;
> + }
> +
> + /* after that, send information of dmabuf pages through a
> +  * sequence of cmsg
> +  */
> + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> + skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> + struct net_iov *niov;
> + u64 frag_offset;
> + int end;
> +
> + /* !skb_frags_readable() should indicate that ALL the
> +  * frags in this skb are dmabuf net_iovs. We're checking
> +  * for that flag above, but also check individual frags
> +  * here. If the tcp stack is not setting
> +  * skb_frags_readable() correctly, we still don't want
> +  * to crash here.
> +  */
> + if (!skb_frag_net_iov(frag)) {
> + net_err_ratelimited("Found non-dmabuf skb with net_iov");
> + err = -ENODEV;
> + goto out;
> + }
> +
> + niov = skb_frag_net_iov(frag);

Sorry if we've already discussed this.

We have this additional hunk:

+ if (niov->pp->mp_ops != &dmabuf_devmem_ops) {
+   err = -ENODEV;
+   goto out;
+ }

This is in case one of our skbs ends up here, where skb_frag_is_net_iov() is
true and !skb_frags_readable(). Does this even matter? And if so, is there a
better way to distinguish between our two types of net_iovs?
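Something like this is what I had in mind, if comparing mp_ops is the right
way to tell the two apart (helper name hypothetical, untested):

	static inline bool net_is_devmem_iov(const struct net_iov *niov)
	{
		return niov->pp->mp_ops == &dmabuf_devmem_ops;
	}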


Re: [PATCH net-next v9 04/14] netdev: support binding dma-buf to netdevice

2024-05-18 Thread David Wei
On 2024-05-10 16:21, Mina Almasry wrote:
> -/* Stub */
>  int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
>  {
> - return 0;
> + struct nlattr *tb[ARRAY_SIZE(netdev_queue_dmabuf_nl_policy)];
> + struct net_devmem_dmabuf_binding *out_binding;
> + struct list_head *sock_binding_list;
> + u32 ifindex, dmabuf_fd, rxq_idx;
> + struct net_device *netdev;
> + struct sk_buff *rsp;
> + struct nlattr *attr;
> + int rem, err = 0;
> + void *hdr;
> +
> + if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
> + GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) ||
> + GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUES))
> + return -EINVAL;
> +
> + ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> + dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]);
> +
> + rtnl_lock();
> +
> + netdev = __dev_get_by_index(genl_info_net(info), ifindex);
> + if (!netdev) {
> + err = -ENODEV;
> + goto err_unlock;
> + }
> +
> + err = net_devmem_bind_dmabuf(netdev, dmabuf_fd, &out_binding);
> + if (err)
> + goto err_unlock;
> +
> + nla_for_each_attr(attr, genlmsg_data(info->genlhdr),
> +   genlmsg_len(info->genlhdr), rem) {
> + if (nla_type(attr) != NETDEV_A_BIND_DMABUF_QUEUES)
> + continue;
> +
> + err = nla_parse_nested(
> + tb, ARRAY_SIZE(netdev_queue_dmabuf_nl_policy) - 1, attr,
> + netdev_queue_dmabuf_nl_policy, info->extack);
> + if (err < 0)
> + goto err_unbind;
> +
> + rxq_idx = nla_get_u32(tb[NETDEV_A_QUEUE_DMABUF_IDX]);
> + if (rxq_idx >= netdev->num_rx_queues) {
> + err = -ERANGE;
> + goto err_unbind;
> + }

net_devmem_bind_dmabuf_to_queue() checks for rxq_idx >=
netdev->num_rx_queues as well. I'd say remove the duplicate check here in
netdev_nl_bind_rx_doit().

Also, we may want a generic netdev helper, e.g. netdev_rx_queue_set_mp(),
since I need the same functionality; see the sketch below.
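Rough sketch of what I mean, assuming the mp_params plumbing from the later
patches is in place (helper name and exact shape are hypothetical, untested):

	int netdev_rx_queue_set_mp(struct net_device *dev, u32 rxq_idx,
				   const struct memory_provider_ops *ops,
				   void *priv)
	{
		struct netdev_rx_queue *rxq;

		if (rxq_idx >= dev->num_rx_queues)
			return -ERANGE;

		rxq = __netif_get_rx_queue(dev, rxq_idx);
		rxq->mp_params.mp_ops = ops;
		rxq->mp_params.mp_priv = priv;

		/* takes effect once the queue is restarted with the new pp */
		return netdev_rx_queue_restart(dev, rxq_idx);
	}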

> +
> + err = net_devmem_bind_dmabuf_to_queue(netdev, rxq_idx,
> +   out_binding);
> + if (err)
> + goto err_unbind;
> + }
> +
> + sock_binding_list = genl_sk_priv_get(&netdev_nl_family,
> +  NETLINK_CB(skb).sk);
> + if (IS_ERR(sock_binding_list)) {
> + err = PTR_ERR(sock_binding_list);
> + goto err_unbind;
> + }
> +
> + list_add(&out_binding->list, sock_binding_list);
> +
> + rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> + if (!rsp) {
> + err = -ENOMEM;
> + goto err_unbind;
> + }
> +
> + hdr = genlmsg_iput(rsp, info);
> + if (!hdr) {
> + err = -EMSGSIZE;
> + goto err_genlmsg_free;
> + }
> +
> + nla_put_u32(rsp, NETDEV_A_BIND_DMABUF_DMABUF_ID, out_binding->id);
> + genlmsg_end(rsp, hdr);
> +
> + rtnl_unlock();
> +
> + return genlmsg_reply(rsp, info);
> +
> +err_genlmsg_free:
> + nlmsg_free(rsp);
> +err_unbind:
> + net_devmem_unbind_dmabuf(out_binding);
> +err_unlock:
> + rtnl_unlock();
> + return err;
>  }


Re: [PATCH net-next v9 04/14] netdev: support binding dma-buf to netdevice

2024-05-18 Thread David Wei
On 2024-05-10 16:21, Mina Almasry wrote:
> +void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + unsigned long xa_idx;
> + unsigned int rxq_idx;
> +
> + if (!binding)
> + return;
> +
> + if (binding->list.next)
> + list_del(&binding->list);
> +
> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
> + if (rxq->mp_params.mp_priv == binding) {
> + /* We hold the rtnl_lock while binding/unbinding
> +  * dma-buf, so we can't race with another thread that
> +  * is also modifying this value. However, the page_pool
> +  * may read this config while it's creating its
> +  * rx-queues. WRITE_ONCE() here to match the
> +  * READ_ONCE() in the page_pool.
> +  */
> + WRITE_ONCE(rxq->mp_params.mp_ops, NULL);
> + WRITE_ONCE(rxq->mp_params.mp_priv, NULL);
> +
> + rxq_idx = get_netdev_rx_queue_index(rxq);
> +
> + netdev_rx_queue_restart(binding->dev, rxq_idx);

What if netdev_rx_queue_restart() fails? Depending on where it fails, a queue
might still be filled from the struct net_devmem_dmabuf_binding being torn
down. This is one downside of netdev_rx_queue_restart() needing to do
allocations each time.

Perhaps do a full reset if an individual queue restart fails?
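Something along these lines as a last resort, perhaps (untested sketch; the
full-reset fallback is just an idea, and rtnl is already held here per the
comment above):

	rxq_idx = get_netdev_rx_queue_index(rxq);

	if (netdev_rx_queue_restart(binding->dev, rxq_idx)) {
		/* hypothetical fallback: bounce the whole device so no queue
		 * keeps pulling buffers from the binding we are tearing down
		 */
		dev_close(binding->dev);
		dev_open(binding->dev, NULL);
	}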



Re: [PATCH net-next v9 05/14] netdev: netdevice devmem allocator

2024-05-17 Thread David Wei
On 2024-05-10 16:21, Mina Almasry wrote:
> +/* This returns the absolute dma_addr_t calculated from
> + * net_iov_owner(niov)->owner->base_dma_addr, not the page_pool-owned
> + * niov->dma_addr.
> + *
> + * The absolute dma_addr_t is a dma_addr_t that is always uncompressed.
> + *
> + * The page_pool-owned niov->dma_addr is the absolute dma_addr compressed into
> + * an unsigned long. Special handling is done when the unsigned long is 32-bit
> + * but the dma_addr_t is 64-bit.
> + *
> + * In general code looking for the dma_addr_t should use net_iov_dma_addr(),
> + * while page_pool code looking for the unsigned long dma_addr which mirrors
> + * the field in struct page should use niov->dma_addr.
> + */
> +static inline dma_addr_t net_iov_dma_addr(const struct net_iov *niov)
> +{
> + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
> +
> + return owner->base_dma_addr +
> +((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
> +}

This part feels devmem TCP specific, yet the function is in netmem.h. Please
consider moving it into devmem.{h,c}, which makes it less likely that people
who haven't read your comment will try using it.

> +
> +static inline struct net_devmem_dmabuf_binding *
> +net_iov_binding(const struct net_iov *niov)
> +{
> + return net_iov_owner(niov)->binding;
> +}
> +
>  /* netmem */
>  
>  /**
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index d82f92d7cf9ce..1f90e23a81441 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -54,6 +54,42 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
>   kfree(binding);
>  }
>  
> +struct net_iov *
> +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
> +{
> + struct dmabuf_genpool_chunk_owner *owner;
> + unsigned long dma_addr;
> + struct net_iov *niov;
> + ssize_t offset;
> + ssize_t index;
> +
> + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
> + (void **)&owner);
> + if (!dma_addr)
> + return NULL;
> +
> + offset = dma_addr - owner->base_dma_addr;
> + index = offset / PAGE_SIZE;
> + niov = &owner->niovs[index];
> +
> + niov->dma_addr = 0;
> +
> + net_devmem_dmabuf_binding_get(binding);
> +
> + return niov;
> +}
> +
> +void net_devmem_free_dmabuf(struct net_iov *niov)
> +{
> + struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov);
> + unsigned long dma_addr = net_iov_dma_addr(niov);
> +
> + if (gen_pool_has_addr(binding->chunk_pool, dma_addr, PAGE_SIZE))
> + gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
> +
> + net_devmem_dmabuf_binding_put(binding);
> +}
> +
>  /* Protected by rtnl_lock() */
>  static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
>  


Re: [RFC PATCH net-next v8 07/14] page_pool: devmem support

2024-05-05 Thread David Wei
On 2024-05-01 00:55, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 05:17:52PM -0700, David Wei wrote:
>> On 2024-04-02 5:20 pm, Mina Almasry wrote:
>>> @@ -69,20 +106,26 @@ net_iov_binding(const struct net_iov *niov)
>>>   */
>>>  typedef unsigned long __bitwise netmem_ref;
>>>  
>>> +static inline bool netmem_is_net_iov(const netmem_ref netmem)
>>> +{
>>> +#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_DMA_SHARED_BUFFER)
>>
>> I am guessing you added this to try and speed up the fast path? It's
>> overly restrictive for us since we do not need dmabuf necessarily. I
>> spent a bit too much time wondering why things aren't working only to
>> find this :(
> 
> So what else do you need?  I was assured last round that nothing but
> dmabuf and potentially the huge page case (that really just is the page
> provider) would get added.

I'm using userspace memory so having this gated behind
CONFIG_DMA_SHARED_BUFFER doesn't make sense for us.
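For illustration, the sort of gate that would work for us is the existing LSB
check without the dmabuf Kconfig dependency (a sketch assuming the pointer-LSB
flag this series already uses; I'm calling it NET_IOV here):

	static inline bool netmem_is_net_iov(const netmem_ref netmem)
	{
	#if defined(CONFIG_PAGE_POOL)
		return (__force unsigned long)netmem & NET_IOV;
	#else
		return false;
	#endif
	}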

> 
>>
> ---end quoted text---


Re: [RFC PATCH net-next v8 07/14] page_pool: devmem support

2024-04-26 Thread David Wei
On 2024-04-02 5:20 pm, Mina Almasry wrote:
> @@ -69,20 +106,26 @@ net_iov_binding(const struct net_iov *niov)
>   */
>  typedef unsigned long __bitwise netmem_ref;
>  
> +static inline bool netmem_is_net_iov(const netmem_ref netmem)
> +{
> +#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_DMA_SHARED_BUFFER)

I am guessing you added this to try and speed up the fast path? It's
overly restrictive for us since we do not need dmabuf necessarily. I
spent a bit too much time wondering why things aren't working only to
find this :(


Re: [RFC PATCH net-next v8 04/14] netdev: support binding dma-buf to netdevice

2024-04-24 Thread David Wei
On 2024-04-02 5:20 pm, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> rx queues on the netdevice. On the binding, the dma_buf_attach
> & dma_buf_map_attachment will occur. The entries in the sg_table from
> mapping will be inserted into a genpool to make it ready
> for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> net_iov. We setup the net_iov allocation size in the
> genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leaves an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v8:
> - move dmabuf_devmem_ops usage to later patch to avoid patch-by-patch
>   build error.
> 
> v7:
> - Use IS_ERR() instead of IS_ERR_OR_NULL() for the dma_buf_get() return
>   value.
> - Changes netdev_* naming in devmem.c to net_devmem_* (Yunsheng).
> - DMA_BIDIRECTIONAL -> DMA_FROM_DEVICE (Yunsheng).
> - Added a comment around recovering of the old rx queue in
>   net_devmem_restart_rx_queue(), and added freeing of old_mem if the
>   restart of the old queue fails. (Yunsheng).
> - Use kernel-family sock-priv (Jakub).
> - Put pp_memory_provider_params in netdev_rx_queue instead of the
>   dma-buf specific binding (Pavel & David).
> - Move queue management ops to queue_mgmt_ops instead of netdev_ops
>   (Jakub).
> - Remove excess whitespaces (Jakub).
> - Use genlmsg_iput (Jakub).
> 
> v6:
> - Validate rx queue index
> - Refactor new functions into devmem.c (Pavel)
> 
> v5:
> - Renamed page_pool_iov to net_iov, and moved that support to devmem.h
>   or netmem.h.
> 
> v1:
> 
> - Introduce devmem.h instead of bloating netdevice.h (Jakub)
> - ENOTSUPP -> EOPNOTSUPP (checkpatch.pl I think)
> - Remove unneeded rcu protection for binding->list (rtnl protected)
> - Removed extraneous err_binding_put: label.
> - Removed dma_addr += len (Paolo).
> - Don't override err on netdev_bind_dmabuf_to_queue failure.
> - Rename devmem -> dmabuf (David).
> - Add id to dmabuf binding (David/Stan).
> - Fix missing xa_destroy bound_rq_list.
> - Use queue api to reset bound RX queues (Jakub).
> - Update netlink API for rx-queue type (tx/re) (Jakub).
> 
> RFC v3:
> - Support multi rx-queue binding
> 
> ---
>  Documentation/netlink/specs/netdev.yaml |   4 +
>  include/net/devmem.h| 111 +
>  include/net/netdev_rx_queue.h   |   2 +
>  include/net/netmem.h|  10 +
>  include/net/page_pool/types.h   |   5 +
>  net/core/Makefile   |   2 +-
>  net/core/dev.c  |   3 +
>  net/core/devmem.c   | 303 
>  net/core/netdev-genl-gen.c  |   4 +
>  net/core/netdev-genl-gen.h  |   4 +
>  net/core/netdev-genl.c  | 105 +++-
>  11 files changed, 550 insertions(+), 3 deletions(-)
>  create mode 100644 include/net/devmem.h
>  create mode 100644 net/core/devmem.c
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 275d1faa87a6..bf4e58dfe9dd 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -550,6 +550,10 @@ operations:
>  - tx-packets
>  - tx-bytes
>  
> +kernel-family:
> +  headers: [ "linux/list.h"]
> +  sock-priv: struct list_head
> +
>  mcast-groups:
>list:
>  -
> diff --git a/include/net/devmem.h b/include/net/devmem.h
> new file mode 100644
> index ..fa03bdabdffd
> --- /dev/null
> +++ b/include/net/devmem.h
> @@ -0,0 +1,111 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Device memory TCP support
> + *
> + * Authors:  Mina Almasry 
> + *   Willem de Bruijn 
> + *   Kaiyuan Zhang 
> + *
> + */
> +#ifndef _NET_DEVMEM_H
> +#define _NET_DEVMEM_H
> +
> +struct net_devmem_dmabuf_binding {
> + struct dma_buf *dmabuf;
> + struct dma_buf_attachment *attachment;
> + struct sg_table *sgt;
> + struct net_device *dev;
> + struct 

Re: [RFC PATCH net-next v6 02/15] net: page_pool: create hooks for custom page providers

2024-03-17 Thread David Wei
On 2024-03-17 19:02, Christoph Hellwig wrote:
> On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
>> From: Jakub Kicinski 
>>
>> The page providers which try to reuse the same pages will
>> need to hold onto the ref, even if page gets released from
>> the pool - as in releasing the page from the pp just transfers
>> the "ownership" reference from pp to the provider, and provider
>> will wait for other references to be gone before feeding this
>> page back into the pool.
> 
> The word hook always rings a giant warning bell for me, and looking into
> this series I am concerned indeed.
> 
> The only provider provided here is the dma-buf one, and that basically
> is the only sensible one for the documented design.  So instead of
> adding hooks that random proprietary crap can hook into, why not hard
> code the dma buf provide and just use a flag?  That'll also avoid
> expensive indirect calls.

I'm working on a similar proposal for zero copy Rx, but to host memory, and it
depends on this memory provider API.

Jakub also designed this API with hugepages in mind, IIRC. Basically there are
going to be at least three fancy ways of providing pages to drivers (one of
which isn't actually pages, hence the merged netmem_t series).
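To make the multiplexing concrete, the hot path would dispatch to a provider
roughly like this, assuming the mp_ops/static-key plumbing this patch adds
(fast-path caching elided, so this is a sketch rather than the actual function
body):

	struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
	{
		if (static_branch_unlikely(&page_pool_mem_providers) &&
		    pool->mp_ops)
			return pool->mp_ops->alloc_pages(pool, gfp);

		return __page_pool_alloc_pages_slow(pool, gfp);
	}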

> 


Re: [RFC PATCH net-next v6 01/15] queue_api: define queue api

2024-03-08 Thread David Wei
On 2024-03-04 18:01, Mina Almasry wrote:
> This API enables the net stack to reset the queues used for devmem.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
>  include/linux/netdevice.h | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index c41019f34179..3105c586355d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1435,6 +1435,20 @@ struct netdev_net_notifier {
>   *  struct kernel_hwtstamp_config *kernel_config,
>   *  struct netlink_ext_ack *extack);
>   *   Change the hardware timestamping parameters for NIC device.
> + *
> + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx);
> + *   Allocate memory for an RX queue. The memory returned in the form of
> + *   a void * can be passed to ndo_queue_mem_free() for freeing or to
> + *   ndo_queue_start to create an RX queue with this memory.
> + *
> + * void  (*ndo_queue_mem_free)(struct net_device *dev, void *);
> + *   Free memory from an RX queue.
> + *
> + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *);
> + *   Start an RX queue at the specified index.
> + *
> + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **);
> + *   Stop the RX queue at the specified index.
>   */
>  struct net_device_ops {
>   int (*ndo_init)(struct net_device *dev);
> @@ -1679,6 +1693,16 @@ struct net_device_ops {
>   int (*ndo_hwtstamp_set)(struct net_device *dev,
> + struct kernel_hwtstamp_config *kernel_config,
> + struct netlink_ext_ack *extack);
> + void *  (*ndo_queue_mem_alloc)(struct net_device *dev,
> +int idx);
> + void(*ndo_queue_mem_free)(struct net_device *dev,
> +   void *queue_mem);
> + int (*ndo_queue_start)(struct net_device *dev,
> +int idx,
> +void *queue_mem);
> + int (*ndo_queue_stop)(struct net_device *dev,
> +   int idx,
> +   void **out_queue_mem);
>  };

I'm working to port bnxt over to using this API. What are your thoughts on
maybe pulling this out and using bnxt to drive it?
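For reference, my mental model of how the core would drive these ops is
roughly the following (untested sketch, error handling simplified; signatures
as in the kernel-doc above):

	static int queue_restart_example(struct net_device *dev, int idx)
	{
		void *new_mem, *old_mem;
		int err;

		new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, idx);
		if (!new_mem)
			return -ENOMEM;

		err = dev->netdev_ops->ndo_queue_stop(dev, idx, &old_mem);
		if (err)
			goto err_free_new;

		err = dev->netdev_ops->ndo_queue_start(dev, idx, new_mem);
		if (err)
			goto err_restart_old;

		dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
		return 0;

	err_restart_old:
		/* best effort: put the old queue back */
		dev->netdev_ops->ndo_queue_start(dev, idx, old_mem);
	err_free_new:
		dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
		return err;
	}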

>  
>  /**


Re: [RFC PATCH net-next v6 02/15] net: page_pool: create hooks for custom page providers

2024-03-07 Thread David Wei
On 2024-03-04 18:01, Mina Almasry wrote:
> From: Jakub Kicinski 
> 
> The page providers which try to reuse the same pages will
> need to hold onto the ref, even if page gets released from
> the pool - as in releasing the page from the pp just transfers
> the "ownership" reference from pp to the provider, and provider
> will wait for other references to be gone before feeding this
> page back into the pool.
> 
> Signed-off-by: Jakub Kicinski 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> This is implemented by Jakub in his RFC:
> https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168...@redhat.com/T/
> 
> I take no credit for the idea or implementation; I only added minor
> edits to make this workable with device memory TCP, and removed some
> hacky test code. This is a critical dependency of device memory TCP
> and thus I'm pulling it into this series to make it revewable and
> mergeable.
> 
> RFC v3 -> v1
> - Removed unusued mem_provider. (Yunsheng).
> - Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).
> 
> ---
>  include/net/page_pool/types.h | 12 ++
>  net/core/page_pool.c  | 43 +++
>  2 files changed, 50 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 5e43a08d3231..ffe5f31fb0da 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -52,6 +52,7 @@ struct pp_alloc_cache {
>   * @dev: device, for DMA pre-mapping purposes
>   * @netdev:  netdev this pool will serve (leave as NULL if none or multiple)
>   * @napi:NAPI which is the sole consumer of pages, otherwise NULL
> + * @queue:   struct netdev_rx_queue this page_pool is being created for.
>   * @dma_dir: DMA mapping direction
>   * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
>   * @offset:  DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
> @@ -64,6 +65,7 @@ struct page_pool_params {
>   int nid;
>   struct device   *dev;
>   struct napi_struct *napi;
> + struct netdev_rx_queue *queue;
>   enum dma_data_direction dma_dir;
>   unsigned intmax_len;
>   unsigned intoffset;
> @@ -126,6 +128,13 @@ struct page_pool_stats {
>  };
>  #endif
>  
> +struct memory_provider_ops {
> + int (*init)(struct page_pool *pool);
> + void (*destroy)(struct page_pool *pool);
> + struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> + bool (*release_page)(struct page_pool *pool, struct page *page);
> +};

Separate question as I try to adapt bnxt to this and your queue
configuration API.

How does GVE handle the need to allocate kernel pages for headers and
dmabuf for payloads?

Reading the code, struct gve_rx_ring is the main per-ring object with a page
pool. gve_queue_page_lists are filled with netmem allocations from the page
pool in gve_alloc_queue_page_list(). Are these strictly used for payloads
only?

I found a struct gve_header_buf in both gve_rx_ring and struct
gve_per_rx_queue_mem_dpo. This is allocated in gve_rx_queue_mem_alloc()
using dma_alloc_coherent(). Is this where GVE stores headers?

IOW, GVE only uses page pool to allocate memory for QPLs, and QPLs are
used by the device for split payloads. Is my understanding correct?

> +
>  struct page_pool {
>   struct page_pool_params_fast p;
>  
> @@ -176,6 +185,9 @@ struct page_pool {
>*/
>   struct ptr_ring ring;
>  
> + void *mp_priv;
> + const struct memory_provider_ops *mp_ops;
> +
>  #ifdef CONFIG_PAGE_POOL_STATS
>   /* recycle stats are per-cpu to avoid locking */
>   struct page_pool_recycle_stats __percpu *recycle_stats;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index d706fe5548df..8776fcad064a 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -25,6 +25,8 @@
>  
>  #include "page_pool_priv.h"
>  
> +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
> +
>  #define DEFER_TIME (msecs_to_jiffies(1000))
>  #define DEFER_WARN_INTERVAL (60 * HZ)
>  
> @@ -177,6 +179,7 @@ static int page_pool_init(struct page_pool *pool,
> int cpuid)
>  {
>   unsigned int ring_qsize = 1024; /* Default */
> + int err;
>  
> + memcpy(&pool->p, &params->fast, sizeof(pool->p));
> + memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
> @@ -248,10 +251,25 @@ static int page_pool_init(struct page_pool *pool,
>   /* Driver calling page_pool_create() also call page_pool_destroy() */
> + refcount_set(&pool->user_cnt, 1);
>  
> + if (pool->mp_ops) {
> + err = pool->mp_ops->init(pool);
> + if (err) {
> + pr_warn("%s() mem-provider init failed %d\n",
> + __func__, err);
> + goto free_ptr_ring;
> + }
> +
> + static_branch_inc(&page_pool_mem_providers);
> + }
> +
>   if (pool->p.flags & 

Re: [RFC PATCH net-next v6 09/15] memory-provider: dmabuf devmem memory provider

2024-03-05 Thread David Wei
On 2024-03-05 18:42, Mina Almasry wrote:
> On Tue, Mar 5, 2024 at 6:28 PM David Wei  wrote:
>>
>> On 2024-03-04 18:01, Mina Almasry wrote:
>>> + if (pool->p.queue)
>>> + binding = READ_ONCE(pool->p.queue->binding);
>>> +
>>> + if (binding) {
>>> + pool->mp_ops = &dmabuf_devmem_ops;
>>> + pool->mp_priv = binding;
>>> + }
>>
>> This is specific to TCP devmem. For ZC Rx we will need something more
>> generic to let us pass our own memory provider backend down to the page
>> pool.
>>
>> What about storing ops and priv void ptr in struct netdev_rx_queue
>> instead? Then we can both use it.
> 
> Yes, this is dmabuf specific, I was thinking you'd define your own
> member of netdev_rx_queue, and then add something like this to
> page_pool_init:
> 
> +   if (pool->p.queue)
> +   io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
> +
> +   /* We don't support rx-queues that are configured for both io_uring & dmabuf binding */
> +   BUG_ON(io_uring_metadata && binding);
> +
> +   if (io_uring_metadata) {
> +   pool->mp_ops = &io_uring_ops;
> +   pool->mp_priv = io_uring_metadata;
> +   }
> 
> I.e., we share the pool->mp_ops and the pool->mp_priv but we don't
> really need to share the same netdev_rx_queue member. For me it's a
> dma-buf specific data structure (netdev_dmabuf_binding) and for you
> it's something else.

This adds size to struct netdev_rx_queue and requires checks on whether both
are set. There can be thousands of these structs at any one time, so it would
be best not to add size unnecessarily.

We can disambiguate by comparing the mp_ops pointers and then casting the void
ptr to our implementation-specific objects; see the sketch below.

What do you not like about this approach?
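Concretely, something like this is what I'm picturing (struct and field names
are placeholders, untested):

	struct pp_memory_provider_params {
		const struct memory_provider_ops *mp_ops;
		void *mp_priv;
	};

	/* in page_pool_init(), roughly: */
	if (pool->p.queue) {
		pool->mp_ops = READ_ONCE(pool->p.queue->mp_params.mp_ops);
		pool->mp_priv = READ_ONCE(pool->p.queue->mp_params.mp_priv);
	}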

> 
> page_pool_init() probably needs to validate that the queue is
> configured for dma-buf or io_uring but not both. If it's configured
> for both then the user is doing something funky we shouldn't support.
> 
> Perhaps I can make the intention clearer by renaming 'binding' to
> something more specific to dma-buf like queue->dmabuf_binding, to make
> it clear that this is the dma-buf binding and not some other binding
> like io_uring?
> 


Re: [RFC PATCH net-next v6 09/15] memory-provider: dmabuf devmem memory provider

2024-03-05 Thread David Wei
On 2024-03-04 18:01, Mina Almasry wrote:
> + if (pool->p.queue)
> + binding = READ_ONCE(pool->p.queue->binding);
> +
> + if (binding) {
> + pool->mp_ops = &dmabuf_devmem_ops;
> + pool->mp_priv = binding;
> + }

This is specific to TCP devmem. For ZC Rx we will need something more
generic to let us pass our own memory provider backend down to the page
pool.

What about storing ops and priv void ptr in struct netdev_rx_queue
instead? Then we can both use it.


Re: [RFC PATCH net-next v6 02/15] net: page_pool: create hooks for custom page providers

2024-03-05 Thread David Wei
On 2024-03-04 18:01, Mina Almasry wrote:
> +struct memory_provider_ops {
> + int (*init)(struct page_pool *pool);
> + void (*destroy)(struct page_pool *pool);
> + struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> + bool (*release_page)(struct page_pool *pool, struct page *page);

For ZC Rx we added a scrub() function to memory_provider_ops that is called
from page_pool_scrub(). Does TCP devmem not need custom behaviour to wait for
all netmem_refs to return before destroying the page pool? What happens if,
say, the application crashes?
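For reference, this is roughly the shape of what we added for ZC Rx (a sketch
of our local change, not part of this series):

	struct memory_provider_ops {
		int (*init)(struct page_pool *pool);
		void (*destroy)(struct page_pool *pool);
		struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
		bool (*release_page)(struct page_pool *pool, struct page *page);
		/* called from page_pool_scrub() so the provider can reclaim
		 * outstanding netmem_refs, e.g. after the app crashes
		 */
		void (*scrub)(struct page_pool *pool);
	};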


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-08 Thread David Wei
On 2023-11-07 15:03, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>>
>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:

 On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

 netdev_{alloc,free}_dmabuf?

>>>
>>> Can do.
>>>
 I say that because a dmabuf can be host memory, at least I am not aware
 of a restriction that a dmabuf is device memory.

>>>
>>> In my limited experience dma-buf is generally device memory, and
>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>> dma-buf with a memfd which I think is used for testing. But I can do
>>> the rename, it's more clear anyway, I think.
>>
>> config UDMABUF
>> bool "userspace dmabuf misc driver"
>> default n
>> depends on DMA_SHARED_BUFFER
>> depends on MEMFD_CREATE || COMPILE_TEST
>> help
>>   A driver to let userspace turn memfd regions into dma-bufs.
>>   Qemu can use this to create host dmabufs for guest framebuffers.
>>
>>
>> Qemu is just a userspace process; it is no way a special one.
>>
>> Treating host memory as a dmabuf should radically simplify the io_uring
>> extension of this set.
> 
> I agree actually, and I was about to make that comment to David Wei's
> series once I have the time.
> 
> David, your io_uring RX zerocopy proposal actually works with devmem
> TCP, if you're inclined to do that instead, what you'd do roughly is
> (I think):
> 
> - Allocate a memfd,
> - Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
> - Bind the dma-buf to the NIC using the netlink API in this RFC.
> - Your io_uring extensions and io_uring uapi should work as-is almost
> on top of this series, I think.
> 
> If you do this the incoming packets should land into your memfd, which
> may or may not work for you. In the future if you feel inclined to use
> device memory, this approach that I'm describing here would be more
> extensible to device memory, because you'd already be using dma-bufs
> for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
> with another.
> 

How would TCP devmem change if we no longer assume that dmabuf is device
memory? Pavel will know more on the perf side, but I wouldn't want to
put any if/else on the hot path if we can avoid it. I could be wrong,
but right now in my mind using different memory providers solves this
neatly and the driver/networking stack doesn't need to care.

Mina, I believe you said at NetDev conf that you already had a udmabuf
implementation for testing. I would like to see this (you can send it
privately) to see how TCP devmem would handle both user memory and device
memory.
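For context, the userspace side of the flow you describe above would look
roughly like this (untested sketch of the memfd + UDMABUF_CREATE steps; error
handling omitted):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/udmabuf.h>

	static int create_udmabuf(size_t size)
	{
		struct udmabuf_create create = { 0 };
		int memfd, devfd, buffd;

		memfd = memfd_create("rxbuf", MFD_ALLOW_SEALING);
		ftruncate(memfd, size);
		fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

		devfd = open("/dev/udmabuf", O_RDWR);
		create.memfd = memfd;
		create.offset = 0;
		create.size = size;

		/* returns a dma-buf fd backed by the memfd pages; this fd is
		 * what would be bound to rx queues via the netlink API
		 */
		buffd = ioctl(devfd, UDMABUF_CREATE, &create);
		return buffd;
	}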

>> That the io_uring set needs to dive into
>> page_pools is just wrong - complicating the design and code and pushing
>> io_uring into a realm it does not need to be involved in.
>>
>> Most (all?) of this patch set can work with any memory; only device
>> memory is unreadable.
>>
>>
> 
> 


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-08 Thread David Wei
On 2023-11-07 14:55, David Ahern wrote:
> On 11/7/23 3:10 PM, Mina Almasry wrote:
>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>>
>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index eeeda849115c..1c351c138a5b 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
  };

  #ifdef CONFIG_DMA_SHARED_BUFFER
 +struct page_pool_iov *
 +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
 +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>
>>> netdev_{alloc,free}_dmabuf?
>>>
>>
>> Can do.
>>
>>> I say that because a dmabuf can be host memory, at least I am not aware
>>> of a restriction that a dmabuf is device memory.
>>>
>>
>> In my limited experience dma-buf is generally device memory, and
>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>> dma-buf with a memfd which I think is used for testing. But I can do
>> the rename, it's more clear anyway, I think.
> 
> config UDMABUF
> bool "userspace dmabuf misc driver"
> default n
> depends on DMA_SHARED_BUFFER
> depends on MEMFD_CREATE || COMPILE_TEST
> help
>   A driver to let userspace turn memfd regions into dma-bufs.
>   Qemu can use this to create host dmabufs for guest framebuffers.
> 
> 
> Qemu is just a userspace process; it is no way a special one.
> 
> Treating host memory as a dmabuf should radically simplify the io_uring
> extension of this set. That the io_uring set needs to dive into
> page_pools is just wrong - complicating the design and code and pushing
> io_uring into a realm it does not need to be involved in.

I think our io_uring proposal will already be vastly simplified once we
rebase onto Kuba's page pool memory provider API. Using udmabuf means
depending on a driver designed for testing, vs io_uring's registered
buffers API that's been tried and tested.

I don't have an intuitive understanding of the trade-offs yet, and would need
to try out udmabuf and compare it against, say, using our own page pool memory
provider.

> 
> Most (all?) of this patch set can work with any memory; only device
> memory is unreadable.
> 
> 


Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice

2023-11-08 Thread David Wei
On 2023-11-05 18:44, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> rx queues on the netdevice. On the binding, the dma_buf_attach
> & dma_buf_map_attachment will occur. The entries in the sg_table from
> mapping will be inserted into a genpool to make it ready
> for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> page_pool_iov. We setup the page_pool_iov allocation size in the
> genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leaves an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> RFC v3:
> - Support multi rx-queue binding
> 
> ---
>  include/linux/netdevice.h |  80 ++
>  include/net/netdev_rx_queue.h |   1 +
>  include/net/page_pool/types.h |  27 +
>  net/core/dev.c| 203 ++
>  net/core/netdev-genl.c| 116 ++-
>  5 files changed, 425 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b8bf669212cc..eeeda849115c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  struct netpoll_info;
>  struct device;
> @@ -808,6 +810,84 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
>  #endif
>  #endif /* CONFIG_RPS */
>  
> +struct netdev_dmabuf_binding {
> + struct dma_buf *dmabuf;
> + struct dma_buf_attachment *attachment;
> + struct sg_table *sgt;
> + struct net_device *dev;
> + struct gen_pool *chunk_pool;
> +
> + /* The user holds a ref (via the netlink API) for as long as they want
> +  * the binding to remain alive. Each page pool using this binding holds
> +  * a ref to keep the binding alive. Each allocated page_pool_iov holds a
> +  * ref.
> +  *
> +  * The binding undos itself and unmaps the underlying dmabuf once all
> +  * those refs are dropped and the binding is no longer desired or in
> +  * use.
> +  */
> + refcount_t ref;
> +
> + /* The portid of the user that owns this binding. Used for netlink to
> +  * notify us of the user dropping the bind.
> +  */
> + u32 owner_nlportid;
> +
> + /* The list of bindings currently active. Used for netlink to notify us
> +  * of the user dropping the bind.
> +  */
> + struct list_head list;
> +
> + /* rxq's this binding is active on. */
> + struct xarray bound_rxq_list;
> +};
> +
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> +struct netdev_dmabuf_binding **out);
> +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding);
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> + struct netdev_dmabuf_binding *binding);
> +#else
> +static inline void
> +__netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> +}
> +
> +static inline int netdev_bind_dmabuf(struct net_device *dev,
> +  unsigned int dmabuf_fd,
> +  struct netdev_dmabuf_binding **out)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline void netdev_unbind_dmabuf(struct netdev_dmabuf_binding 
> *binding)
> +{
> +}
> +
> +static inline int
> +netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> + struct netdev_dmabuf_binding *binding)
> +{
> + return -EOPNOTSUPP;
> +}
> +#endif
> +
> +static inline void
> +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> +{
> + refcount_inc(&binding->ref);
> +}
> +
> +static inline void
> +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> +{
> + if (!refcount_dec_and_test(&binding->ref))
> + return;
> +
> + __netdev_devmem_binding_free(binding);
> +}
> 

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-09-07 Thread David Wei
On 19/08/2023 13:24, Mina Almasry wrote:
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
>  wrote:
>>
>>
>>
>> On 19/08/2023 16.08, Willem de Bruijn wrote:
>>> On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
>>>  wrote:



 On 10/08/2023 03.57, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>
> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
>

 I don't like of this approach.
 This is adding code to the PP (page_pool) fast-path in multiple places.

 I've not had time to run my usual benchmarks, which are here:

 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

> 
> Thank you for linking that, I'll try to run these against the next submission.
> 
 But I'm sure it will affect performance.

 Regardless of performance, this approach is using ptr-LSB-bits, to hide
 that page-pointer are not really struct-pages, feels like force feeding
 a solution just to use the page_pool APIs.


> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
>
> The memory providers producing page_pool_iov can set the LSB on the
> struct page* returned to the page pool.
>
> Note that instead of overloading the LSB of page pointers, we can
> instead define a new union between struct page & struct page_pool_iov and
> compact it in a new type. However, we'd need to implement the code churn
> to modify the page_pool & drivers to use this new type. For this POC
> that is not implemented (feedback welcome).
>

 I've said before, that I prefer multiplexing on page->pp_magic.
 For your page_pool_iov the layout would have to match the offset of
 pp_magic, to do this. (And if insisting on using PP infra the refcnt
 would also need to align).
>>>
>>> Perhaps I misunderstand, but this suggests continuing to using
>>> struct page to demultiplex memory type?
>>>
>>
>> (Perhaps we are misunderstanding each-other and my use of the words
>> multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
>> native language.)
>>
>> I do see the problem of depending on having a struct page, as the
>> page_pool_iov isn't related to struct page.  Having "page" in the name
>> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
>> as we all know).
>>
>> To support more allocator types, perhaps skb->pp_recycle bit need to
>> grow another bit (and be renamed skb->recycle), so we can tell allocator
>> types apart, those that are page based and those whom are not.
>>
>>
>>> I think the feedback has been strong to not multiplex yet another
>>> memory type into that struct, that is not a real page. Which is why
>>> we went into this direction. This latest series limits the impact largely
>>> to networking structures and code.
>>>
>>
>> Some what related what I'm objecting to: the "page_pool_iov" is not a
>> real page, but this getting recycled into something called "page_pool",
>> which funny enough deals with struct-pages internally and depend on the
>> struct-page-refcnt.
>>
>> Given the approach changed way from using struct page, then I also don't
>> see the connection with the page_pool. Sorry.
>>
>>> One way or another, there will be a branch and multiplexing. Whether
>>> that is in struct page, the page pool or a new netdev mem type as you
>>> propose.
>>>
>>
>> I'm asking to have this branch/multiplexing done a the call sites.
>>
>> (IMHO not changing the drivers is a pipe-dream.)
>>
> 
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
> 
> driver ---> new_layer (does multiplexing) ---> page_pool ---> mm layer
>                                           \--> devmem_pool --> dma-buf layer
> 
> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
> 
> driver --> page_pool ---> memory_provider (does multiplexing) ---> basic_provider ---> mm layer
>                                                              \--> devmem_provider --> dma-buf layer
> 
> That is the approach in this RFC.
> 
> I think proper naming that makes sense can be figured out, and is not
> a huge issue. I think in both cases we can minimize the changes to the
> drivers, maybe. In the first approach the driver will need to use the
> APIs created by the new layer. In the second approach, the driver
> continues to use page_pool APIs.
> 
> I think we need to converge on a path between 

Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice

2023-09-07 Thread David Wei
On 09/08/2023 18:57, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> an rx queue on the netdevice. On the binding, the dma_buf_attach
> & dma_buf_map_attachment will occur. The entries in the sg_table from
> mapping will be inserted into a genpool to make it ready
> for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> page_pool_iov. We setup the page_pool_iov allocation size in the
> genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leaves an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool. This issue/weirdness is highlighted in the memory provider
> proposal[1], and I'm hoping that some generic solution for all
> memory providers will be discussed; this patch doesn't address that
> weirdness again.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> [1] https://lore.kernel.org/netdev/20230707183935.997267-1-k...@kernel.org/
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> 
> Signed-off-by: Mina Almasry 
> ---
>  include/linux/netdevice.h |  57 
>  include/net/page_pool.h   |  27 ++
>  net/core/dev.c| 178 ++
>  net/core/netdev-genl.c| 101 -
>  4 files changed, 361 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3800d0479698..1b7c5966d2ca 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -53,6 +53,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  struct netpoll_info;
>  struct device;
> @@ -782,6 +784,55 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
>  #endif
>  #endif /* CONFIG_RPS */
> 
> +struct netdev_dmabuf_binding {
> + struct dma_buf *dmabuf;
> + struct dma_buf_attachment *attachment;
> + struct sg_table *sgt;
> + struct net_device *dev;
> + struct gen_pool *chunk_pool;
> +
> + /* The user holds a ref (via the netlink API) for as long as they want
> +  * the binding to remain alive. Each page pool using this binding holds
> +  * a ref to keep the binding alive. Each allocated page_pool_iov holds a
> +  * ref.
> +  *
> +  * The binding undos itself and unmaps the underlying dmabuf once all
> +  * those refs are dropped and the binding is no longer desired or in
> +  * use.
> +  */
> + refcount_t ref;
> +
> + /* The portid of the user that owns this binding. Used for netlink to
> +  * notify us of the user dropping the bind.
> +  */
> + u32 owner_nlportid;
> +
> + /* The list of bindings currently active. Used for netlink to notify us
> +  * of the user dropping the bind.
> +  */
> + struct list_head list;
> +
> + /* rxq's this binding is active on. */
> + struct xarray bound_rxq_list;
> +};
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> +
> +static inline void
> +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> +{
> + refcount_inc(&binding->ref);
> +}
> +
> +static inline void
> +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> +{
> + if (!refcount_dec_and_test(&binding->ref))
> + return;
> +
> + __netdev_devmem_binding_free(binding);
> +}
> +
>  /* This structure contains an instance of an RX queue. */
>  struct netdev_rx_queue {
>   struct xdp_rxq_info xdp_rxq;
> @@ -796,6 +847,7 @@ struct netdev_rx_queue {
>  #ifdef CONFIG_XDP_SOCKETS
>   struct xsk_buff_pool*pool;
>  #endif
> + struct netdev_dmabuf_binding*binding;
>  } cacheline_aligned_in_smp;
> 
>  /*
> @@ -5026,6 +5078,11 @@ void netif_set_tso_max_segs(struct net_device *dev, unsigned int segs);
>  void netif_inherit_tso_max(struct net_device *to,
>  const struct net_device *from);
> 
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding);
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> + u32 rxq_idx,
> + struct netdev_dmabuf_binding 

Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-23 Thread David Wei
On 17/08/2023 15:18, Mina Almasry wrote:
> On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov  
> wrote:
>>
>> On 8/14/23 02:12, David Ahern wrote:
>>> On 8/9/23 7:57 PM, Mina Almasry wrote:
>>>> Changes in RFC v2:
>>>> --
>> ...
>>>> ** Test Setup
>>>>
>>>> Kernel: net-next with this RFC and memory provider API cherry-picked
>>>> locally.
>>>>
>>>> Hardware: Google Cloud A3 VMs.
>>>>
>>>> NIC: GVE with header split & RSS & flow steering support.
>>>
>>> This set seems to depend on Jakub's memory provider patches and a netdev
>>> driver change which is not included. For the testing mentioned here, you
>>> must have a tree + branch with all of the patches. Is it publicly available?
>>>
>>> It would be interesting to see how well (easy) this integrates with
>>> io_uring. Besides avoiding all of the syscalls for receiving the iov and
>>> releasing the buffers back to the pool, io_uring also brings in the
>>> ability to seed a page_pool with registered buffers which provides a
>>> means to get simpler Rx ZC for host memory.
>>
>> The patchset sounds pretty interesting. I've been working with David Wei
>> (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old
>> similar approaches based on allocating an rx queue. It targets host
>> memory and device memory as an extra feature, uapi is different, lifetimes
>> are managed/bound to io_uring. Completions/buffers are returned to user via
>> a separate queue instead of cmsg, and pushed back granularly to the kernel
>> via another queue. I'll leave it to David to elaborate
>>
>> It sounds like we have space for collaboration here, if not merging then
>> reusing internals as much as we can, but we'd need to look into the
>> details deeper.
>>
> 
> I'm happy to look at your implementation and collaborate on something
> that works for both use cases. Feel free to share unpolished prototype
> so I can start having a general idea if possible.

Hi, I'm David, and I'm working with Pavel on this. We will have something to
share with you on the mailing list before the end of the week.

I'm also preparing a submission for NetDev conf. I wonder if you and others at
Google plan to present there as well? If so, then we may want to coordinate our
submissions and talks (if accepted).

Please let me know this week, thanks!

> 
>>> Overall I like the intent and possibilities for extensions, but a lot of
>>> details are missing - perhaps some are answered by seeing an end-to-end
>>> implementation.
>>
>> --
>> Pavel Begunkov
> 
> 
>