On 11/19, Bobby Eshleman wrote:
> From: Bobby Eshleman <[email protected]>
> 
> Add support for autorelease toggling of tokens using a static branch to
> control system-wide behavior. This allows applications to choose between
> two memory management modes:
> 
> 1. Autorelease on: Leaked tokens are automatically released when the
>    socket closes.
> 
> 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
> 
> The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> binding is disallowed and is rejected by netlink. The system will be
> "locked" into the mode that the first binding is set to. It can only be
> changed again once there are zero bindings on the system.
> 
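Side note for anyone wiring this up from userspace: with the YNL-generated
helpers (the same ones ncdevmem uses), I'd expect the call site to end up
looking roughly like the sketch below. The autorelease setter name is my
guess at what ynl-gen will emit from the spec change in this patch, so
treat the whole thing as illustrative rather than tested:

	struct netdev_bind_rx_req *req;
	struct netdev_bind_rx_rsp *rsp;

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, ifindex);
	netdev_bind_rx_req_set_fd(req, dmabuf_fd);
	__netdev_bind_rx_req_set_queues(req, queues, n_queues);
	/* 1 = release leaked tokens on close; omitting it defaults to 0 */
	netdev_bind_rx_req_set_autorelease(req, 1);

	rsp = netdev_bind_rx(ys, req);
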
> Disabling autorelease offers ~13% improvement in CPU utilization.
> 
> Static branching is used to limit the system to one mode or the other.
> 
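(Worth spelling out why a static key rather than, say, a per-binding flag:
the per-packet mode check becomes patched jump instructions instead of a
load plus conditional branch, so the inactive mode costs effectively
nothing in the hot path. I'd expect the consumers to look roughly like
this hypothetical sketch, not lifted from the patch:)

	if (static_branch_unlikely(&tcp_devmem_ar_key)) {
		/* autorelease: track the uref so socket close can drop it */
	} else {
		/* manual: leaked urefs are swept at dmabuf unbind */
	}
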
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> Changes in v7:
> - implement autorelease with static branch (Stan)
> - use netlink instead of sockopt (Stan)
> - merge uAPI and implementation patches into one patch (seemed less
>   confusing)
> 
> Changes in v6:
> - remove sk_devmem_info.autorelease, using binding->autorelease instead
> - move binding->autorelease check to outside of
>   net_devmem_dmabuf_binding_put_urefs() (Mina)
> - remove overly defensive net_is_devmem_iov() (Mina)
> - add comment about multiple urefs mapping to a single netmem ref (Mina)
> - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
> - use niov without casting back and forth with netmem (Mina)
> - move the autorelease flag from per-binding to per-socket (Mina)
> - remove the batching logic in sock_devmem_dontneed_manual_release()
>   (Mina)
> - move autorelease check inside tcp_xa_pool_commit() (Mina)
> - remove single-binding restriction for autorelease mode (Mina)
> - unbind always checks for leaked urefs
> 
> Changes in v5:
> - remove unused variables
> - introduce autorelease flag, preparing for a future patch to toggle
>   the new behavior
> 
> Changes in v3:
> - make urefs per-binding instead of per-socket, reducing memory
>   footprint
> - fall back to cleaning up references in dmabuf unbind if the socket
>   leaked tokens
> - drop ethtool patch
> 
> Changes in v2:
> - always use GFP_ZERO for binding->vec (Mina)
> - remove WARN for changed binding (Mina)
> - remove extraneous binding ref get (Mina)
> - remove WARNs on invalid user input (Mina)
> - pre-assign niovs in binding->vec for RX case (Mina)
> - use atomic_set(, 0) to initialize sk_user_frags.urefs
> - fix length of alloc for urefs
> ---
>  Documentation/netlink/specs/netdev.yaml |  12 ++++
>  include/net/netmem.h                    |   1 +
>  include/net/sock.h                      |   7 +-
>  include/uapi/linux/netdev.h             |   1 +
>  net/core/devmem.c                       | 109 +++++++++++++++++++++++++++-----
>  net/core/devmem.h                       |  11 +++-
>  net/core/netdev-genl-gen.c              |   5 +-
>  net/core/netdev-genl.c                  |  13 +++-
>  net/core/sock.c                         |  57 +++++++++++++++--
>  net/ipv4/tcp.c                          |  78 ++++++++++++++++++-----
>  net/ipv4/tcp_ipv4.c                     |  13 +++-
>  net/ipv4/tcp_minisocks.c                |   3 +-
>  tools/include/uapi/linux/netdev.h       |   1 +
>  13 files changed, 262 insertions(+), 49 deletions(-)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 82bf5cb2617d..913fccca4c4e 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -562,6 +562,17 @@ attribute-sets:
>          type: u32
>          checks:
>            min: 1
> +      -
> +        name: autorelease
> +        doc: |
> +          Token autorelease mode. If true (1), leaked tokens are
> +          automatically released when the socket closes. If false
> +          (0), leaked tokens are only released when the dmabuf is
> +          unbound. Once a binding is created with a specific mode,
> +          all subsequent bindings system-wide must use the same
> +          mode.
> +
> +          Optional. Defaults to false if not specified.
> +        type: u8
>  
>  operations:
>    list:
> @@ -767,6 +778,7 @@ operations:
>              - ifindex
>              - fd
>              - queues
> +            - autorelease
>          reply:
>            attributes:
>              - id
> diff --git a/include/net/netmem.h b/include/net/netmem.h
> index 9e10f4ac50c3..80d2263ba4ed 100644
> --- a/include/net/netmem.h
> +++ b/include/net/netmem.h
> @@ -112,6 +112,7 @@ struct net_iov {
>       };
>       struct net_iov_area *owner;
>       enum net_iov_type type;
> +     atomic_t uref;
>  };
>  
>  struct net_iov_area {
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a5f36ea9d46f..797b21148945 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -350,7 +350,7 @@ struct sk_filter;
>    *  @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
>    *  @sk_scm_unused: unused flags for scm_recv()
>    *  @ns_tracker: tracker for netns reference
> -  *  @sk_user_frags: xarray of pages the user is holding a reference on.
> +  *  @sk_devmem_info: the devmem binding information for the socket
>    *  @sk_owner: reference to the real owner of the socket that calls
>    *             sock_lock_init_class_and_name().
>    */
> @@ -579,7 +579,10 @@ struct sock {
>       struct numa_drop_counters *sk_drop_counters;
>       struct rcu_head         sk_rcu;
>       netns_tracker           ns_tracker;
> -     struct xarray           sk_user_frags;
> +     struct {
> +             struct xarray                           frags;
> +             struct net_devmem_dmabuf_binding        *binding;
> +     } sk_devmem_info;
>  
>  #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
>       struct module           *sk_owner;
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 048c8de1a130..dff0be8223a4 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -206,6 +206,7 @@ enum {
>       NETDEV_A_DMABUF_QUEUES,
>       NETDEV_A_DMABUF_FD,
>       NETDEV_A_DMABUF_ID,
> +     NETDEV_A_DMABUF_AUTORELEASE,
>  
>       __NETDEV_A_DMABUF_MAX,
>       NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 4dee2666dd07..bba21c6cb195 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -11,6 +11,7 @@
>  #include <linux/genalloc.h>
>  #include <linux/mm.h>
>  #include <linux/netdevice.h>
> +#include <linux/skbuff_ref.h>
>  #include <linux/types.h>
>  #include <net/netdev_queues.h>
>  #include <net/netdev_rx_queue.h>
> @@ -28,6 +29,17 @@
>  
>  static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
>  
> +/* Static key to lock down autorelease to a single mode on a system.  When
> + * enabled: autorelease mode (leaked tokens released on socket close).  When
> + * disabled: manual mode (leaked tokens released on dmabuf unbind).  Once the
> + * first binding is created, the mode is locked system-wide and cannot be
> + * changed until the system has zero bindings again.
> + *
> + * Protected by xa_lock of net_devmem_dmabuf_bindings.
> + */
> +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> +EXPORT_SYMBOL(tcp_devmem_ar_key);
> +
>  static const struct memory_provider_ops dmabuf_devmem_ops;
>  
>  bool net_is_devmem_iov(struct net_iov *niov)
> @@ -116,6 +128,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
>       gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
>  }
>  
> +static void
> +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> +{
> +     int i;
> +
> +     for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> +             struct net_iov *niov;
> +             netmem_ref netmem;
> +
> +             niov = binding->vec[i];
> +             netmem = net_iov_to_netmem(niov);
> +
> +             /* Multiple urefs map to only a single netmem ref. */
> +             if (atomic_xchg(&niov->uref, 0) > 0)
> +                     WARN_ON_ONCE(!napi_pp_put_page(netmem));
> +     }
> +}
> +
>  void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>  {
>       struct netdev_rx_queue *rxq;
> @@ -143,6 +173,10 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>               __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
>       }
>  
> +     /* Clean up any lingering urefs from sockets that had autorelease
> +      * disabled.
> +      */
> +     net_devmem_dmabuf_binding_put_urefs(binding);
>       net_devmem_dmabuf_binding_put(binding);
>  }
>  
> @@ -179,8 +213,10 @@ struct net_devmem_dmabuf_binding *
>  net_devmem_bind_dmabuf(struct net_device *dev,
>                      struct device *dma_dev,
>                      enum dma_data_direction direction,
> -                    unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> -                    struct netlink_ext_ack *extack)
> +                    unsigned int dmabuf_fd,
> +                    struct netdev_nl_sock *priv,
> +                    struct netlink_ext_ack *extack,
> +                    bool autorelease)
>  {
>       struct net_devmem_dmabuf_binding *binding;
>       static u32 id_alloc_next;
> @@ -231,14 +267,13 @@ net_devmem_bind_dmabuf(struct net_device *dev,
>               goto err_detach;
>       }
>  
> -     if (direction == DMA_TO_DEVICE) {
> -             binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> -                                           sizeof(struct net_iov *),
> -                                           GFP_KERNEL);
> -             if (!binding->vec) {
> -                     err = -ENOMEM;
> -                     goto err_unmap;
> -             }
> +     /* Used by TX, and also by RX when the socket has autorelease disabled */
> +     binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> +                                   sizeof(struct net_iov *),
> +                                   GFP_KERNEL | __GFP_ZERO);
> +     if (!binding->vec) {
> +             err = -ENOMEM;
> +             goto err_unmap;
>       }
>  
>       /* For simplicity we expect to make PAGE_SIZE allocations, but the
> @@ -292,25 +327,67 @@ net_devmem_bind_dmabuf(struct net_device *dev,
>                       niov = &owner->area.niovs[i];
>                       niov->type = NET_IOV_DMABUF;
>                       niov->owner = &owner->area;
> +                     atomic_set(&niov->uref, 0);
>                       page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
>                                                     net_devmem_get_dma_addr(niov));
> -                     if (direction == DMA_TO_DEVICE)
> -                             binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> +                     binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
>               }
>  
>               virtual += len;
>       }
>  
> -     err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> -                           binding, xa_limit_32b, &id_alloc_next,
> -                           GFP_KERNEL);
> +     /* Enforce system-wide autorelease mode consistency for RX bindings.
> +      * TX bindings don't use autorelease (always false) since tokens aren't
> +      * leaked in the TX path. RX bindings must all use the same
> +      * autorelease mode, never mixed.
> +      *
> +      * We use the xarray's lock to atomically check xa_empty() and toggle
> +      * the static key, avoiding the race where two new bindings may try to
> +      * set the static key to different states.
> +      */
> +     xa_lock(&net_devmem_dmabuf_bindings);
> +
> +     if (direction == DMA_FROM_DEVICE) {
> +             if (!xa_empty(&net_devmem_dmabuf_bindings)) {
> +                     bool mode;
> +
> +                     mode = static_key_enabled(&tcp_devmem_ar_key);
> +
> +                     /* When bindings exist, enforce that the mode does not
> +                      * change.
> +                      */
> +                     if (mode != autorelease) {
> +                             NL_SET_ERR_MSG_FMT(extack,
> +                                                "System already configured with autorelease=%d",
> +                                                mode);
> +                             err = -EINVAL;
> +                             goto err_unlock_xa;
> +                     }
> +             } else {
> +                     /* First binding sets the mode for all subsequent
> +                      * bindings.
> +                      */
> +                     if (autorelease)
> +                             static_branch_enable(&tcp_devmem_ar_key);

[..]

> +                     else
> +                             static_branch_disable(&tcp_devmem_ar_key);

nit: I don't think this is needed? DEFINE_STATIC_KEY_FALSE should already
have the key disabled by default, IIUC.

I was also expecting to see something in net_devmem_unbind_dmabuf to undo
this static_branch_enable(tcp_devmem_ar_key) and bring the system
back into the default state, but I can't find it. Am I missing
something?
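
Something like the below, after the binding is erased from the xarray, is
what I had in mind (completely untested sketch; note that
static_branch_disable() takes the jump_label mutex and can sleep, so it
can't sit under the xa_lock itself and would need to be made safe against
a concurrent first bind some other way):

	if (xa_empty(&net_devmem_dmabuf_bindings) &&
	    static_key_enabled(&tcp_devmem_ar_key))
		static_branch_disable(&tcp_devmem_ar_key);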

The rest looks good to me, thanks! Let's see if Mina sees any problems
with this approach. IMO, it's a bit easier to reason about now, with the
two separate code paths.
