On 11/19, Bobby Eshleman wrote:
> From: Bobby Eshleman <[email protected]>
> 
> Add support for autorelease toggling of tokens using a static branch to
> control system-wide behavior. This allows applications to choose between
> two memory management modes:
> 
> 1. Autorelease on: Leaked tokens are automatically released when the
>    socket closes.
> 
> 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
> 
> The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> binding is disallowed and is rejected by netlink. The system will be
> "locked" into the mode that the first binding is set to. It can only be
> changed again once there are zero bindings on the system.
> 
> Disabling autorelease offers a ~13% improvement in CPU utilization.
> 
> Static branching is used to limit the system to one mode or the other.
> 
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> Changes in v7:
> - implement autorelease with static branch (Stan)
> - use netlink instead of sockopt (Stan)
> - merge uAPI and implementation patches into one patch (seemed less
>   confusing)
> 
> Changes in v6:
> - remove sk_devmem_info.autorelease, using binding->autorelease instead
> - move binding->autorelease check to outside of
>   net_devmem_dmabuf_binding_put_urefs() (Mina)
> - remove overly defensive net_is_devmem_iov() (Mina)
> - add comment about multiple urefs mapping to a single netmem ref (Mina)
> - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
> - use niov without casting back and forth with netmem (Mina)
> - move the autorelease flag from per-binding to per-socket (Mina)
> - remove the batching logic in sock_devmem_dontneed_manual_release()
>   (Mina)
> - move autorelease check inside tcp_xa_pool_commit() (Mina)
> - remove single-binding restriction for autorelease mode (Mina)
> - unbind always checks for leaked urefs
> 
> Changes in v5:
> - remove unused variables
> - introduce autorelease flag, preparing for a future patch to toggle the
>   new behavior
> 
> Changes in v3:
> - make urefs per-binding instead of per-socket, reducing memory
>   footprint
> - fallback to cleaning up references in dmabuf unbind if socket leaked
>   tokens
> - drop ethtool patch
> 
> Changes in v2:
> - always use GFP_ZERO for binding->vec (Mina)
> - remove WARN for changed binding (Mina)
> - remove extraneous binding ref get (Mina)
> - remove WARNs on invalid user input (Mina)
> - pre-assign niovs in binding->vec for RX case (Mina)
> - use atomic_set(, 0) to initialize sk_user_frags.urefs
> - fix length of alloc for urefs
> ---
>  Documentation/netlink/specs/netdev.yaml |  12 ++++
>  include/net/netmem.h                    |   1 +
>  include/net/sock.h                      |   7 +-
>  include/uapi/linux/netdev.h             |   1 +
>  net/core/devmem.c                       | 109 +++++++++++++++++++++++++++-----
>  net/core/devmem.h                       |  11 +++-
>  net/core/netdev-genl-gen.c              |   5 +-
>  net/core/netdev-genl.c                  |  13 +++-
>  net/core/sock.c                         |  57 +++++++++++++++--
>  net/ipv4/tcp.c                          |  78 ++++++++++++++++++-----
>  net/ipv4/tcp_ipv4.c                     |  13 +++-
>  net/ipv4/tcp_minisocks.c                |   3 +-
>  tools/include/uapi/linux/netdev.h       |   1 +
>  13 files changed, 262 insertions(+), 49 deletions(-)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 82bf5cb2617d..913fccca4c4e 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -562,6 +562,17 @@ attribute-sets:
>          type: u32
>          checks:
>            min: 1
> +      -
> +        name: autorelease
> +        doc: |
> +          Token autorelease mode. If true (1), leaked tokens are automatically
> +          released when the socket closes. If false (0), leaked tokens are only
> +          released when the dmabuf is unbound. Once a binding is created with a
> +          specific mode, all subsequent bindings system-wide must use the same
> +          mode.
> +
> +          Optional. Defaults to false if not specified.
> +        type: u8
> 
>  operations:
>    list:
> @@ -767,6 +778,7 @@ operations:
>            - ifindex
>            - fd
>            - queues
> +          - autorelease
>        reply:
>          attributes:
>            - id
> diff --git a/include/net/netmem.h b/include/net/netmem.h
> index 9e10f4ac50c3..80d2263ba4ed 100644
> --- a/include/net/netmem.h
> +++ b/include/net/netmem.h
> @@ -112,6 +112,7 @@ struct net_iov {
>  	};
>  	struct net_iov_area *owner;
>  	enum net_iov_type type;
> +	atomic_t uref;
>  };
>  
>  struct net_iov_area {
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a5f36ea9d46f..797b21148945 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -350,7 +350,7 @@ struct sk_filter;
>   *	@sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
>   *	@sk_scm_unused: unused flags for scm_recv()
>   *	@ns_tracker: tracker for netns reference
> - *	@sk_user_frags: xarray of pages the user is holding a reference on.
> + *	@sk_devmem_info: the devmem binding information for the socket
>   *	@sk_owner: reference to the real owner of the socket that calls
>   *	           sock_lock_init_class_and_name().
>   */
> @@ -579,7 +579,10 @@ struct sock {
>  	struct numa_drop_counters	*sk_drop_counters;
>  	struct rcu_head			sk_rcu;
>  	netns_tracker			ns_tracker;
> -	struct xarray			sk_user_frags;
> +	struct {
> +		struct xarray frags;
> +		struct net_devmem_dmabuf_binding *binding;
> +	} sk_devmem_info;
>  
>  #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
>  	struct module			*sk_owner;
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 048c8de1a130..dff0be8223a4 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -206,6 +206,7 @@ enum {
>  	NETDEV_A_DMABUF_QUEUES,
>  	NETDEV_A_DMABUF_FD,
>  	NETDEV_A_DMABUF_ID,
> +	NETDEV_A_DMABUF_AUTORELEASE,
>  
>  	__NETDEV_A_DMABUF_MAX,
>  	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 4dee2666dd07..bba21c6cb195 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -11,6 +11,7 @@
>  #include <linux/genalloc.h>
>  #include <linux/mm.h>
>  #include <linux/netdevice.h>
> +#include <linux/skbuff_ref.h>
>  #include <linux/types.h>
>  #include <net/netdev_queues.h>
>  #include <net/netdev_rx_queue.h>
> @@ -28,6 +29,17 @@
>  
>  static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
>  
> +/* Static key to lock down autorelease to a single mode on a system. When
> + * enabled: autorelease mode (leaked tokens released on socket close). When
> + * disabled: manual mode (leaked tokens released on dmabuf unbind). Once the
> + * first binding is created, the mode is locked system-wide and can not be
> + * changed until the system has zero bindings again.
> + *
> + * Protected by xa_lock of net_devmem_dmabuf_bindings.
> + */
> +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> +EXPORT_SYMBOL(tcp_devmem_ar_key);
> +
>  static const struct memory_provider_ops dmabuf_devmem_ops;
>  
>  bool net_is_devmem_iov(struct net_iov *niov)
> @@ -116,6 +128,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
>  	gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
>  }
>  
> +static void
> +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> +{
> +	int i;
> +
> +	for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> +		struct net_iov *niov;
> +		netmem_ref netmem;
> +
> +		niov = binding->vec[i];
> +		netmem = net_iov_to_netmem(niov);
> +
> +		/* Multiple urefs map to only a single netmem ref. */
> +		if (atomic_xchg(&niov->uref, 0) > 0)
> +			WARN_ON_ONCE(!napi_pp_put_page(netmem));
> +	}
> +}
> +
>  void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>  {
>  	struct netdev_rx_queue *rxq;
> @@ -143,6 +173,10 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>  		__net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
>  	}
>  
> +	/* Clean up any lingering urefs from sockets that had autorelease
> +	 * disabled.
> +	 */
> +	net_devmem_dmabuf_binding_put_urefs(binding);
>  	net_devmem_dmabuf_binding_put(binding);
>  }
>  
> @@ -179,8 +213,10 @@ struct net_devmem_dmabuf_binding *
>  net_devmem_bind_dmabuf(struct net_device *dev,
>  		       struct device *dma_dev,
>  		       enum dma_data_direction direction,
> -		       unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> -		       struct netlink_ext_ack *extack)
> +		       unsigned int dmabuf_fd,
> +		       struct netdev_nl_sock *priv,
> +		       struct netlink_ext_ack *extack,
> +		       bool autorelease)
>  {
>  	struct net_devmem_dmabuf_binding *binding;
>  	static u32 id_alloc_next;
> @@ -231,14 +267,13 @@ net_devmem_bind_dmabuf(struct net_device *dev,
>  		goto err_detach;
>  	}
>  
> -	if (direction == DMA_TO_DEVICE) {
> -		binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> -					      sizeof(struct net_iov *),
> -					      GFP_KERNEL);
> -		if (!binding->vec) {
> -			err = -ENOMEM;
> -			goto err_unmap;
> -		}
> +	/* Used by TX and also by RX when socket has autorelease disabled */
> +	binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> +				      sizeof(struct net_iov *),
> +				      GFP_KERNEL | __GFP_ZERO);
> +	if (!binding->vec) {
> +		err = -ENOMEM;
> +		goto err_unmap;
>  	}
>  
>  	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> @@ -292,25 +327,67 @@ net_devmem_bind_dmabuf(struct net_device *dev,
>  			niov = &owner->area.niovs[i];
>  			niov->type = NET_IOV_DMABUF;
>  			niov->owner = &owner->area;
> +			atomic_set(&niov->uref, 0);
>  			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
>  						      net_devmem_get_dma_addr(niov));
> -			if (direction == DMA_TO_DEVICE)
> -				binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> +			binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
>  		}
>  
>  		virtual += len;
>  	}
>  
> -	err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> -			      binding, xa_limit_32b, &id_alloc_next,
> -			      GFP_KERNEL);
> +	/* Enforce system-wide autorelease mode consistency for RX bindings.
> +	 * TX bindings don't use autorelease (always false) since tokens aren't
> +	 * leaked in the TX path. Only RX bindings must all have the same
> +	 * autorelease mode, never mixed.
> +	 *
> +	 * We use the xarray's lock to atomically check xa_empty() and toggle
> +	 * the static key, avoiding the race where two new bindings may try to
> +	 * set the static key to different states.
> +	 */
> +	xa_lock(&net_devmem_dmabuf_bindings);
> +
> +	if (direction == DMA_FROM_DEVICE) {
> +		if (!xa_empty(&net_devmem_dmabuf_bindings)) {
> +			bool mode;
> +
> +			mode = static_key_enabled(&tcp_devmem_ar_key);
> +
> +			/* When bindings exist, enforce that the mode does not
> +			 * change.
> +			 */
> +			if (mode != autorelease) {
> +				NL_SET_ERR_MSG_FMT(extack,
> +						   "System already configured with autorelease=%d",
> +						   mode);
> +				err = -EINVAL;
> +				goto err_unlock_xa;
> +			}
> +		} else {
> +			/* First binding sets the mode for all subsequent
> +			 * bindings.
> +			 */
> +			if (autorelease)
> +				static_branch_enable(&tcp_devmem_ar_key);
[..]

> +			else
> +				static_branch_disable(&tcp_devmem_ar_key);

nit: I don't think this is needed? DEFINE_STATIC_KEY_FALSE should already
have the key disabled by default, iiuc.

I was also expecting to see something in net_devmem_unbind_dmabuf() to undo
this static_branch_enable(&tcp_devmem_ar_key) and bring the system back into
the default state, but I can't find it. Am I missing something?

The rest looks good to me, thanks! Let's see if Mina sees any problems with
this approach. IMO, it's a bit easier to reason about the two separate code
paths now.
