On Wed, Jan 21, 2026 at 08:15:23PM -0800, Mina Almasry wrote:
> On Thu, Jan 15, 2026 at 9:03 PM Bobby Eshleman <[email protected]>
> wrote:
> >
> > From: Bobby Eshleman <[email protected]>
> >
> > Add support for autorelease toggling of tokens using a static branch to
> > control system-wide behavior. This allows applications to choose between
> > two memory management modes:
> >
> > 1. Autorelease on: Leaked tokens are automatically released when the
> > socket closes.
> >
> > 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
> >
> > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> > binding is disallowed and is rejected by netlink. The system will be
> > "locked" into the mode that the first binding is set to. It can only be
> > changed again once there are zero bindings on the system.
> >
> > Disabling autorelease offers ~13% improvement in CPU utilization.
> >
> > Static branching is used to limit the system to one mode or the other.
> >
> > The xa_erase(&net_devmem_dmabuf_bindings, ...) call is moved into
> > __net_devmem_dmabuf_binding_free(...). The result is that it becomes
> > possible to switch the static branch atomically with respect to xarray
> > state. In the time window between unbind and free the socket layer can
> > still find the binding in the xarray, but it will fail to acquire
> > binding->ref (if unbind decremented to zero). This change preserves
> > correct behavior and allows us to avoid more complicated counting
> > schemes for bindings.
> >
> > Signed-off-by: Bobby Eshleman <[email protected]>
> > ---
> > Changes in v10:
> > - add binding->users to track socket and rxq users of binding, defer
> > release of urefs until binding->users hits zero to guard users from
> > incrementing urefs *after* net_devmem_dmabuf_binding_put_urefs()
> > is called. (Mina)
> > - fix failure to restore static key state when xarray alloc fails
> > (Jakub)
> > - add wrappers for setting/unsetting mode that capture the static key +
> > rx binding count logic.
> > - move xa_erase() into __net_devmem_dmabuf_binding_free()
> > - remove net_devmem_rx_bindings_count, change xarray management to
> > avoid the same race that net_devmem_rx_bindings_count had
> > - check return of net_devmem_dmabuf_binding_get() in
> > tcp_recvmsg_dmabuf()
> > - move sk_devmem_info.binding fiddling into autorelease=off static path
> >
> > Changes in v9:
> > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n
> > - Add wrapper around tcp_devmem_ar_key accesses so that it may be
> > stubbed out when NET_DEVMEM=n
> > - only dec rx binding count for rx bindings in free (v8 did not exclude
> > TX bindings)
> >
> > Changes in v8:
> > - Only reset static key when bindings go to zero, defaulting back to
> > disabled (Stan).
> > - Fix bad usage of xarray spinlock for sleepy static branch switching,
> > use mutex instead.
> > - Access pp_ref_count via niov->desc instead of niov directly.
> > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that
> > the static key can not be changed while there are outstanding tokens
> > (free is only called when reference count reaches zero).
> > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active
> > even after xa_erase(), so static key changes must wait until all
> > RX bindings are finally freed (not just when xarray is empty). A
> > counter is a simple way to track this.
> > - socket takes reference on the binding, to avoid use-after-free on
> > sk_devmem_info.binding in the case that user releases all tokens,
> > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token).
> > - removed some comments that were unnecessary
> >
> > Changes in v7:
> > - implement autorelease with static branch (Stan)
> > - use netlink instead of sockopt (Stan)
> > - merge uAPI and implementation patches into one patch (seemed less
> > confusing)
> >
> > Changes in v6:
> > - remove sk_devmem_info.autorelease, using binding->autorelease instead
> > - move binding->autorelease check to outside of
> > net_devmem_dmabuf_binding_put_urefs() (Mina)
> > - remove overly defensive net_is_devmem_iov() (Mina)
> > - add comment about multiple urefs mapping to a single netmem ref (Mina)
> > - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
> > - use niov without casting back and forth with netmem (Mina)
> > - move the autorelease flag from per-binding to per-socket (Mina)
> > - remove the batching logic in sock_devmem_dontneed_manual_release()
> > (Mina)
> > - move autorelease check inside tcp_xa_pool_commit() (Mina)
> > - remove single-binding restriction for autorelease mode (Mina)
> > - unbind always checks for leaked urefs
> >
> > Changes in v5:
> > - remove unused variables
> > - introduce autorelease flag, preparing for a future patch to toggle
> > the new behavior
> >
> > Changes in v3:
> > - make urefs per-binding instead of per-socket, reducing memory
> > footprint
> > - fallback to cleaning up references in dmabuf unbind if socket leaked
> > tokens
> > - drop ethtool patch
> >
> > Changes in v2:
> > - always use GFP_ZERO for binding->vec (Mina)
> > - remove WARN for changed binding (Mina)
> > - remove extraneous binding ref get (Mina)
> > - remove WARNs on invalid user input (Mina)
> > - pre-assign niovs in binding->vec for RX case (Mina)
> > - use atomic_set(, 0) to initialize sk_user_frags.urefs
> > - fix length of alloc for urefs
> > ---
> > Documentation/netlink/specs/netdev.yaml | 12 +++
> > include/net/netmem.h | 1 +
> > include/net/sock.h | 7 +-
> > include/uapi/linux/netdev.h | 1 +
> > net/core/devmem.c | 136 +++++++++++++++++++++++++++-----
> > net/core/devmem.h | 64 ++++++++++++++-
> > net/core/netdev-genl-gen.c | 5 +-
> > net/core/netdev-genl.c | 10 ++-
> > net/core/sock.c | 57 +++++++++++--
> > net/ipv4/tcp.c | 87 ++++++++++++++++----
> > net/ipv4/tcp_ipv4.c | 15 +++-
> > net/ipv4/tcp_minisocks.c | 3 +-
> > tools/include/uapi/linux/netdev.h | 1 +
> > 13 files changed, 345 insertions(+), 54 deletions(-)
> >
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index 596c306ce52b..a5301b150663 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -562,6 +562,17 @@ attribute-sets:
> > type: u32
> > checks:
> > min: 1
> > + -
> > + name: autorelease
> > + doc: |
> > + Token autorelease mode. If true (1), leaked tokens are automatically
> > + released when the socket closes. If false (0), leaked tokens are only
> > + released when the dmabuf is torn down. Once a binding is created with
> > + a specific mode, all subsequent bindings system-wide must use the
> > + same mode.
> > +
> > + Optional. Defaults to false if not specified.
>
> Ooof. Defaults to false if not specified.
I thought we wanted to default to the optimization on?
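FWIW, opting in from userspace should be a one-liner with the generated
YNL C API. A minimal sketch (the _set_autorelease() setter is my
assumption of what ynl-gen would produce from the spec above; the other
calls are the existing generated ones in netdev-user.h):

	struct netdev_bind_rx_req *req;
	struct netdev_bind_rx_rsp *rsp;

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, ifindex);
	netdev_bind_rx_req_set_fd(req, dmabuf_fd);
	/* hypothetical setter generated from the new attribute */
	netdev_bind_rx_req_set_autorelease(req, 0); /* 0 = release at unbind */
	rsp = netdev_bind_rx(ys, req);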
>
> My anxiety with this patch is that running an actual ML training job
> involves many layers of middleware where we may not be able to enforce
> the "must don'tneeding before closing the socket" requirement. I'm
> curious if you have ML jobs or NCCL tests running on this and passing?
I can run NCCL on it, but haven't on this rev yet.
>
> > + type: u8
> >
> > operations:
> > list:
> > @@ -769,6 +780,7 @@ operations:
> > - ifindex
> > - fd
> > - queues
> > + - autorelease
> > reply:
> > attributes:
> > - id
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index 9e10f4ac50c3..80d2263ba4ed 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -112,6 +112,7 @@ struct net_iov {
> > };
> > struct net_iov_area *owner;
> > enum net_iov_type type;
> > + atomic_t uref;
> > };
> >
> > struct net_iov_area {
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index aafe8bdb2c0f..9d3d5bde15e9 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -352,7 +352,7 @@ struct sk_filter;
> > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
> > * @sk_scm_unused: unused flags for scm_recv()
> > * @ns_tracker: tracker for netns reference
> > - * @sk_user_frags: xarray of pages the user is holding a reference on.
> > + * @sk_devmem_info: the devmem binding information for the socket
> > * @sk_owner: reference to the real owner of the socket that calls
> > * sock_lock_init_class_and_name().
> > */
> > @@ -584,7 +584,10 @@ struct sock {
> > struct numa_drop_counters *sk_drop_counters;
> > struct rcu_head sk_rcu;
> > netns_tracker ns_tracker;
> > - struct xarray sk_user_frags;
> > + struct {
> > + struct xarray frags;
> > + struct net_devmem_dmabuf_binding *binding;
> > + } sk_devmem_info;
> >
> > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> > struct module *sk_owner;
> > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> > index e0b579a1df4f..1e5c209cb998 100644
> > --- a/include/uapi/linux/netdev.h
> > +++ b/include/uapi/linux/netdev.h
> > @@ -207,6 +207,7 @@ enum {
> > NETDEV_A_DMABUF_QUEUES,
> > NETDEV_A_DMABUF_FD,
> > NETDEV_A_DMABUF_ID,
> > + NETDEV_A_DMABUF_AUTORELEASE,
> >
> > __NETDEV_A_DMABUF_MAX,
> > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 9dee697a28ee..1264d8ee40e3 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -11,6 +11,7 @@
> > #include <linux/genalloc.h>
> > #include <linux/mm.h>
> > #include <linux/netdevice.h>
> > +#include <linux/skbuff_ref.h>
> > #include <linux/types.h>
> > #include <net/netdev_queues.h>
> > #include <net/netdev_rx_queue.h>
> > @@ -27,6 +28,9 @@
> > /* Device memory support */
> >
> > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
> > +static DEFINE_MUTEX(devmem_ar_lock);
> > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> > +EXPORT_SYMBOL(tcp_devmem_ar_key);
> >
> > static const struct memory_provider_ops dmabuf_devmem_ops;
> >
> > @@ -63,12 +67,71 @@ static void net_devmem_dmabuf_binding_release(struct percpu_ref *ref)
> > schedule_work(&binding->unbind_w);
> > }
> >
> > +static bool net_devmem_has_rx_bindings(void)
> > +{
> > + struct net_devmem_dmabuf_binding *binding;
> > + unsigned long index;
> > +
> > + lockdep_assert_held(&devmem_ar_lock);
> > +
> > + xa_for_each(&net_devmem_dmabuf_bindings, index, binding) {
> > + if (binding->direction == DMA_FROM_DEVICE)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* caller must hold devmem_ar_lock */
> > +static int
> > +__net_devmem_dmabuf_binding_set_mode(enum dma_data_direction direction,
> > + bool autorelease)
> > +{
> > + lockdep_assert_held(&devmem_ar_lock);
> > +
> > + if (direction != DMA_FROM_DEVICE)
> > + return 0;
> > +
> > + if (net_devmem_has_rx_bindings() &&
> > + static_key_enabled(&tcp_devmem_ar_key) != autorelease)
> > + return -EBUSY;
> > +
> > + if (autorelease)
> > + static_branch_enable(&tcp_devmem_ar_key);
> > +
> > + return 0;
> > +}
> > +
> > +/* caller must hold devmem_ar_lock */
> > +static void
> > +__net_devmem_dmabuf_binding_unset_mode(enum dma_data_direction direction)
> > +{
> > + lockdep_assert_held(&devmem_ar_lock);
> > +
> > + if (direction != DMA_FROM_DEVICE)
> > + return;
> > +
> > + if (net_devmem_has_rx_bindings())
> > + return;
> > +
> > + static_branch_disable(&tcp_devmem_ar_key);
> > +}
> > +
> > void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
> > {
> > struct net_devmem_dmabuf_binding *binding = container_of(wq,
> > typeof(*binding), unbind_w);
> >
> > size_t size, avail;
> >
> > + mutex_lock(&devmem_ar_lock);
> > + xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > + __net_devmem_dmabuf_binding_unset_mode(binding->direction);
> > + mutex_unlock(&devmem_ar_lock);
> > +
> > + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > + * erase.
> > + */
> > + synchronize_net();
> > +
>
> I'm sorry I could not wrap my head around moving this block of code to
> the _free(), even though you mention it in the commit message. Why does
> removing the binding from net_devmem_dmabuf_bindings on unbind not
> work for you? If the binding is not active on any rx queues then there
> cannot be new data received on it. Unless I'm missing something it
> should be fine to leave this where it is.
Hmmm... actually now that recvmsg gets a binding->ref, the race I was
trying to avoid can't happen (that is, dontneed using the wrong static
branch after unbind but before free). Thanks for catching that.
>
>
> > gen_pool_for_each_chunk(binding->chunk_pool,
> > net_devmem_dmabuf_free_chunk_owner, NULL);
> >
> > @@ -126,19 +189,30 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
> > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
> > }
> >
> > +void
> > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> > + struct net_iov *niov;
> > + netmem_ref netmem;
> > +
> > + niov = binding->vec[i];
> > + netmem = net_iov_to_netmem(niov);
> > +
> > + /* Multiple urefs map to only a single netmem ref. */
> > + if (atomic_xchg(&niov->uref, 0) > 0)
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > +}
> > +
> > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > {
> > struct netdev_rx_queue *rxq;
> > unsigned long xa_idx;
> > unsigned int rxq_idx;
> >
> > - xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > -
> > - /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > - * erase.
> > - */
> > - synchronize_net();
> > -
> > if (binding->list.next)
> > list_del(&binding->list);
> >
> > @@ -151,6 +225,8 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > rxq_idx = get_netdev_rx_queue_index(rxq);
> >
> > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
> > +
> > + net_devmem_dmabuf_binding_user_put(binding);
> > }
> >
> > percpu_ref_kill(&binding->ref);
> > @@ -178,6 +254,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > if (err)
> > goto err_close_rxq;
> >
> > + atomic_inc(&binding->users);
> > +
>
> nit: feels wrong that we have _binding_user_put() but open code the
> get(). Either open code both or helper both, maybe?
I had the same thought, but the issue is that for the first queue
atomic_inc_not_zero() (the helper) will fail.
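To illustrate with plain atomics (not code from this patch, just the
helper's semantics):

	atomic_t users = ATOMIC_INIT(0);

	atomic_inc_not_zero(&users);	/* returns false, count stays 0 */
	atomic_inc(&users);		/* first user: 0 -> 1 */
	atomic_inc_not_zero(&users);	/* returns true, count is now 2 */

So the 0 -> 1 transition has to be a plain atomic_inc(), and only
later users can go through the _not_zero() helper.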
>
> > return 0;
> >
> > err_close_rxq:
> > @@ -189,8 +267,10 @@ struct net_devmem_dmabuf_binding *
> > net_devmem_bind_dmabuf(struct net_device *dev,
> > struct device *dma_dev,
> > enum dma_data_direction direction,
> > - unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack)
> > + unsigned int dmabuf_fd,
> > + struct netdev_nl_sock *priv,
> > + struct netlink_ext_ack *extack,
> > + bool autorelease)
> > {
> > struct net_devmem_dmabuf_binding *binding;
> > static u32 id_alloc_next;
> > @@ -225,6 +305,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > if (err < 0)
> > goto err_free_binding;
> >
> > + atomic_set(&binding->users, 0);
> > +
> > mutex_init(&binding->lock);
> >
> > binding->dmabuf = dmabuf;
> > @@ -245,14 +327,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > goto err_detach;
> > }
> >
> > - if (direction == DMA_TO_DEVICE) {
> > - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > - sizeof(struct net_iov *),
> > - GFP_KERNEL);
> > - if (!binding->vec) {
> > - err = -ENOMEM;
> > - goto err_unmap;
> > - }
> > + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > + sizeof(struct net_iov *),
> > + GFP_KERNEL | __GFP_ZERO);
> > + if (!binding->vec) {
> > + err = -ENOMEM;
> > + goto err_unmap;
> > }
> >
> > /* For simplicity we expect to make PAGE_SIZE allocations, but the
> > @@ -306,25 +386,41 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > niov = &owner->area.niovs[i];
> > niov->type = NET_IOV_DMABUF;
> > niov->owner = &owner->area;
> > + atomic_set(&niov->uref, 0);
> >
> > page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> > net_devmem_get_dma_addr(niov));
> > - if (direction == DMA_TO_DEVICE)
> > - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > }
> >
> > virtual += len;
> > }
> >
> > + mutex_lock(&devmem_ar_lock);
> > +
> > + err = __net_devmem_dmabuf_binding_set_mode(direction, autorelease);
> > + if (err < 0) {
> > + NL_SET_ERR_MSG_FMT(extack,
> > + "System already configured with
> > autorelease=%d",
> > + static_key_enabled(&tcp_devmem_ar_key));
> > + goto err_unlock_mutex;
> > + }
> > +
>
> Unless I've misread something, this looks very incorrect. TX bindings
> will accidentally set the system to autorelease=false mode, no? You
> need to make sure you set the mode for rx-bindings only, right?
Maybe the context you missed is that
__net_devmem_dmabuf_binding_set_mode() no-ops for TX (direction !=
DMA_FROM_DEVICE)?
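To spell it out, the first thing that function does (quoting the hunk
above) is:

	if (direction != DMA_FROM_DEVICE)
		return 0;

so a TX (DMA_TO_DEVICE) binding returns early and never touches the
static key.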
>
> > err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> > binding, xa_limit_32b, &id_alloc_next,
> > GFP_KERNEL);
> > if (err < 0)
> > - goto err_free_chunks;
> > + goto err_unset_mode;
> > +
> > + mutex_unlock(&devmem_ar_lock);
> >
> > list_add(&binding->list, &priv->bindings);
> >
> > return binding;
> >
> > +err_unset_mode:
> > + __net_devmem_dmabuf_binding_unset_mode(direction);
> > +err_unlock_mutex:
> > + mutex_unlock(&devmem_ar_lock);
> > err_free_chunks:
> > gen_pool_for_each_chunk(binding->chunk_pool,
> > net_devmem_dmabuf_free_chunk_owner, NULL);
> > diff --git a/net/core/devmem.h b/net/core/devmem.h
> > index 94874b323520..284f0ad5f381 100644
> > --- a/net/core/devmem.h
> > +++ b/net/core/devmem.h
> > @@ -12,9 +12,13 @@
> >
> > #include <net/netmem.h>
> > #include <net/netdev_netlink.h>
> > +#include <linux/jump_label.h>
> >
> > struct netlink_ext_ack;
> >
> > +/* static key for TCP devmem autorelease */
> > +extern struct static_key_false tcp_devmem_ar_key;
> > +
> > struct net_devmem_dmabuf_binding {
> > struct dma_buf *dmabuf;
> > struct dma_buf_attachment *attachment;
> > @@ -43,6 +47,12 @@ struct net_devmem_dmabuf_binding {
> > */
> > struct percpu_ref ref;
> >
> > + /* Counts sockets and rxqs that are using the binding. When this
> > + * reaches zero, all urefs are drained and new sockets cannot join the
> > + * binding.
> > + */
> > + atomic_t users;
> > +
> > /* The list of bindings currently active. Used for netlink to notify us
> > * of the user dropping the bind.
> > */
> > @@ -61,7 +71,7 @@ struct net_devmem_dmabuf_binding {
> >
> > /* Array of net_iov pointers for this binding, sorted by virtual
> > * address. This array is convenient to map the virtual addresses to
> > - * net_iovs in the TX path.
> > + * net_iovs.
> > */
> > struct net_iov **vec;
> >
> > @@ -88,7 +98,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > struct device *dma_dev,
> > enum dma_data_direction direction,
> > unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack);
> > + struct netlink_ext_ack *extack, bool autorelease);
> > struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
> > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
> > int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > @@ -134,6 +144,26 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> > percpu_ref_put(&binding->ref);
> > }
> >
> > +void net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding);
> > +
> > +static inline bool
> > +net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return atomic_inc_not_zero(&binding->users);
> > +}
> > +
> > +static inline void
> > +net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + if (atomic_dec_and_test(&binding->users))
> > + net_devmem_dmabuf_binding_put_urefs(binding);
> > +}
> > +
> > +static inline bool net_devmem_autorelease_enabled(void)
> > +{
> > + return static_branch_unlikely(&tcp_devmem_ar_key);
> > +}
> > +
> > void net_devmem_get_net_iov(struct net_iov *niov);
> > void net_devmem_put_net_iov(struct net_iov *niov);
> >
> > @@ -151,11 +181,38 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
> > #else
> > struct net_devmem_dmabuf_binding;
> >
> > +static inline bool
> > +net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return false;
> > +}
> > +
> > static inline void
> > net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> > {
> > }
> >
> > +static inline void
> > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> > +{
> > +}
> > +
> > +static inline bool
> > +net_devmem_dmabuf_binding_user_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void
> > +net_devmem_dmabuf_binding_user_put(struct net_devmem_dmabuf_binding *binding)
> > +{
> > +}
> > +
> > +static inline bool net_devmem_autorelease_enabled(void)
> > +{
> > + return false;
> > +}
> > +
> > static inline void net_devmem_get_net_iov(struct net_iov *niov)
> > {
> > }
> > @@ -170,7 +227,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > enum dma_data_direction direction,
> > unsigned int dmabuf_fd,
> > struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack)
> > + struct netlink_ext_ack *extack,
> > + bool autorelease)
> > {
> > return ERR_PTR(-EOPNOTSUPP);
> > }
> > diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> > index ba673e81716f..01b7765e11ec 100644
> > --- a/net/core/netdev-genl-gen.c
> > +++ b/net/core/netdev-genl-gen.c
> > @@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
> > };
> >
> > /* NETDEV_CMD_BIND_RX - do */
> > -static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
> > +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = {
> > [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> > [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
> > [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
> > + [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, },
>
> This patch in the series is proving complicated to review. Any changes
> you can refactor out of it for ease of review would be very welcome.
> Things like the variable renames, netlink api changes, and any
> refactors that can be pulled into their own patches would ease review
> of the core functionality.
>
> > };
> >
> > /* NETDEV_CMD_NAPI_SET - do */
> > @@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
> > .cmd = NETDEV_CMD_BIND_RX,
> > .doit = netdev_nl_bind_rx_doit,
> > .policy = netdev_bind_rx_nl_policy,
> > - .maxattr = NETDEV_A_DMABUF_FD,
> > + .maxattr = NETDEV_A_DMABUF_AUTORELEASE,
> > .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> > },
> > {
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index 470fabbeacd9..c742bb34865e 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > struct netdev_nl_sock *priv;
> > struct net_device *netdev;
> > unsigned long *rxq_bitmap;
> > + bool autorelease = false;
> > struct device *dma_dev;
> > struct sk_buff *rsp;
> > int err = 0;
> > @@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> > dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
> >
> > + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
> > + autorelease = !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
> > +
> > priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> > if (IS_ERR(priv))
> > return PTR_ERR(priv);
> > @@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > }
> >
> > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
> > - dmabuf_fd, priv, info->extack);
> > + dmabuf_fd, priv, info->extack,
> > + autorelease);
> > if (IS_ERR(binding)) {
> > err = PTR_ERR(binding);
> > goto err_rxq_bitmap;
> > @@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
> >
> > dma_dev = netdev_queue_get_dma_dev(netdev, 0);
> > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
> > - dmabuf_fd, priv, info->extack);
> > + dmabuf_fd, priv, info->extack, false);
> > if (IS_ERR(binding)) {
> > err = PTR_ERR(binding);
> > goto err_unlock_netdev;
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index f6526f43aa6e..6355c2ccfb8a 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -87,6 +87,7 @@
> >
> > #include <linux/unaligned.h>
> > #include <linux/capability.h>
> > +#include <linux/dma-buf.h>
> > #include <linux/errno.h>
> > #include <linux/errqueue.h>
> > #include <linux/types.h>
> > @@ -151,6 +152,7 @@
> > #include <uapi/linux/pidfd.h>
> >
> > #include "dev.h"
> > +#include "devmem.h"
> >
> > static DEFINE_MUTEX(proto_list_mutex);
> > static LIST_HEAD(proto_list);
> > @@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > #define MAX_DONTNEED_TOKENS 128
> > #define MAX_DONTNEED_FRAGS 1024
> >
> > +static noinline_for_stack int
> > +sock_devmem_dontneed_manual_release(struct sock *sk,
> > + struct dmabuf_token *tokens,
> > + unsigned int num_tokens)
> > +{
> > + struct net_iov *niov;
> > + unsigned int i, j;
> > + netmem_ref netmem;
> > + unsigned int token;
> > + int num_frags = 0;
> > + int ret = 0;
> > +
> > + if (!sk->sk_devmem_info.binding)
> > + return -EINVAL;
> > +
> > + for (i = 0; i < num_tokens; i++) {
> > + for (j = 0; j < tokens[i].token_count; j++) {
> > + size_t size = sk->sk_devmem_info.binding->dmabuf->size;
> > +
> > + token = tokens[i].token_start + j;
> > + if (token >= size / PAGE_SIZE)
> > + break;
> > +
> > + if (++num_frags > MAX_DONTNEED_FRAGS)
> > + return ret;
> > +
> > + niov = sk->sk_devmem_info.binding->vec[token];
> > + if (atomic_dec_and_test(&niov->uref)) {
> > + netmem = net_iov_to_netmem(niov);
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > + ret++;
> > + }
> > + }
> > +
> > + return ret;
> > +}
> > +
> > static noinline_for_stack int
> > sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > unsigned int num_tokens)
> > @@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > int ret = 0, num_frags = 0;
> > netmem_ref netmems[16];
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > for (i = 0; i < num_tokens; i++) {
> > for (j = 0; j < tokens[i].token_count; j++) {
> > if (++num_frags > MAX_DONTNEED_FRAGS)
> > goto frag_limit_reached;
> >
> > netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > - &sk->sk_user_frags, tokens[i].token_start + j);
> > + &sk->sk_devmem_info.frags,
> > + tokens[i].token_start + j);
> >
> > if (!netmem ||
> > WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > continue;
> >
> > netmems[netmem_num++] = netmem;
> > if (netmem_num == ARRAY_SIZE(netmems)) {
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > netmem_num = 0;
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > }
> > ret++;
> > }
> > }
> >
> > frag_limit_reached:
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> >
> > @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > return -EFAULT;
> > }
> >
> > - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> > + if (net_devmem_autorelease_enabled())
> > + ret = sock_devmem_dontneed_autorelease(sk, tokens,
> > + num_tokens);
> > + else
> > + ret = sock_devmem_dontneed_manual_release(sk, tokens,
> > + num_tokens);
> >
> > kvfree(tokens);
> > return ret;
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index d5319ebe2452..73a577bd8765 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -260,6 +260,7 @@
> > #include <linux/memblock.h>
> > #include <linux/highmem.h>
> > #include <linux/cache.h>
> > +#include <linux/dma-buf.h>
> > #include <linux/err.h>
> > #include <linux/time.h>
> > #include <linux/slab.h>
> > @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk)
> >
> > set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> > sk_sockets_allocated_inc(sk);
> > - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
> > + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> > + sk->sk_devmem_info.binding = NULL;
> > }
> > EXPORT_IPV6_MOD(tcp_init_sock);
> >
> > @@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> >
> > /* Commit part that has been copied to user space. */
> > for (i = 0; i < p->idx; i++)
> > - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
> > - (__force void *)p->netmems[i], GFP_KERNEL);
> > + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i],
> > + XA_ZERO_ENTRY, (__force void *)p->netmems[i],
> > + GFP_KERNEL);
> > /* Rollback what has been pre-allocated and is no longer needed. */
> > for (; i < p->max; i++)
> > - __xa_erase(&sk->sk_user_frags, p->tokens[i]);
> > + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
> >
> > p->max = 0;
> > p->idx = 0;
> > @@ -2436,14 +2439,17 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> >
> > static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
> > {
> > + if (!net_devmem_autorelease_enabled())
> > + return;
> > +
> > if (!p->max)
> > return;
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> >
> > tcp_xa_pool_commit_locked(sk, p);
> >
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > }
> >
> > static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > @@ -2454,24 +2460,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > if (p->idx < p->max)
> > return 0;
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> >
> > tcp_xa_pool_commit_locked(sk, p);
> >
> > for (k = 0; k < max_frags; k++) {
> > - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> > + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k],
> > XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> > if (err)
> > break;
> > }
> >
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> >
> > p->max = k;
> > p->idx = 0;
> > return k ? 0 : err;
> > }
> >
> > +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool,
> > + skb_frag_t *frag)
> > +{
> > + struct net_iov *niov;
> > +
> > + niov = skb_frag_net_iov(frag);
> > +
> > + if (net_devmem_autorelease_enabled()) {
> > + atomic_long_inc(&niov->desc.pp_ref_count);
> > + tcp_xa_pool->netmems[tcp_xa_pool->idx++] =
> > + skb_frag_netmem(frag);
> > + } else {
> > + if (atomic_inc_return(&niov->uref) == 1)
> > + atomic_long_inc(&niov->desc.pp_ref_count);
> > + }
> > +}
> > +
> > /* On error, returns the -errno. On success, returns number of bytes sent to the
> > * user. May not consume all of @remaining_len.
> > */
> > @@ -2533,6 +2556,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > * sequence of cmsg
> > */
> > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> > + struct net_devmem_dmabuf_binding *binding = NULL;
> > skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > struct net_iov *niov;
> > u64 frag_offset;
> > @@ -2568,13 +2592,45 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > start;
> > dmabuf_cmsg.frag_offset = frag_offset;
> > dmabuf_cmsg.frag_size = copy;
> > - err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> > - skb_shinfo(skb)->nr_frags - i);
> > - if (err)
> > - goto out;
> > +
> > + binding = net_devmem_iov_binding(niov);
> > +
> > + if (net_devmem_autorelease_enabled()) {
> > + err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> > + skb_shinfo(skb)->nr_frags - i);
> > + if (err)
> > + goto out;
> > +
> > + dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> > + } else {
> > + if (!sk->sk_devmem_info.binding) {
> > + if (!net_devmem_dmabuf_binding_user_get(binding)) {
> > + err = -ENODEV;
> > + goto out;
> > + }
> > +
> > + if (!net_devmem_dmabuf_binding_get(binding)) {
> > + net_devmem_dmabuf_binding_user_put(binding);
> > + err = -ENODEV;
> > + goto out;
> > + }
> > +
> > + sk->sk_devmem_info.binding = binding;
> > + }
> > +
> > + if (sk->sk_devmem_info.binding != binding) {
> > + err = -EFAULT;
> > + goto out;
> > + }
> > +
> > + dmabuf_cmsg.frag_token =
> > + net_iov_virtual_addr(niov) >> PAGE_SHIFT;
> > + }
> > +
> >
> > /* Will perform the exchange later */
> > - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> > dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
> >
> > offset += copy;
> > @@ -2587,8 +2643,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > if (err)
> > goto out;
> >
> > - atomic_long_inc(&niov->desc.pp_ref_count);
> > - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
> > + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag);
> >
> > sent += copy;
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index f8a9596e8f4d..420e8c8ebf6d 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -89,6 +89,9 @@
> >
> > #include <crypto/md5.h>
> >
> > +#include <linux/dma-buf.h>
> > +#include "../core/devmem.h"
> > +
> > #include <trace/events/tcp.h>
> >
> > #ifdef CONFIG_TCP_MD5SIG
> > @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk)
> > unsigned long index;
> > void *netmem;
> >
> > - xa_for_each(&sk->sk_user_frags, index, netmem)
> > + xa_for_each(&sk->sk_devmem_info.frags, index, netmem)
> > WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
> > #endif
> > }
> > @@ -2503,7 +2506,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
> >
> > tcp_release_user_frags(sk);
> >
> > - xa_destroy(&sk->sk_user_frags);
> > + if (!net_devmem_autorelease_enabled() && sk->sk_devmem_info.binding) {
> > + net_devmem_dmabuf_binding_user_put(sk->sk_devmem_info.binding);
> > + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
> > + sk->sk_devmem_info.binding = NULL;
> > + WARN_ONCE(!xa_empty(&sk->sk_devmem_info.frags),
> > + "non-empty xarray discovered in autorelease off mode");
> > + }
> > +
> > + xa_destroy(&sk->sk_devmem_info.frags);
> >
> > trace_tcp_destroy_sock(sk);
> >
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index bd5462154f97..2aec977f5c12 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> >
> > __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> >
> > - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
> > + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> > + newsk->sk_devmem_info.binding = NULL;
> >
> > return newsk;
> > }
> > diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> > index e0b579a1df4f..1e5c209cb998 100644
> > --- a/tools/include/uapi/linux/netdev.h
> > +++ b/tools/include/uapi/linux/netdev.h
> > @@ -207,6 +207,7 @@ enum {
> > NETDEV_A_DMABUF_QUEUES,
> > NETDEV_A_DMABUF_FD,
> > NETDEV_A_DMABUF_ID,
> > + NETDEV_A_DMABUF_AUTORELEASE,
> >
> > __NETDEV_A_DMABUF_MAX,
> > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> >
> > --
> > 2.47.3
> >
>
>
> --
> Thanks,
> Mina