mlx5: qos: Remove qos domains and use shd lock

Cosmin Ratiu Tue, 31 Mar 2026 05:55:45 -0700

On Mon, 2026-03-30 at 19:08 -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
>     net/mlx5: qos: Remove qos domains and use shd lock
> 
> This commit removes QoS domains and switches to using the shd lock to
> protect against concurrent QoS modifications. It also enables the
> supported_cross_device_rate_nodes devlink ops attribute.
> 
> Link:
> https://lore.kernel.org/netdev/[email protected]/
> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > index 6698ac55a4bfa..c051605fecd23 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > @@ -385,6 +385,7 @@ static const struct devlink_ops
> > mlx5_devlink_ops = {
> >     .rate_node_del = mlx5_esw_devlink_rate_node_del,
> >     .rate_leaf_parent_set =
> > mlx5_esw_devlink_rate_leaf_parent_set,
> >     .rate_node_parent_set =
> > mlx5_esw_devlink_rate_node_parent_set,
> > +   .supported_cross_device_rate_nodes = true,
> 
> The commit message states that enabling
> supported_cross_device_rate_nodes
> now is safe because mlx5_esw_qos_vport_update_parent rejects cross-
> esw
> parent updates.
> 
> When a user provides a DEVLINK_ATTR_PARENT_DEV belonging to an
> unrelated
> device, the devlink core (devlink_nl_rate_parent_node_set) drops the
> lock on
> the parent devlink and calls devlink_rate_node_get_by_name(). This
> function
> traverses the rate_list of the unlocked parent devlink instance.
> 
> Does this leave the devlink core exposed to data races and potential
> use-after-free vulnerabilities, since the driver's rejection callback
> executes only after this lockless list traversal has occurred?


I will address this by adding common ancestor validation in devlink
rate core, which was missing.

> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > index 0be516003bcde..f67f99428959f 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > @@ -1110,7 +1068,7 @@ void mlx5_esw_qos_vport_disable(struct
> > mlx5_vport *vport)
> >     struct mlx5_esw_sched_node *parent;
> >  
> >     lockdep_assert_held(&esw->state_lock);
> > -   esw_qos_lock(esw);
> > +   devl_lock(esw->dev->shd);
> 
> Could esw->dev->shd be NULL here?
> 
> In mlx5_shd_init(), initialization is explicitly skipped for non-PF
> devices,
> meaning Sub-Functions (SFs) acting as E-Switch managers will have a
> NULL
> shd pointer. If the device lacks VPD data, pci_vpd_alloc() returns -
> ENODEV,
> causing mlx5_shd_init() to return 0 but leaving dev->shd as NULL.
> 
> When QoS operations are attempted on these devices, calling
> devl_lock()
> dereferences the pointer, which could cause an immediate NULL pointer
> dereference and kernel panic.

Right. This is a race with Jiri's fix ([1]), which changed the
assumption this code was architected with that dev->shd is always
initialized. That is no longer the case, and the consequence is that
this needs to 1) not unconditionally use dev->shd and 2) use a fallback
lock when it isn't available.

> 
> Additionally, this acquires the devlink lock while already holding
> esw->state_lock, establishing the lock order:
> state_lock -> devlink->lock.
> 
> Conversely, the devlink core invokes driver callbacks with the
> devlink
> instance lock held, and the driver subsequently acquires esw-
> >state_lock
> inside these callbacks, creating the reverse order:
> devlink->lock -> state_lock.
> 
> Does this introduce a circular locking dependency splat between
> state_lock
> and the devlink instance lock?

This claim is based on the incorrect assumption that devlink instances
share the same lock class, while in fact each is unique. The
established lock order is devlink->lock -> state_lock and this doesn't
conflict with state_lock -> shd->lock.

> 
> > @@ -1159,26 +1117,25 @@ int mlx5_esw_qos_set_vport_rate(struct
> > mlx5_vport *vport, u32 max_rate, u32 min_
> >     struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
> >     int err;
> >  
> > -   esw_qos_lock(esw);
> > +   devl_lock(esw->dev->shd);
> 
> This function is invoked by mlx5_eswitch_set_vport_rate(), which acts
> as the
> handler for the ndo_set_vf_rate netdev callback. The core network
> stack
> executes ndo_set_vf_rate with the rtnl_lock held, establishing the
> lock
> order: rtnl_lock -> devlink->lock.
> 
> However, the devlink subsystem enforces the opposite locking order
> (devlink->lock -> rtnl_lock). Because all devlink locks share the
> same
> lockdep class, does this inversion trigger a circular locking
> dependency
> lockdep splat and pose an ABBA deadlock risk?

Same as above, there's no ABBA here since devlink instances use
different lock classes.
Perhaps Sashiko (I think) could be taught this to reduce false
positives?

> 
> [ ... ]
> 
> > @@ -1859,13 +1763,15 @@ mlx5_esw_qos_vport_update_parent(struct
> > mlx5_vport *vport,
> >                                               parent ? : esw-
> > >qos.root,
> >                                               extack);
> >     }
> > -   esw_qos_unlock(esw);
> > +
> >     return err;
> >  }
> >  
> >  void mlx5_esw_qos_vport_clear_parent(struct mlx5_vport *vport)
> >  {
> > +   devl_lock(vport->dev->shd);
> >     mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
> > +   devl_unlock(vport->dev->shd);
> >  }
> 
> During device teardown, execution paths such as
> mlx5_unload_one_devl_locked()
> hold the primary devlink instance lock. This path invokes
> mlx5_eswitch_unload_vf_vports() ->
> mlx5_esw_offloads_devlink_port_unregister()
> -> mlx5_esw_qos_vport_clear_parent().
> 
> Because both the primary devlink instance lock and the shd devlink
> instance
> lock are initialized from the same site (devlink_alloc), they share
> the same
> lockdep class.
> 
> Does calling devl_lock() directly (instead of using the devlink
> core's
> mutex_lock_nested capability) while a lock of the same class is
> already held
> trigger a recursive locking lockdep warning?

Again, same false positive. Devlinks use unique lock classes.

Cosmin.

Re: [PATCH net-next V9 11/14] net/mlx5: qos: Remove qos domains and use shd lock

Reply via email to