On Tue, 2026-03-31 at 12:53 +0000, Cosmin Ratiu wrote: > On Mon, 2026-03-30 at 19:08 -0700, Jakub Kicinski wrote: > > This is an AI-generated review of your patch. The human sending > > this > > email has considered the AI review valid, or at least plausible. > > --- > > net/mlx5: qos: Remove qos domains and use shd lock > > > > This commit removes QoS domains and switches to using the shd lock > > to > > protect against concurrent QoS modifications. It also enables the > > supported_cross_device_rate_nodes devlink ops attribute. > > > > Link: > > https://lore.kernel.org/netdev/[email protected]/ > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c > > > b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c > > > index 6698ac55a4bfa..c051605fecd23 100644 > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c > > > @@ -385,6 +385,7 @@ static const struct devlink_ops > > > mlx5_devlink_ops = { > > > .rate_node_del = mlx5_esw_devlink_rate_node_del, > > > .rate_leaf_parent_set = > > > mlx5_esw_devlink_rate_leaf_parent_set, > > > .rate_node_parent_set = > > > mlx5_esw_devlink_rate_node_parent_set, > > > + .supported_cross_device_rate_nodes = true, > > > > The commit message states that enabling > > supported_cross_device_rate_nodes > > now is safe because mlx5_esw_qos_vport_update_parent rejects cross- > > esw > > parent updates. > > > > When a user provides a DEVLINK_ATTR_PARENT_DEV belonging to an > > unrelated > > device, the devlink core (devlink_nl_rate_parent_node_set) drops > > the > > lock on > > the parent devlink and calls devlink_rate_node_get_by_name(). This > > function > > traverses the rate_list of the unlocked parent devlink instance. > > > > Does this leave the devlink core exposed to data races and > > potential > > use-after-free vulnerabilities, since the driver's rejection > > callback > > executes only after this lockless list traversal has occurred? > > I will address this by adding common ancestor validation in devlink > rate core, which was missing. > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c > > > b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c > > > index 0be516003bcde..f67f99428959f 100644 > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c > > > @@ -1110,7 +1068,7 @@ void mlx5_esw_qos_vport_disable(struct > > > mlx5_vport *vport) > > > struct mlx5_esw_sched_node *parent; > > > > > > lockdep_assert_held(&esw->state_lock); > > > - esw_qos_lock(esw); > > > + devl_lock(esw->dev->shd); > > > > Could esw->dev->shd be NULL here? > > > > In mlx5_shd_init(), initialization is explicitly skipped for non-PF > > devices, > > meaning Sub-Functions (SFs) acting as E-Switch managers will have a > > NULL > > shd pointer. If the device lacks VPD data, pci_vpd_alloc() returns > > - > > ENODEV, > > causing mlx5_shd_init() to return 0 but leaving dev->shd as NULL. > > > > When QoS operations are attempted on these devices, calling > > devl_lock() > > dereferences the pointer, which could cause an immediate NULL > > pointer > > dereference and kernel panic. > > Right. This is a race with Jiri's fix ([1]), which changed the > assumption this code was architected with that dev->shd is always > initialized. That is no longer the case, and the consequence is that > this needs to 1) not unconditionally use dev->shd and 2) use a > fallback > lock when it isn't available.
Forgot: [1] https://lore.kernel.org/netdev/[email protected]/

