From: AI Reviewer <[email protected]>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: mana: Create separate EQs for each vPort
This patch moves EQ allocation from per-device to per-vPort, exports
mana_create_eq()/mana_destroy_eq() for use by the RDMA driver, takes
a vport reference for RSS QPs, and adds cross-port PD-sharing
rejection for raw QPs along with serialization between
mana_set_channels() and the RDMA vport configuration.
> diff --git a/drivers/infiniband/hw/mana/main.c
> b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..f8a9013f0ca3 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port,
> struct mana_ib_pd *pd,
>
> pd->vport_use_count++;
> if (pd->vport_use_count > 1) {
> + /* Reject cross-port PD sharing. EQs and vport config
> + * are per-port, so the PD must stay bound to the port
> + * that was configured on the first raw QP creation.
> + */
> + if (pd->vport_port != port) {
> + pd->vport_use_count--;
> + mutex_unlock(&pd->vport_mutex);
> + ibdev_dbg(&dev->ib_dev,
> + "PD already bound to port %u\n",
> + pd->vport_port);
> + return -EINVAL;
> + }
> ibdev_dbg(&dev->ib_dev,
> "Skip as this PD is already configured vport\n");
> mutex_unlock(&pd->vport_mutex);
> return 0;
> }
>
> - err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
> + pd->vport_port = port;
> +
> + err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
> if (err) {
> pd->vport_use_count--;
> mutex_unlock(&pd->vport_mutex);
[Low]
In the failure path of mana_ib_cfg_vport(), if mana_cfg_vport() or the
later mana_create_eq() fails, vport_use_count is rolled back to 0 but
pd->vport_port keeps the value just written.
Is the contract that pd->vport_port is only meaningful while
vport_use_count > 0? That invariant is not documented in the struct
mana_ib_pd comment, and a future reader who consults pd->vport_port
without first checking vport_use_count under vport_mutex would see stale
data. Would it be worth either resetting pd->vport_port on the rollback
path or documenting the rule explicitly?
> @@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port,
> struct mana_ib_pd *pd,
> return err;
> }
>
> - mutex_unlock(&pd->vport_mutex);
>
> - pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> - pd->tx_vp_offset = mpc->tx_vp_offset;
> + err = mana_create_eq(mpc);
> + if (err) {
> + mana_uncfg_vport(mpc);
> + pd->vport_use_count--;
> + } else {
> + pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> + pd->tx_vp_offset = mpc->tx_vp_offset;
> + }
[ ... ]
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134..108ec4c5ce51 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>
> qp->port = port;
>
> + /* Take a reference on the vport to ensure EQs outlive this QP.
> + * The vport must already be configured by a raw QP.
> + */
> + mutex_lock(&mana_pd->vport_mutex);
> + if (!mana_pd->vport_use_count) {
> + mutex_unlock(&mana_pd->vport_mutex);
> + ret = -EINVAL;
> + goto fail;
> + }
> + mana_pd->vport_use_count++;
> + mutex_unlock(&mana_pd->vport_mutex);
> +
[Critical]
The cross-port PD-sharing check added to mana_ib_cfg_vport() is not
mirrored here. mana_ib_create_qp_rss() takes the user-supplied port
from ucmd.port and only checks that vport_use_count is non-zero, so the
RSS path will happily attach to a different port than the one bound to
the PD by the first raw QP.
If a user creates a raw QP on port A (which sets pd->vport_port=A and
allocates mpc_A->eqs via mana_create_eq()), then creates an RSS QP on
the same PD with ucmd.port=B where port B has no Ethernet up and no
RDMA activity, mpc_B->eqs is NULL. The subsequent code in this same
function:
eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
would then dereference NULL through mpc_B->eqs.
If port B does have Ethernet up, mpc_B->eqs is owned by the Ethernet
driver and the RSS QP attaches to those EQs. When the QP is destroyed
mana_ib_destroy_qp_rss() calls mana_ib_uncfg_vport(mdev, pd, qp->port=B),
and once pd->vport_use_count reaches 0 mana_ib_uncfg_vport() will run
mana_destroy_eq(mpc_B) on Ethernet's live EQs and call mana_uncfg_vport
on a port whose apc->vport_use_count was never incremented by this PD,
tripping the WARN_ON in mana_uncfg_vport() and corrupting Ethernet's
vport state. Meanwhile port A's EQs and vport configuration are leaked
because nothing on this PD will destroy them.
Should mana_ib_create_qp_rss() apply the same pd->vport_port == port
check that mana_ib_cfg_vport() now performs, before bumping
vport_use_count?
[Medium]
The new "vport must already be configured by a raw QP" requirement is a
visible uAPI change. Before this patch, EQs were allocated device-wide
on mana_context at probe time, so an RSS QP could be created without any
prior raw QP on the PD. After the patch, applications that legitimately
use only RSS QPs (RX-only RDMA workloads) will see -EINVAL here unless
they also create a raw QP first.
Was breaking the RSS-only case considered, and would lazy EQ creation on
the first QP regardless of type be feasible to avoid the
incompatibility?
> for (i = 0; i < ind_tbl_size; i++) {
> struct mana_obj_spec wq_spec = {};
> struct mana_obj_spec cq_spec = {};
> @@ -171,13 +184,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
> cq_spec.gdma_region = cq->queue.gdma_region;
> cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
> cq_spec.modr_ctx_id = 0;
> - eq = &mpc->ac->eqs[cq->comp_vector];
> + eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
> cq_spec.attached_eq = eq->eq->id;
[ ... ]
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 04350973e19e..4633acc976f0 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -462,10 +467,22 @@ static int mana_set_channels(struct net_device *ndev,
> unsigned int new_count = channels->combined_count;
> unsigned int old_count = apc->num_queues;
> int err;
>
> + /* Set channel_changing to block RDMA from grabbing the vport
> + * during the detach/attach window. mana_cfg_vport() checks
> + * this flag under vport_mutex and returns -EBUSY if set.
> + */
> + mutex_lock(&apc->vport_mutex);
> + if (!apc->port_is_up && apc->vport_use_count) {
> + mutex_unlock(&apc->vport_mutex);
> + return -EBUSY;
> + }
> + apc->channel_changing = true;
> + mutex_unlock(&apc->vport_mutex);
[Low]
The commit message says:
When the port is down, apc->vport_mutex is held for the entire
operation since mana_detach()/mana_attach() do not take vport_mutex
in that case. When the port is up, Ethernet owns the vport
exclusively so no additional locking is needed.
Does this description match the code? mana_set_channels() drops
apc->vport_mutex immediately after setting channel_changing=true and
runs mana_pre_alloc_rxbufs(), mana_detach() and mana_attach() without
the mutex held, then re-acquires the mutex only at the end to clear the
flag.
The actual serialization mechanism is the channel_changing boolean
checked by mana_cfg_vport() under vport_mutex, not a long mutex hold.
Could the commit message be updated to match the implemented design so
that future readers do not assume a different invariant?
--
This is an AI-generated review.