> From: 'Simon Horman' <[email protected]>
>
> This is an AI-generated review of your patch. The human sending this email has
> considered the AI review valid, or at least plausible.
> Full review at:
> https://sashiko.dev/
> ---
> net: mana: Create separate EQs for each vPort
>
> This change moves EQ ownership from struct mana_context to struct
> mana_port_context so each vPort can have dedicated EQs, and exports
> mana_create_eq()/mana_destroy_eq() for use by the MANA RDMA driver.
>
> > diff --git a/drivers/infiniband/hw/mana/main.c
> > b/drivers/infiniband/hw/mana/main.c
> > index ac5e75dd3494..60cc02e4ad10 100644
> > --- a/drivers/infiniband/hw/mana/main.c
> > +++ b/drivers/infiniband/hw/mana/main.c
> > @@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
> >     pd->vport_use_count--;
> >     WARN_ON(pd->vport_use_count < 0);
> >
> > -   if (!pd->vport_use_count)
> > +   if (!pd->vport_use_count) {
> > +           mana_destroy_eq(mpc);
> >             mana_uncfg_vport(mpc);
> > +   }
> >
> >     mutex_unlock(&pd->vport_mutex);
> >  }
> > @@ -55,15 +57,21 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
> >             return err;
> >     }
> >
> > -   mutex_unlock(&pd->vport_mutex);
> >
> >     pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> >     pd->tx_vp_offset = mpc->tx_vp_offset;
> > +   err = mana_create_eq(mpc);
> > +   if (err) {
> > +           mana_uncfg_vport(mpc);
> > +           pd->vport_use_count--;
> > +   }
> > +
> > +   mutex_unlock(&pd->vport_mutex);
> >
> >     ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> >               mpc->port_handle, pd->pdn, doorbell_id);
>
> On the failure path, mana_uncfg_vport() is called and pd->vport_use_count is
> decremented, but execution falls through to the ibdev_dbg() line that prints
> the vport handle as if it were a success. Should this debug print be skipped
> on the error path so logs are not ambiguous?

This is a fair point. The ibdev_dbg could be guarded with if (!err) for
clarity. That said, ibdev_dbg is compiled out in non-debug builds and the
error itself is returned to the caller, so this is cosmetic. Happy to add the
guard if you feel strongly.
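
Something like this is what I had in mind, reusing the variables already in
scope in that hunk (sketch only):

        if (!err)
                ibdev_dbg(&dev->ib_dev,
                          "vport handle %llx pdid %x doorbell_id %x\n",
                          mpc->port_handle, pd->pdn, doorbell_id);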

>
> Also, pd->tx_shortform_allowed and pd->tx_vp_offset are assigned from mpc
> before the mana_create_eq() attempt and are not cleared when that call fails.
> Callers are expected to gate on pd->vport_use_count, but is it worth resetting
> these fields on failure so the state matches the pre-patch behavior?

These fields are only consumed when pd->vport_use_count > 0. On failure the
count is decremented back to 0, so the stale values are never read. The next
successful mana_ib_cfg_vport() call will overwrite them. This matches
pre-patch behavior: pre-patch also assigned these fields unconditionally and
relied on the use count to gate access.

>
> The mutex_unlock(&pd->vport_mutex) has moved from before the field
> assignments to after mana_create_eq(). That now holds pd->vport_mutex across
> GFP_KERNEL allocations, debugfs directory creation, and multiple
> mana_gd_create_mana_eq() firmware RPCs, and establishes the lock order
> pd->vport_mutex -> apc->vport_mutex via mana_uncfg_vport() on the error
> path. Is the extended critical section intentional?

Yes, this is intentional. Pre-patch, pd->vport_mutex was released before
assigning pd->tx_shortform_allowed and pd->tx_vp_offset, which was a data
race: a concurrent raw QP creation on the same PD could read
partially-initialized state. The extended critical section ensures the vport
is fully configured (including EQs) before any concurrent user can observe
vport_use_count > 0.
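
For reference, the ordering this aims for looks roughly like the following
(a sketch of the intent, not the literal post-patch code):

        mutex_lock(&pd->vport_mutex);
        /* configure the vport and bump pd->vport_use_count ... */

        pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
        pd->tx_vp_offset = mpc->tx_vp_offset;

        /* may sleep: GFP_KERNEL allocations and firmware RPCs */
        err = mana_create_eq(mpc);
        if (err) {
                mana_uncfg_vport(mpc);   /* takes apc->vport_mutex inside */
                pd->vport_use_count--;
        }

        mutex_unlock(&pd->vport_mutex);  /* state published only here */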

>
> >
> > -   return 0;
> > +   return err;
> >  }
> >
> >  int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
> > diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> > index 645581359cee..6f1043383e8c 100644
> > --- a/drivers/infiniband/hw/mana/qp.c
> > +++ b/drivers/infiniband/hw/mana/qp.c
> > @@ -168,7 +168,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
> >             cq_spec.gdma_region = cq->queue.gdma_region;
> >             cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
> >             cq_spec.modr_ctx_id = 0;
> > -           eq = &mpc->ac->eqs[cq->comp_vector];
> > +           /* EQs are created when a raw QP configures the vport.
> > +            * A raw QP must be created before creating rwq_ind_tbl.
> > +            */
> > +           if (!mpc->eqs) {
> > +                   ret = -EINVAL;
> > +                   i--;
> > +                   goto fail;
> > +           }
> > +           eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>
> Can the mpc->eqs read race with a concurrent free? mana_ib_create_qp_rss()
> runs without pd->vport_mutex or RTNL, but mpc->eqs is freed by
> mana_destroy_eq() from two paths:
>
>   mana_ib_uncfg_vport()   (under pd->vport_mutex, on last raw-QP destroy)
>   mana_dealloc_queues()   (under RTNL, on netdev down)
>
> both of which do:
>
>   kfree(apc->eqs);
>   apc->eqs = NULL;
>
> with no RCU grace period or reader-visible synchronization. If CPU-A passes
> the !mpc->eqs check after CPU-B begins ip link set dev X down, does CPU-A then
> dereference freed memory via mpc->eqs[...].eq->id?

These two paths cannot run concurrently with RDMA QP creation on the same
port. mana_cfg_vport() enforces mutual exclusion between Ethernet and RDMA
via apc->vport_use_count:

       mutex_lock(&apc->vport_mutex);
       if (apc->vport_use_count > 0) {
               mutex_unlock(&apc->vport_mutex);
               return -EBUSY;
       }
       apc->vport_use_count++;

If RDMA holds the vport (created a raw QP), Ethernet cannot bring the
interface up, so mana_dealloc_queues() cannot run. If Ethernet holds the
vport, mana_ib_cfg_vport() → mana_cfg_vport() returns -EBUSY and no RDMA raw
QP is created, so mpc->eqs belongs exclusively to Ethernet.

The mana_ib_uncfg_vport() path requires destroying the last raw QP on the PD,
which means no new RDMA QPs should be in flight on that PD. The IB core
serializes QP creation/destruction on the same device context.
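
For completeness, the release side is the mirror image (paraphrasing
mana_uncfg_vport() from memory, so treat the exact body as a sketch):

        void mana_uncfg_vport(struct mana_port_context *apc)
        {
                mutex_lock(&apc->vport_mutex);
                apc->vport_use_count--;
                WARN_ON(apc->vport_use_count < 0);
                mutex_unlock(&apc->vport_mutex);
        }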

>
> Separately, what populates mpc->eqs for an RDMA-only RSS QP user on a
> probed-but-idle Ethernet port? Pre-patch mana_probe() called
> mana_create_eq(ac) unconditionally, so ac->eqs existed for the device
> lifetime. After this patch the only creators are mana_alloc_queues()
> (Ethernet up) and mana_ib_cfg_vport() (raw QP). An RDMA application that
> uses only RSS QPs and never creates a raw QP will now get -EINVAL here where
> it used to succeed. Is this intended, and should the commit log mention it?

This is intentional. An RSS-only RDMA application (no raw QP) could never
have worked in practice: mana_ib_create_qp_rss() calls
mana_create_wq_obj(mpc, mpc->port_handle, ...), which requires the vport to
be configured. Without a raw QP calling mana_ib_cfg_vport() →
mana_cfg_vport(), mpc->port_handle is INVALID_MANA_HANDLE and the firmware
call would fail. The -EINVAL is a cleaner early error for a path that was
already broken.
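
Concretely, pre-patch an RSS-only user would already have failed at this call
(sketch; the trailing arguments and the result variable are abbreviated from
memory):

        /* No raw QP ever called mana_ib_cfg_vport() -> mana_cfg_vport(),
         * so mpc->port_handle still holds its probe-time value of
         * INVALID_MANA_HANDLE and the firmware rejects the request.
         */
        ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
                                 &wq_spec, &cq_spec, &wq_obj);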

>
> The adjacent comment:
>
>    /* EQs are created when a raw QP configures the vport.
>     * A raw QP must be created before creating rwq_ind_tbl.
>     */
>
> omits the Ethernet-up path that also populates mpc->eqs. Would it be clearer
> to describe both creators?

The comment is in the RDMA code path. Due to the mana_cfg_vport() mutual
exclusion described above, when RDMA is executing this code it owns the
vport, so the EQs were created by the raw QP, not by Ethernet. The Ethernet
path is not reachable when RDMA holds the port. The comment is accurate for
the context it appears in.

>
> There is also a behavior change in the index expression:
>
>    eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>
> Pre-patch this was ac->eqs[cq->comp_vector] sized by gc->max_num_queues.
> Now comp_vector is folded modulo mpc->num_queues, which is tunable via
> ethtool -L. Userspace that used distinct comp_vector values to hit distinct
> EQs will silently alias when comp_vector >= num_queues. Should this be
> documented or rejected with -EINVAL rather than silently wrapped?

The modulo wrap is the correct behavior. The EQ array is now sized to
mpc->num_queues (the actual configured queue count), not the hardware
maximum. Rejecting with -EINVAL would break applications that set
comp_vector based on the device's total num_comp_vectors, which is standard
RDMA practice. Wrapping is what other drivers do in this situation (e.g.,
mlx5). Pre-patch the wrap just happened to be unnecessary because the array
was oversized to the hardware max.
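
For context, the userspace pattern we don't want to break looks like this
(hypothetical verbs snippet; ctx, worker_id and cq_depth are placeholders):

        /* spread CQs across everything the device advertises; values are
         * only guaranteed to be < num_comp_vectors, not < the queue count
         * configured via ethtool -L
         */
        int vec = worker_id % ctx->num_comp_vectors;
        struct ibv_cq *cq = ibv_create_cq(ctx, cq_depth, NULL, NULL, vec);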

>
> Can mpc->num_queues be 0 at this point? mana_set_channels() does not reject
> new_count==0, and kzalloc_objs(struct mana_eq, 0) returns ZERO_SIZE_PTR,
> which passes the !mpc->eqs check. During the window between
> mana_create_eq() and the subsequent netif_set_real_num_tx_queues() failing, a
> concurrent RDMA QP create would compute
> cq->comp_vector % 0 here. Should mpc->num_queues be validated alongside
> mpc->eqs?

num_queues cannot be 0. The ethtool core validates this before calling the
driver: ethnl_set_channels() rejects the request if combined_count is 0 and
there are no separate rx/tx channels (which MANA doesn't use). Additionally,
num_queues is initialized at probe time to min(gc->max_num_queues_vport,
MANA_DEF_NUM_QUEUES), where MANA_DEF_NUM_QUEUES is 16, and the hardware
always reports at least 1 queue.

>
> The placement of the !mpc->eqs check is inside the per-iteration loop over
> ind_tbl_size, but mpc->eqs cannot change across iterations, so the check is
> only meaningful at i==0. It works today because i-- then makes i = -1 and
> the cleanup while (i-- > 0) skips. Would hoisting the check above the loop
> be clearer and less fragile against refactoring?
>
> >             cq_spec.attached_eq = eq->eq->id;
> >
> >             ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
> > @@ -317,7 +325,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
> >     cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
> >     cq_spec.modr_ctx_id = 0;
> >     eq_vec = send_cq->comp_vector;
> > -   eq = &mpc->ac->eqs[eq_vec];
> > +   if (!mpc->eqs) {
> > +           err = -EINVAL;
> > +           goto err_destroy_queue;
> > +   }
> > +   eq = &mpc->eqs[eq_vec % mpc->num_queues];
>
> The same mpc->eqs TOCTOU concern applies here: the check and subsequent
> dereference occur without pd->vport_mutex or RTNL held, so a concurrent
> mana_dealloc_queues() on the Ethernet side can free mpc->eqs between the
> check and the index access. Is there synchronization that prevents this?

Same answer as above: mana_cfg_vport() mutual exclusion prevents this. In
mana_ib_create_qp_raw(), mana_ib_cfg_vport() is called just before this code,
which calls mana_cfg_vport(). If Ethernet holds the port, that call returns
-EBUSY and we never reach the mpc->eqs access. If we reach it, RDMA owns the
vport and Ethernet cannot tear it down.
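
Abbreviated, the ordering in mana_ib_create_qp_raw() is (sketch; argument
lists trimmed and intermediate steps elided):

        err = mana_ib_cfg_vport(mdev, port, pd, doorbell); /* -> mana_cfg_vport() */
        if (err)
                return err;            /* -EBUSY if Ethernet holds the port */

        /* ... send CQ / queue setup elided ... */

        /* only reached while RDMA owns the vport, so mpc->eqs is stable */
        if (!mpc->eqs) {
                err = -EINVAL;
                goto err_destroy_queue;
        }
        eq = &mpc->eqs[eq_vec % mpc->num_queues];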

>
> The same comp_vector % num_queues aliasing and num_queues==0 divide
> concerns apply here as well.
>
> >     cq_spec.attached_eq = eq->eq->id;
> >
> >     err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index a654b3699c4c..6c709f8b875d 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
>
> [ ... ]
>
> > -static int mana_create_eq(struct mana_context *ac)
> > +int mana_create_eq(struct mana_port_context *apc)
> >  {
> > -   struct gdma_dev *gd = ac->gdma_dev;
> > +   struct gdma_dev *gd = apc->ac->gdma_dev;
> >     struct gdma_context *gc = gd->gdma_context;
> >     struct gdma_queue_spec spec = {};
> >     int err;
> >     int i;
> >
> > -   ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
> > -   if (!ac->eqs)
> > +   WARN_ON(apc->eqs);
> > +   apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
> > +   if (!apc->eqs)
> >             return -ENOMEM;
>
> WARN_ON(apc->eqs) is used as a defensive assertion but execution proceeds to
> overwrite apc->eqs unconditionally. If the invariant is violated, the
> previous array and all HW EQs it referenced are leaked. Should this instead
> bail out with an error or call mana_destroy_eq() first to keep the function
> idempotent against future callers?


The WARN_ON is a development assertion to catch double-init bugs. If it
fires, there is already a logic error in the caller. Silently cleaning up and
proceeding would mask the root cause. The standard kernel pattern for "this
must never happen" is WARN_ON + continue, not silent recovery. If we wanted
to be stricter, returning -EBUSY would be reasonable, but the current
approach is consistent with kernel convention for invariant checks.
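
If we did want to bail out, the stricter form would be (sketch):

        if (WARN_ON(apc->eqs))
                return -EBUSY;

        apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
        if (!apc->eqs)
                return -ENOMEM;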

>
> [ ... ]
>
> > @@ -3326,6 +3349,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
> >     mana_fence_rqs(apc);
> >
> >     /* Even in err case, still need to cleanup the vPort */
> > +   mana_destroy_rxqs(apc);
> > +   mana_destroy_txq(apc);
> > +   mana_destroy_eq(apc);
> >     mana_destroy_vport(apc);
>
> What is the intended interaction with outstanding RDMA QPs here? An RDMA
> user that created an RSS QP while the Ethernet netdev was up sees mpc->eqs
> populated by mana_alloc_queues(). When the admin issues
> ip link set dev N down, mana_dealloc_queues() -> mana_destroy_eq() destroys
> those EQs and frees mpc->eqs while the RDMA QPs are still alive, leaving the
> QPs with dangling attached_eq IDs at the hardware level and stale kernel
> references.

This scenario cannot occur. As described above, mana_cfg_vport() enforces
mutual exclusion: Ethernet and RDMA cannot hold the same port simultaneously.
If Ethernet is up, RDMA raw QP creation on that port returns -EBUSY. If RDMA
holds the port, mana_alloc_queues() → mana_create_vport() → mana_cfg_vport()
returns -EBUSY and the interface fails to come up. There is no state where
both have created EQs on the same mpc.

>
> Pre-patch ac->eqs lived for the full mana_context lifetime and was torn down
> only in mana_remove(). Is unconditionally destroying the EQs on netdev-down
> the intended new behavior, and if so how are concurrent RDMA consumers
> expected to recover?

Yes, destroying EQs on netdev-down is the intended new behavior, and no RDMA 
recovery path is needed because the scenario has no concurrent RDMA consumers.

mana_cfg_vport() enforces mutual exclusion at the hardware port level via 
apc->vport_use_count - it returns -EBUSY if the port is already held. Ethernet 
and RDMA cannot hold the same port
simultaneously:

- If Ethernet is up → RDMA raw QP creation fails at mana_cfg_vport() with 
-EBUSY → no RDMA EQs or QPs exist on that port → netdev-down destroys only 
Ethernet's own EQs.
- If RDMA holds the port → mana_alloc_queues() → mana_create_vport() → 
mana_cfg_vport() returns -EBUSY → interface never comes up → 
mana_dealloc_queues() never runs.

Pre-patch, the device-lifetime EQs in ac->eqs were shared across all ports and 
both subsystems, which masked this exclusivity - the EQs were always present 
regardless of who owned the port.
Post-patch, each port owns its EQs, and they follow the lifecycle of whoever 
holds the port. The exclusion guarantee means there is nothing to recover from.
