> From: 'Simon Horman' <[email protected]>
> 
> This is an AI-generated review of your patch. The human sending this email has
> considered the AI review valid, or at least plausible.
> Full review at:
> https://netdev-ai.bots.linux.dev/sashiko/
> 
> I apologise for any overlap with this and the review of this patchset
> available at sashiko.dev. They both seem relevant and I'm unsure how to
> cleanly pass them both on.
> 
> Both Sashiko instances also flagged concerns in the remaining patches of
> this patchset. But I am not forwarding them as they seem to be lower
> priority, and I think I've already created enough noise here.
> 
> ---
> net: mana: Query device capabilities and configure MSI-X sharing for EQs
> 
> This patch queries device capabilities early in GDMA probe and chooses
> between dedicated MSI-X per vPort and a shared-MSI-X mode based on the
> available vector count. It introduces gc->max_num_queues_vport,
> gc->msi_sharing and gc->msi_bitmap, and renames mana_query_device_cfg()
> to mana_gd_query_device_cfg() so it can be called from the GDMA layer.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..b96859e0aec9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -149,6 +149,9 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> >     struct gdma_context *gc = pci_get_drvdata(pdev);
> >     struct gdma_query_max_resources_resp resp = {};
> >     struct gdma_general_req req = {};
> > +   unsigned int max_num_queues;
> > +   u8 bm_hostmode;
> > +   u16 num_ports;
> >     int err;
> >
> >     mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
> > @@ -194,6 +197,40 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> >     if (gc->max_num_queues > gc->num_msix_usable - 1)
> >             gc->max_num_queues = gc->num_msix_usable - 1;
> >
> > +   err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
> > +                                  MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
> > +   if (err)
> > +           return err;
> 
> Is this the same MANA_QUERY_DEV_CONFIG that mana_probe() issues shortly
> after?  mana_probe() still contains:
> 
>       err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
>                                      MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
> 
> so the same HWC round-trip runs twice on every probe with identical arguments,
> both populating gc->adapter_mtu.
> 
> The commit message frames this rename as consolidation ("used at GDMA device
> probe time"), but the mana_probe() call-site is not removed.  Was one of
> the two call-sites meant to go away?
> 
> The relocation of debugfs_create_u16("adapter-MTU", ...) from
> mana_gd_query_device_cfg() into mana_probe() only makes sense if the function
> is now invoked more than once, which hints at the same duplication.

The two calls are intentional and serve different layers. The call in
mana_gd_query_max_resources() runs at the GDMA layer during device setup: it
needs num_ports to compute the per-vPort MSI-X allocation before any
MANA-level structures exist. The call in mana_probe() runs at the MANA layer
and consumes the results differently: it populates ac->num_ports and
ac->bm_hostmode, and on resume validates that the port count hasn't changed.

The HWC round-trip is lightweight (a single command/response exchange), and
the function is renamed to mana_gd_ precisely because it now has callers at
both layers. The debugfs move follows the same reasoning: the debugfs node
should only be created once, so it moves to mana_probe(), which runs once,
rather than staying in a function that is now called from both sites.
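
For reference, the MANA-layer consumer looks roughly like this (a sketch of
the mana_probe() logic from memory; the resuming flag and the exact error
handling may differ from what is in the tree):

	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
				       MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
	if (err)
		return err;

	if (!resuming) {
		/* first probe: record the counts at the MANA layer */
		ac->num_ports = num_ports;
		ac->bm_hostmode = bm_hostmode;
	} else if (ac->num_ports != num_ports) {
		/* resume: the port count must not change under us */
		return -EPROTO;
	}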

> 
> > +
> > +   if (!num_ports)
> > +           return -EINVAL;
> > +
> > +   /*
> > +    * Adjust gc->max_num_queues returned from the SOC to allow dedicated
> > +    * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
> > +    */
> > +   max_num_queues = (gc->num_msix_usable - 1) / num_ports;
> > +   max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
> > +   if (max_num_queues < MANA_DEF_NUM_QUEUES)
> > +           max_num_queues = MANA_DEF_NUM_QUEUES;
> > +
> > +   /*
> > +    * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
> > +    * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
> > +    */
> > +   max_num_queues = min(gc->max_num_queues, max_num_queues);
> > +   if (max_num_queues * num_ports > gc->num_msix_usable - 1)
> > +           gc->msi_sharing = true;
> 
> Is gc->msi_sharing ever reset to false?  The only two writers are this line
> and mana_gd_setup_hwc_irqs() (the !pci_msix_can_alloc_dyn branch), and both
> only set it to true.  mana_gd_remove_irqs() frees msi_bitmap and zeros
> max_num_msix / num_msix_usable, but does not clear msi_sharing, and the
> gdma_context survives mana_gd_suspend() / mana_gd_resume().
> 
> Once true is latched in any setup cycle, mana_gd_setup() will always take the
> sharing branch on resume even if the recomputed max_num_queues * num_ports
> fits within num_msix_usable - 1.  Is that consistent with the commit message
> saying sharing "is only enabled when there are not enough MSI-X vectors for
> dedicated allocation"?

msi_sharing is one-directional by design, but this is not a problem in
practice. The inputs to the sharing computation - num_msix_usable, num_ports,
and the hardware queue limits - are properties of the hardware and do not
change across suspend/resume cycles. The gdma_context is kzalloc'd at probe
time (so msi_sharing starts as false), and on any given hardware
configuration the computation in mana_gd_query_max_resources() always
produces the same result.

On resume, mana_gd_setup_hwc_irqs() re-evaluates dynamic MSI-X support and
mana_gd_query_max_resources() recomputes the sharing check from the same
hardware parameters. Since the hardware doesn't change, the flag converges
to the same value. Adding an explicit reset would be defensive but not
functionally necessary.
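
If an explicit reset were wanted anyway, it would be a one-line change at
the top of the recomputation in mana_gd_query_max_resources() (a sketch
only, not part of this series):

	/* Sketch: recompute sharing from scratch on every setup cycle
	 * instead of latching true across suspend/resume.
	 */
	gc->msi_sharing = false;
	if (max_num_queues * num_ports > gc->num_msix_usable - 1)
		gc->msi_sharing = true;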

> 
> > +
> > +   /* If MSI is shared, use max allowed value */
> > +   if (gc->msi_sharing)
> > +           gc->max_num_queues_vport = min(gc->num_msix_usable - 1, gc->max_num_queues);
> > +   else
> > +           gc->max_num_queues_vport = max_num_queues;
> > +
> > +   dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
> > +            gc->msi_sharing, gc->max_num_queues);
> 
> Should this print gc->max_num_queues_vport rather than
> gc->max_num_queues?  The block immediately above computes
> gc->max_num_queues_vport and leaves gc->max_num_queues unchanged, and
> mana_probe_port() sizes alloc_etherdev_mq() and apc->max_queues from
> gc->max_num_queues_vport:
> 
>       ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
>                                gc->max_num_queues_vport);
>       ...
>       apc->max_queues = gc->max_num_queues_vport;
>       apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);
> 
> so the logged value does not match the per-vPort queue count the driver
> actually exposes.


The printed value is intentional. gc->max_num_queues is the GDMA device-level
maximum, which is consumed by both the Ethernet and RDMA drivers; the RDMA
driver uses it to size its queue resources. Printing max_num_queues_vport
here would only reflect the Ethernet per-vPort limit and would be misleading
in a device-level log message. The per-vPort value is already visible through
ethtool channel reporting.
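
That said, if the single number reads as ambiguous, the message could name
both levels explicitly (a sketch, not what the patch does):

	dev_info(gc->dev, "MSI sharing mode %d, device max queues %d, per-vPort max queues %d\n",
		 gc->msi_sharing, gc->max_num_queues, gc->max_num_queues_vport);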

> 
> > +
> >     return 0;
> >  }
> >
> > @@ -1856,6 +1893,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
> 
> [ ... ]
> 
> > @@ -1968,20 +2008,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
> >     if (err)
> >             goto destroy_hwc;
> >
> > -   err = mana_gd_query_max_resources(pdev);
> > +   err = mana_gd_detect_devices(pdev);
> >     if (err)
> >             goto destroy_hwc;
> >
> > -   err = mana_gd_setup_remaining_irqs(pdev);
> > -   if (err) {
> > -           dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> > -           goto destroy_hwc;
> > -   }
> > -
> > -   err = mana_gd_detect_devices(pdev);
> > +   err = mana_gd_query_max_resources(pdev);
> >     if (err)
> >             goto destroy_hwc;
> >
> > +   if (!gc->msi_sharing) {
> > +           gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
> > +           if (!gc->msi_bitmap) {
> > +                   err = -ENOMEM;
> > +                   goto destroy_hwc;
> > +           }
> > +           /* Set bit for HWC */
> > +           set_bit(0, gc->msi_bitmap);
> > +   } else {
> > +           err = mana_gd_setup_remaining_irqs(pdev);
> > +           if (err) {
> > +                   dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> > +                   goto destroy_hwc;
> > +           }
> > +   }
> > +
> 
> Can the driver bring up any vPort after this change when the !msi_sharing
> branch is taken?
> 
> In the dedicated branch, only gc->msi_bitmap is allocated and bit 0 is set
> for HWC.  mana_gd_setup_remaining_irqs() is skipped, so no gdma_irq_context
> is inserted into gc->irq_contexts for indices 1..num_msix_usable-1.
> 
> Later, mana_create_eq() still assigns
> 
>       spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
> 
> and mana_gd_register_irq() does:
> 
>       gic = xa_load(&gc->irq_contexts, msi_index);
>       if (WARN_ON(!gic))
>               return -EINVAL;
> 
> On a typical cloud SKU with, for example, num_msix_usable=32,
> num_ports=1 and num_online_cpus=16, the new math keeps msi_sharing=false
> (16 * 1 <= 31), so every EQ-create goes down this path and hits the WARN_ON.
> Doesn't that make every vPort open and every resume fail for the common
> dedicated-MSI-X case?
> 
> The msi_bitmap allocated here is not consumed anywhere in this commit; the
> on-demand allocation via mana_gd_get_gic() appears in the later commit
> "net: mana: Allocate interrupt context for each EQ when creating vPort"
> (dbbdf40a8974).  Should the bitmap and the new branch be introduced in the
> same commit that actually uses them, so each commit in the series is
> independently bootable?

You're right that the non-sharing EQ creation path is not fully functional
until patch 5 wires mana_gd_get_gic() into mana_create_eq(). However, this
is a new capability being built incrementally: patch 2 introduces the
decision framework and bitmap, patch 3 adds the GIC infrastructure, patch 4
converts global IRQ setup to use it, and patch 5 integrates it into
per-vPort EQ creation.

The intermediate state between patches 2 and 5 results in a clean error
(-EINVAL from mana_gd_register_irq), not a crash or data corruption. The
dedicated MSI-X mode is a new feature that did not exist before this series,
so there is no regression from the pre-patch baseline. Restructuring to make
it functional at each intermediate commit would require squashing the GIC
infrastructure into this patch, producing a significantly larger and
harder-to-review change. I'd prefer to keep the logical separation as-is.
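
If independent bootability becomes a hard requirement, a lighter-weight
option than squashing would be to force the sharing path in this commit and
lift the restriction in patch 5 (a hypothetical interim guard, not something
this series does):

	/* Hypothetical: keep the pre-series behaviour - all EQs on the
	 * shared IRQ setup - until the dedicated-MSI-X consumer
	 * (mana_gd_get_gic() in patch 5) lands, then drop this line there.
	 */
	gc->msi_sharing = true;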
