Re: [net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout

Dipayaan Roy Fri, 27 Feb 2026 00:07:41 -0800

On Tue, Feb 24, 2026 at 08:26:41AM +0000, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see 
> https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Trigger VF reset/recovery on health check failure due to HWC 
> timeout
> 
> The GF stats periodic query is used as mechanism to monitor HWC health
> check. If this HWC command times out, it is a strong indication that
> the device/SoC is in a faulty state and requires recovery.
> 
> Today, when a timeout is detected, the driver marks
> hwc_timeout_occurred, clears cached stats, and stops rescheduling the
> periodic work. However, the device itself is left in the same failing
> state.
> 
> Extend the timeout handling path to trigger the existing MANA VF
> recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
> This is expected to initiate the appropriate recovery flow by suspende
> resume first and if it fails then trigger a bus rescan.
> 
> This change is intentionally limited to HWC command timeouts and does
> not trigger recovery for errors reported by the SoC as a normal command
> response.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c 
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 0055c231acf6..16c438d2aaa3 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> 
> [ ... ]
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c 
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9b5a72ada5c4..e6a4034b40f9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> 
> [ ... ]
> 
> > @@ -3530,6 +3530,8 @@ static void mana_gf_stats_work_handler(struct 
> > work_struct *work)
> >  {
> >     struct mana_context *ac =
> >             container_of(to_delayed_work(work), struct mana_context, 
> > gf_stats_work);
> > +   struct gdma_context *gc = ac->gdma_dev->gdma_context;
> > +   struct mana_serv_work *mns_wk;
> >     int err;
> >
> >     err = mana_query_gf_stats(ac);
> > @@ -3537,6 +3539,30 @@ static void mana_gf_stats_work_handler(struct 
> > work_struct *work)
> >             /* HWC timeout detected - reset stats and stop rescheduling */
> >             ac->hwc_timeout_occurred = true;
> >             memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
> > +           dev_warn(gc->dev,
> > +                    "Gf stats wk handler: gf stats query timed out.\n");
> > +
> > +           /* As HWC timed out, indicating a faulty HW state and needs a
> > +            * reset.
> > +            */
> > +           if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
> > +                   if (!try_module_get(THIS_MODULE)) {
> > +                           dev_info(gc->dev, "Module is unloading\n");
> > +                           return;
> > +                   }
> > +
> > +                   mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
> > +                   if (!mns_wk) {
> > +                           module_put(THIS_MODULE);
> > +                           return;
> > +                   }
> 
> Does this leave the GC_IN_SERVICE flag permanently set? When
> test_and_set_bit() atomically sets the bit and then try_module_get()
> fails or kzalloc() fails, the function returns without clearing
> GC_IN_SERVICE.
> 
> This would permanently block all future recovery attempts. Looking at
> mana_gd_process_eqe() in gdma_main.c:
>
ACK. 
>       if (test_bit(GC_IN_SERVICE, &gc->flags)) {
>               dev_info(gc->dev, "Already in service\n");
>               break;
>       }
> 
> It would detect the set bit and skip the reset. Similarly, mana_tx_timeout()
> in mana_en.c would also short-circuit:
> 
>       if (test_bit(GC_IN_SERVICE, &gc->flags))
>               return;
> 
> The device cannot recover without manual intervention since the flag is only
> cleared by mana_serv_reset() at the end of the service work:
> 
> drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_serv_reset()
> {
>       ...
> out:
>       clear_bit(GC_IN_SERVICE, &gc->flags);
> }
> 
> But since the work is never scheduled in these error paths, clear_bit() is
> never called.
> 
> Note that the existing mana_gd_process_eqe() path avoids this problem by
> calling set_bit(GC_IN_SERVICE) only AFTER all allocations succeed:
> 
> drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_gd_process_eqe()
> {
>       ...
>       if (test_bit(GC_IN_SERVICE, &gc->flags)) {
>               dev_info(gc->dev, "Already in service\n");
>               break;
>       }
> 
>       if (!try_module_get(THIS_MODULE)) {
>               ...
>               break;
>       }
> 
>       mns_wk = kzalloc(sizeof(*mns_wk), GFP_KERNEL);
>       if (!mns_wk) {
>               module_put(THIS_MODULE);
>               break;
>       }
> 
>       dev_info(gc->dev, "Start MANA service type:%d\n", type);
>       set_bit(GC_IN_SERVICE, &gc->flags);  // Set only after success
>       ...
> }
> 
> Should clear_bit(GC_IN_SERVICE, &gc->flags) be added before each error
> return?
> 
ACK.
> > +
> > +                   mns_wk->pdev = to_pci_dev(gc->dev);
> > +                   mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST;
> > +                   pci_dev_get(mns_wk->pdev);
> > +                   INIT_WORK(&mns_wk->serv_work, mana_serv_func);
> > +                   schedule_work(&mns_wk->serv_work);
> > +           }
> 
> This isn't a bug, but the recovery initiation sequence here is a
> near-duplicate of the pattern in mana_gd_process_eqe() (gdma_main.c).
> Both perform the identical 7-step sequence: try_module_get, kzalloc
> mana_serv_work, set pdev, set type, pci_dev_get, INIT_WORK, schedule_work.
> 
> The duplication already contributed to the GC_IN_SERVICE error path issue
> above, where this new copy diverged from the original in its error handling.
> 
> Consider extracting a common helper to avoid divergence.
> 
ACK.
> >             return;
> >     }
> >     schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
> 
> [ ... ]


Thanks Simon, I am addressing these in v2.

Regards

Re: [net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout

Reply via email to