On Tue, Feb 24, 2026 at 08:26:41AM +0000, Simon Horman wrote: > This is an AI-generated review of your patch. The human sending this > email has considered the AI review valid, or at least plausible. > > For local reproduction steps see > https://netdev-ai.bots.linux.dev/ai-local.html > --- > net: mana: Trigger VF reset/recovery on health check failure due to HWC > timeout > > The GF stats periodic query is used as mechanism to monitor HWC health > check. If this HWC command times out, it is a strong indication that > the device/SoC is in a faulty state and requires recovery. > > Today, when a timeout is detected, the driver marks > hwc_timeout_occurred, clears cached stats, and stops rescheduling the > periodic work. However, the device itself is left in the same failing > state. > > Extend the timeout handling path to trigger the existing MANA VF > recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item. > This is expected to initiate the appropriate recovery flow by suspende > resume first and if it fails then trigger a bus rescan. > > This change is intentionally limited to HWC command timeouts and does > not trigger recovery for errors reported by the SoC as a normal command > response. > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c > > b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > index 0055c231acf6..16c438d2aaa3 100644 > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > [ ... ] > > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c > > b/drivers/net/ethernet/microsoft/mana/mana_en.c > > index 9b5a72ada5c4..e6a4034b40f9 100644 > > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c > > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c > > [ ... ] > > > @@ -3530,6 +3530,8 @@ static void mana_gf_stats_work_handler(struct > > work_struct *work) > > { > > struct mana_context *ac = > > container_of(to_delayed_work(work), struct mana_context, > > gf_stats_work); > > + struct gdma_context *gc = ac->gdma_dev->gdma_context; > > + struct mana_serv_work *mns_wk; > > int err; > > > > err = mana_query_gf_stats(ac); > > @@ -3537,6 +3539,30 @@ static void mana_gf_stats_work_handler(struct > > work_struct *work) > > /* HWC timeout detected - reset stats and stop rescheduling */ > > ac->hwc_timeout_occurred = true; > > memset(&ac->hc_stats, 0, sizeof(ac->hc_stats)); > > + dev_warn(gc->dev, > > + "Gf stats wk handler: gf stats query timed out.\n"); > > + > > + /* As HWC timed out, indicating a faulty HW state and needs a > > + * reset. > > + */ > > + if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) { > > + if (!try_module_get(THIS_MODULE)) { > > + dev_info(gc->dev, "Module is unloading\n"); > > + return; > > + } > > + > > + mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC); > > + if (!mns_wk) { > > + module_put(THIS_MODULE); > > + return; > > + } > > Does this leave the GC_IN_SERVICE flag permanently set? When > test_and_set_bit() atomically sets the bit and then try_module_get() > fails or kzalloc() fails, the function returns without clearing > GC_IN_SERVICE. > > This would permanently block all future recovery attempts. Looking at > mana_gd_process_eqe() in gdma_main.c: > ACK. > if (test_bit(GC_IN_SERVICE, &gc->flags)) { > dev_info(gc->dev, "Already in service\n"); > break; > } > > It would detect the set bit and skip the reset. Similarly, mana_tx_timeout() > in mana_en.c would also short-circuit: > > if (test_bit(GC_IN_SERVICE, &gc->flags)) > return; > > The device cannot recover without manual intervention since the flag is only > cleared by mana_serv_reset() at the end of the service work: > > drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_serv_reset() > { > ... > out: > clear_bit(GC_IN_SERVICE, &gc->flags); > } > > But since the work is never scheduled in these error paths, clear_bit() is > never called. > > Note that the existing mana_gd_process_eqe() path avoids this problem by > calling set_bit(GC_IN_SERVICE) only AFTER all allocations succeed: > > drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_gd_process_eqe() > { > ... > if (test_bit(GC_IN_SERVICE, &gc->flags)) { > dev_info(gc->dev, "Already in service\n"); > break; > } > > if (!try_module_get(THIS_MODULE)) { > ... > break; > } > > mns_wk = kzalloc(sizeof(*mns_wk), GFP_KERNEL); > if (!mns_wk) { > module_put(THIS_MODULE); > break; > } > > dev_info(gc->dev, "Start MANA service type:%d\n", type); > set_bit(GC_IN_SERVICE, &gc->flags); // Set only after success > ... > } > > Should clear_bit(GC_IN_SERVICE, &gc->flags) be added before each error > return? > ACK. > > + > > + mns_wk->pdev = to_pci_dev(gc->dev); > > + mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST; > > + pci_dev_get(mns_wk->pdev); > > + INIT_WORK(&mns_wk->serv_work, mana_serv_func); > > + schedule_work(&mns_wk->serv_work); > > + } > > This isn't a bug, but the recovery initiation sequence here is a > near-duplicate of the pattern in mana_gd_process_eqe() (gdma_main.c). > Both perform the identical 7-step sequence: try_module_get, kzalloc > mana_serv_work, set pdev, set type, pci_dev_get, INIT_WORK, schedule_work. > > The duplication already contributed to the GC_IN_SERVICE error path issue > above, where this new copy diverged from the original in its error handling. > > Consider extracting a common helper to avoid divergence. > ACK. > > return; > > } > > schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD); > > [ ... ]
Thanks Simon, I am addressing these in v2. Regards

