On Wed, Jan 28, 2026 at 6:32 AM Tony Nguyen <[email protected]> wrote: > > > > On 12/24/2025 10:21 PM, Aaron Ma wrote: > > Fix IRDMA hardware initialization timeout (-110) after resume by > > separating VSI-dependent configuration from RDMA resource allocation, > > ensuring VSI is rebuilt before IRDMA accesses it. > > > > After resume from suspend, IRDMA hardware initialization fails: > > ice: IRDMA hardware initialization FAILED init_state=4 status=-110 > > > > Separate RDMA initialization into two phases: > > 1. ice_init_rdma() - Allocate resources only (no VSI/QoS access, no plug) > > 2. ice_rdma_finalize_setup() - Assign VSI/QoS info and plug device > > > > This allows: > > - ice_init_rdma() to stay in ice_resume() (mirrors ice_deinit_rdma() > > in ice_suspend() > > - VSI assignment deferred until after ice_vsi_rebuild() completes > > - QoS info updated after ice_dcb_rebuild() completes > > - Device plugged only when control queues, VSI, and DCB are all ready > Hi Aaron, > > Sorry for the late feedback, but I'm working on getting AI Review in > place and when I ran it against this path it flagged a couple of things... > > > Fixes: bc69ad74867db ("ice: avoid IRQ collision to fix init failure on ACPI > > S3 resume") > > Reviewed-by: Aleksandr Loktionov <[email protected]> > > Signed-off-by: Aaron Ma <[email protected]> > > --- > > V1 -> V2: no changes. > > V2 -> V3: > > - mirrors init_rdma in resume as Tony Nguyen suggested to fix > > the memleak and move ice_plug_aux_dev/ice_unplug_aux_dev out of > > init/deinit rdma. > > - ensure the correct VSI/QoS info is loaded after rebuild. > > > > drivers/net/ethernet/intel/ice/ice.h | 1 + > > drivers/net/ethernet/intel/ice/ice_idc.c | 41 +++++++++++++++++------ > > drivers/net/ethernet/intel/ice/ice_main.c | 7 +++- > > 3 files changed, 38 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h > > b/drivers/net/ethernet/intel/ice/ice.h > > index 147aaee192a79..6463c1fea7871 100644 > > --- a/drivers/net/ethernet/intel/ice/ice.h > > +++ b/drivers/net/ethernet/intel/ice/ice.h > > @@ -989,6 +989,7 @@ int ice_schedule_reset(struct ice_pf *pf, enum > > ice_reset_req reset); > > void ice_print_link_msg(struct ice_vsi *vsi, bool isup); > > int ice_plug_aux_dev(struct ice_pf *pf); > > void ice_unplug_aux_dev(struct ice_pf *pf); > > +void ice_rdma_finalize_setup(struct ice_pf *pf); > > int ice_init_rdma(struct ice_pf *pf); > > void ice_deinit_rdma(struct ice_pf *pf); > > bool ice_is_wol_supported(struct ice_hw *hw); > > diff --git a/drivers/net/ethernet/intel/ice/ice_idc.c > > b/drivers/net/ethernet/intel/ice/ice_idc.c > > index 420d45c2558b6..b6079a6cb7736 100644 > > --- a/drivers/net/ethernet/intel/ice/ice_idc.c > > +++ b/drivers/net/ethernet/intel/ice/ice_idc.c > > @@ -360,6 +360,35 @@ void ice_unplug_aux_dev(struct ice_pf *pf) > > auxiliary_device_uninit(adev); > > } > > > > +/** > > + * ice_rdma_finalize_setup - Complete RDMA setup after VSI is ready > > + * @pf: ptr to ice_pf > > + * > > + * Sets VSI-dependent information and plugs aux device. > > + * Must be called after ice_init_rdma(), ice_vsi_rebuild(), and > > + * ice_dcb_rebuild() complete. > > + */ > > +void ice_rdma_finalize_setup(struct ice_pf *pf) > > +{ > > + struct iidc_rdma_priv_dev_info *privd; > > + > > + if (!ice_is_rdma_ena(pf) || !pf->cdev_info) > > + return; > > + > > + privd = pf->cdev_info->iidc_priv; > > + if (!privd || !pf->vsi[0] || !pf->vsi[0]->netdev)
Aleksandr raised this change to: if (!privd || !pf->vsi || !pf->vsi[0] || !pf->vsi[0]->netdev) I will add this in v4. > > + return; > > + > > + /* Assign VSI info now that VSI is valid */ > > + privd->netdev = pf->vsi[0]->netdev; > > + privd->vport_id = pf->vsi[0]->vsi_num; > > + > > + /* Update QoS info after DCB has been rebuilt */ > > + ice_setup_dcb_qos_info(pf, &privd->qos_info); > > + > > + ice_plug_aux_dev(pf); > > " > Does making ice_rdma_finalize_setup() void lose error reporting? The > old ice_init_rdma() called ice_plug_aux_dev() and propagated errors to > callers with error cleanup. Now ice_plug_aux_dev() returns int but > ice_rdma_finalize_setup() is void, so the return value is ignored. > > If ice_plug_aux_dev() fails (kzalloc failure, auxiliary_device_init > failure, or auxiliary_device_add failure), the error is silently > ignored and RDMA functionality won't be available. The old code would > return an error that ice_load() and ice_resume() would log, helping > users understand why RDMA isn't working. > > Should ice_rdma_finalize_setup() return int and propagate the error, or > at least log a message when ice_plug_aux_dev() fails? > " > > I agree that logging a message would be helpful to inform the user. > Either here or propagating it and printing a differentiated messages > would be nice. Aleksandr also raised this. Since ice_rebuild() which called this function can't propagate errors, so I think logging a dev_warn() when ice_plug_aux_dev() fails is more appropriate. > > > +} > > + > > /** > > * ice_init_rdma - initializes PF for RDMA use > > * @pf: ptr to ice_pf > > @@ -398,23 +427,16 @@ int ice_init_rdma(struct ice_pf *pf) > > } > > > > cdev->iidc_priv = privd; > > - privd->netdev = pf->vsi[0]->netdev; > > > > privd->hw_addr = (u8 __iomem *)pf->hw.hw_addr; > > cdev->pdev = pf->pdev; > > - privd->vport_id = pf->vsi[0]->vsi_num; > > > > pf->cdev_info->rdma_protocol |= IIDC_RDMA_PROTOCOL_ROCEV2; > > - ice_setup_dcb_qos_info(pf, &privd->qos_info); > > - ret = ice_plug_aux_dev(pf); > > - if (ret) > > - goto err_plug_aux_dev; > > + > > return 0; > > > > -err_plug_aux_dev: > > - pf->cdev_info->adev = NULL; > > - xa_erase(&ice_aux_id, pf->aux_idx); > > err_alloc_xa: > > + xa_erase(&ice_aux_id, pf->aux_idx); > > " > Does this error path call xa_erase() with an uninitialized pf->aux_idx? > > When xa_alloc() fails at line 423, the code jumps to err_alloc_xa which > now calls xa_erase(). But if xa_alloc() failed, nothing was allocated > into the xarray, and pf->aux_idx contains an undefined value from the > failed xa_alloc() call. Calling xa_erase() with this undefined index > seems incorrect. > > In the original code, the err_plug_aux_dev label came after successful > xa_alloc(), so xa_erase() was only called when something was actually > allocated. The err_alloc_xa label came after err_plug_aux_dev and fell > through without calling xa_erase(). > " > Right, after xa_alloc failed, it should not call xa_rease, I will remove it. I will send v4 patch to fix 3 issues above. Thanks for the detail review. > Semi-related. I sent patch 1 of this series on since it seems like it > can be independent of this one so no need to carry that one with this. > Thank you very much. Aaron > Thanks, > Tony > > > kfree(privd); > > err_privd_alloc: > > kfree(cdev);
