>On 9/8/2023 1:59 AM, Mateusz Palczewski wrote:
>> In case rebuild fails trying to restart or unload a driver lead to
>> call trace.
>>
>> [ 128.876458] BUG: unable to handle kernel NULL pointer dereference at
>> 0000000000000000
>> [ 128.884417] PGD 33510d067 P4D 0
>> [ 128.887776] Oops: 0000 [#1] SMP NOPTI
>> [ 128.891571] CPU: 6 PID: 2141 Comm: ethtool Kdump: loaded Not tainted
>> 4.18.0-372.34.1.el8_6.ice.2117361.repr.x86_64 #1
>> [ 128.902308] Hardware name: ACCTON BRIGHTONCITY/BRIGHTONCITY, BIOS
>> IDVLCRB1.86B.0021.D40.2112020610 12/02/2021
>> [ 128.912351] RIP: 0010:ice_vsi_setup_tx_rings+0x26/0x80 [ice]
>> [ 128.918163] Code: 00 0f 1f 00 0f 1f 44 00 00 41 54 55 53 66 83 bf da
>> 03 00 00 00 48 89 fb 0f 84 c9 38 05 00 48 8b 47 28 41 bc 08 00 00 00 31 ed
>> <48> 8b 38 48 85 ff 75 21 eb 3d 0f b7 93 da 03 00 00 83 c5 01 39 ea
>> [ 128.937060] RSP: 0018:ff78a8b908183a10 EFLAGS: 00010246
>> [ 128.942417] RAX: 0000000000000000 RBX: ff4a4174c104e818 RCX:
>> 0000000000000a50
>> [ 128.949683] RDX: 0000000000000a4f RSI: 641269b8ad24144d RDI:
>> ff4a4174c104e818
>> [ 128.956946] RBP: 0000000000000000 R08: 0000000080000000 R09:
>> ff4a4173241b9b30
>> [ 128.964208] R10: ff4a4173241b9b30 R11: 0000000000000000 R12:
>> 0000000000000008
>> [ 128.971475] R13: ff4a417570650180 R14: 0000000000000001 R15:
>> 0000000000000001
>> [ 128.978741] FS: 00007f18a268e740(0000) GS:ff4a41823ff80000(0000)
>> knlGS:0000000000000000
>> [ 128.986961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 128.992837] CR2: 0000000000000000 CR3: 0000000109e8e003 CR4:
>> 0000000000771ee0
>> [ 129.000103] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 129.007368] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [ 129.014635] PKRU: 55555554
>> [ 129.017473] Call Trace:
>> [ 129.020054] ice_vsi_open+0x29/0x120 [ice]
>> [ 129.024292] ice_vsi_recfg_qs+0xaa/0x110 [ice]
>> [ 129.028874] ice_set_channels+0x18d/0x3d0 [ice]
>> [ 129.033549] ethnl_set_channels+0x3a1/0x410
>> [ 129.037865] genl_family_rcv_msg_doit.isra.17+0x113/0x150
>> [ 129.043394] genl_family_rcv_msg+0xb7/0x170
>> [ 129.047708] ? channels_fill_reply+0x1a0/0x1a0
>> [ 129.052282] genl_rcv_msg+0x47/0xa0
>> [ 129.055901] ? genl_family_rcv_msg+0x170/0x170
>>
>> Fix this by setting ICE_RESET_FAILED when rebuild fails and clearing
>> number of qs when rings are free.
>
>What's the situation leading to the call trace? Do you know why we are failing
>rebuild?
It is manually done by injecting an error condition during queues allocation by
using module parameter.
>
>> Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller
>> functions")
>> Reviewed-by: Michal Swiatkowski <[email protected]>
>> Signed-off-by: Mateusz Palczewski <[email protected]>
>> ---
>> v2: Add dev_err message about rebuild fail
>> ---
>> drivers/net/ethernet/intel/ice/ice_lib.c | 7 +++++++
>> drivers/net/ethernet/intel/ice/ice_main.c | 4 ++--
>> 2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c
>> b/drivers/net/ethernet/intel/ice/ice_lib.c
>> index 0054d7e64ec3..09190457113a 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
>> @@ -321,6 +321,9 @@ static void ice_vsi_free_arrays(struct ice_vsi
>> *vsi)
>>
>> dev = ice_pf_to_dev(pf);
>>
>> + vsi->num_txq = 0;
>> + vsi->num_rxq = 0;
>> +
>> bitmap_free(vsi->af_xdp_zc_qps);
>> vsi->af_xdp_zc_qps = NULL;
>> /* free the ring and vector containers */ @@ -3157,6 +3160,7 @@ int
>> ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
>> struct ice_coalesce_stored *coalesce;
>> int ret, prev_txq, prev_rxq;
>> int prev_num_q_vectors = 0;
>> + struct device *dev;
>> struct ice_pf *pf;
>>
>> if (!vsi)
>> @@ -3166,6 +3170,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
>> params.flags = vsi_flags;
>>
>> pf = vsi->back;
>> + dev = ice_pf_to_dev(pf);
>> if (WARN_ON(vsi->type == ICE_VSI_VF && !vsi->vf))
>> return -EINVAL;
>>
>> @@ -3206,6 +3211,8 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
>> ice_vsi_decfg(vsi);
>> err_vsi_cfg:
>> kfree(coalesce);
>> + set_bit(ICE_RESET_FAILED, pf->state);
>> + dev_err(dev, "Rebuild failed, unload and reload driver\n");
>
>How is this reproduced? How common is it? Forcing reload seems pretty
>disruptive.
As mentioned above it is done by manual error injection and was reported by a
customer.
>
>Also, no need for a local var; use ice_pf_to_dev() in the call.
Sure, will change that in v3.
>
>> return ret;
>> }
>>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c
>> b/drivers/net/ethernet/intel/ice/ice_main.c
>> index b40dfe6ae321..becf7f0fcd4c 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_main.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
>> @@ -4232,7 +4232,7 @@ static void ice_deinit_fdir(struct ice_pf *pf)
>> {
>> struct ice_vsi *vsi = ice_get_ctrl_vsi(pf);
>>
>> - if (!vsi)
>> + if (!vsi || test_bit(ICE_RESET_FAILED, pf->state))
>> return;
>>
>> ice_vsi_manage_fdir(vsi, false);
>> @@ -4746,7 +4746,7 @@ static void ice_deinit_pf_sw(struct ice_pf *pf)
>> {
>> struct ice_vsi *vsi = ice_get_main_vsi(pf);
>>
>> - if (!vsi)
>> + if (!vsi || test_bit(ICE_RESET_FAILED, pf->state))
>> return;
>>
>> ice_vsi_release(vsi);
>
Regards,
Mateusz
_______________________________________________
Intel-wired-lan mailing list
[email protected]
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan