> Subject: [PATCH v5] net/iavf: fix duplicate VF reset during PF reset recovery
>
> During PF initiated reset recovery, iavf_dev_close() sending
> an extra VIRTCHNL_OP_RESET_VF while recovery is already in progress.
> That second reset can leave PF/VF virtchnl state inconsistent and
> cause VIRTCHNL_OP_CONFIG_VSI_QUEUES to fail with ERR_PARAM after
> ToR link flap/power-cycle, leaving the VF unable to recover.
> This results in connection loss.
>
> This patch introduces a new flag 'pf_reset_in_progress', that is
> set only when iavf_handle_hw_reset() is entered with
> vf_initiated_reset as false and is cleared on exit.
> Also, close-time VF reset and related close-time virtchnl
> operations are skipped when PF triggered reset recovery is set.
> This is done to avoid a duplicate VF reset, and keep normal
> behavior for application-driven close or VF-initiated reinit.
>
> Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
> Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
> Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without
> interrupt")
> Cc: [email protected]
>
> Signed-off-by: Anurag Mandal <[email protected]>
Acked-by: Ciara Loftus <[email protected]>
I think you may need to respin due to patch application failure.
I have some suggestions for improving the comments/release notes
that you could include in the next version. Code looks good to me.
> ---
> V5: Addressed Ciara Loftus's comments
> - added separate flag for PF initiated reset recovery
> V4: Addressed Ciara Loftus's comments
> - split VF reset from other code changes
> V3: Addressed latest ai-code-review comments
> V2: Addressed ai-code-review comments
>
> doc/guides/rel_notes/release_26_07.rst | 3 ++
> drivers/net/intel/iavf/iavf.h | 7 +++++
> drivers/net/intel/iavf/iavf_ethdev.c | 40 +++++++++++++++-----------
> drivers/net/intel/iavf/iavf_vchnl.c | 18 ++++++++++--
> 4 files changed, 49 insertions(+), 19 deletions(-)
>
> diff --git a/doc/guides/rel_notes/release_26_07.rst
> b/doc/guides/rel_notes/release_26_07.rst
> index d2563ac503..f6899a78c3 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -95,6 +95,9 @@ New Features
>
> * Added support for transmitting LLDP packets based on mbuf packet type.
> * Implemented AVX2 context descriptor transmit paths.
> + * Prevented duplicate 'VIRTCHNL_OP_RESET_VF' during a PF-initiated
> + reset recovery, which earlier caused virtchnl state corruption
> + and connection loss after a top-of-rack (ToR) link flap/power-cycle.
I think something more concise here would be better.
eg. "Fixed duplicate send of 'VIRTCHNL_OP_RESET_VF' during PF reset
recovery which could cause virtchnl state corruption"
>
> * **Updated PCAP ethernet driver.**
>
> diff --git a/drivers/net/intel/iavf/iavf.h b/drivers/net/intel/iavf/iavf.h
> index 2615b6f034..67aacbe7a6 100644
> --- a/drivers/net/intel/iavf/iavf.h
> +++ b/drivers/net/intel/iavf/iavf.h
> @@ -292,6 +292,13 @@ struct iavf_info {
>
> bool in_reset_recovery;
>
> + /*
> + * Set only while iavf_handle_hw_reset()
> + * is processing a PF-initiated reset
> + * (vf_initiated_reset == false).
> + */
I don't think a comment is warranted here, the variable name is
self-explanatory.
> + bool pf_reset_in_progress;
> +
> uint32_t ptp_caps;
> rte_spinlock_t phc_time_aq_lock;
> };
> diff --git a/drivers/net/intel/iavf/iavf_ethdev.c
> b/drivers/net/intel/iavf/iavf_ethdev.c
> index a8031e23a5..2b6f4daa99 100644
> --- a/drivers/net/intel/iavf/iavf_ethdev.c
> +++ b/drivers/net/intel/iavf/iavf_ethdev.c
> @@ -3166,23 +3166,27 @@ iavf_dev_close(struct rte_eth_dev *dev)
>
> ret = iavf_dev_stop(dev);
>
> - /*
> - * Release redundant queue resource when close the dev
> - * so that other vfs can re-use the queues.
> - */
> - if (vf->lv_enabled) {
> - ret = iavf_request_queues(dev,
> IAVF_MAX_NUM_QUEUES_DFLT);
> - if (ret)
> - PMD_DRV_LOG(ERR, "Reset the num of queues
> failed");
> + /* Skip RESET_VF on a PF-initiated reset */
Regarding the comment above, here we're not skipping RESET_VF rather
preventing sending virtchnl messages to the adminq during the PF-initiated
reset. I suggest rewording the comment to reflect that.
> + if (!vf->pf_reset_in_progress) {
>
> - vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
> - }
> + /*
> + * Release redundant queue resource when close the dev
> + * so that other vfs can re-use the queues.
> + */
> + if (vf->lv_enabled) {
> + ret = iavf_request_queues(dev,
> IAVF_MAX_NUM_QUEUES_DFLT);
> + if (ret)
> + PMD_DRV_LOG(ERR, "Reset the num of
> queues failed");
> + vf->max_rss_qregion =