On Thu, 5 Sep 2024 09:24:58 +0000 Chengwen Feng <[email protected]> wrote:
> In the proactive error handling mode, the PMD will set the data path
> pointers to dummy functions and then try recovery, in this period the
> application may still invoking data path API. This will introduce a
> race-condition with data path which may lead to crash [1].
>
> Although the PMD added delay after setting data path pointers to cover
> the above race-condition, it reduces the probability, but it doesn't
> solve the problem.
>
> To solve the race-condition problem fundamentally, the following
> requirements are added:
> 1. The PMD should set the data path pointers to dummy functions after
> report RTE_ETH_EVENT_ERR_RECOVERING event.
> 2. The application should stop data path API invocation when process
> the RTE_ETH_EVENT_ERR_RECOVERING event.
> 3. The PMD should set the data path pointers to valid functions before
> report RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> 4. The application should enable data path API invocation when process
> the RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>
> Also, this patch introduce a driver internal function
> rte_eth_fp_ops_setup which used as an help function for PMD.
>
> [1]
> http://patchwork.dpdk.org/project/dpdk/patch/[email protected]/
>
> Fixes: eb0d471a8941 ("ethdev: add proactive error handling mode")
> Cc: [email protected]

This is not material for stable release, because of the impact to PMD etc.

>
> Signed-off-by: Chengwen Feng <[email protected]>
> Acked-by: Konstantin Ananyev <[email protected]>
> Acked-by: Huisong Li <[email protected]>

...

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 548fada1c7..0aec5588e5 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -4041,25 +4041,28 @@ enum rte_eth_event_type {
> */
> RTE_ETH_EVENT_RX_AVAIL_THRESH,
> /** Port recovering from a hardware or firmware error.
> - * If PMD supports proactive error recovery,
> - * it should trigger this event to notify application
> - * that it detected an error and the recovery is being started.
> - * Upon receiving the event, the application should not invoke any control path API
> - * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
> - * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
> - * The PMD will set the data path pointers to dummy functions,
> - * and re-set the data path pointers to non-dummy functions
> - * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> - * It means that the application cannot send or receive any packets
> - * during this period.
> + *
> + * If PMD supports proactive error recovery, it should trigger this
> + * event to notify application that it detected an error and the
> + * recovery is about to start.
> + *
> + * Upon receiving the event, the application should not invoke any
> + * control and data path API until receiving
> + * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> + * event.
> + *
> + * Once this event is reported, the PMD will set the data path pointers
> + * to dummy functions, and re-set the data path pointers to valid
> + * functions before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> + *

Please use the IETF RFC conventions for wording here.
Use "should" only when it is optional. In these cases the word "must"
must be used.

 * If PMD supports proactive error recovery, it must trigger this ...

> * @note Before the PMD reports the recovery result,
> * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
> * because a larger error may occur during the recovery.
> */
> RTE_ETH_EVENT_ERR_RECOVERING,
> /** Port recovers successfully from the error.
> - * The PMD already re-configured the port,
> - * and the effect is the same as a restart operation.
> + *
> + * The PMD already re-configured the port:
> * a) The following operation will be retained: (alphabetically)
> * - DCB configuration
> * - FEC configuration
> @@ -4086,6 +4089,9 @@ enum rte_eth_event_type {
> * (@see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP)
> * c) Any other configuration will not be stored
> * and will need to be re-configured.
> + *
> + * The application should restore some additional configuration
> + * (see above case b/c), and then enable data path API invocation.
> */
> RTE_ETH_EVENT_RECOVERY_SUCCESS,
> /** Port recovery failed.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 1669055ca5..da592b63bc 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -346,6 +346,7 @@ INTERNAL {
> rte_eth_devices;
> rte_eth_dma_zone_free;
> rte_eth_dma_zone_reserve;
> + rte_eth_fp_ops_setup;
> rte_eth_hairpin_queue_peer_bind;
> rte_eth_hairpin_queue_peer_unbind;
> rte_eth_hairpin_queue_peer_update;

My other concern is that changing fp_ops on a running port is not safe.
No part of eth_dev_fp_ops_setup() is atomic.
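
For illustration only, requirements 2 and 4 from the commit message could
look roughly like the sketch below on the application side. It is plain C11
and deliberately does not use the ethdev API: port_event, port_gate and the
port_gate_* helpers are hypothetical names made up for this example, not
anything in the patch. As far as I can tell, a flag like this only asks the
workers to stop polling; it does not show that a worker has already left
the burst function, so some acknowledgement step would still be needed
before fp_ops is changed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the recovery events; not the real ethdev enum. */
enum port_event {
	PORT_EVENT_ERR_RECOVERING,
	PORT_EVENT_RECOVERY_SUCCESS,
	PORT_EVENT_RECOVERY_FAILED,
};

/* Hypothetical per-port application state shared with the worker threads. */
struct port_gate {
	atomic_bool datapath_enabled; /* written by event handler, read by workers */
};

static void
port_gate_init(struct port_gate *g)
{
	atomic_init(&g->datapath_enabled, true);
}

/* Assumed to be called from the application's event callback. */
static void
port_gate_on_event(struct port_gate *g, enum port_event ev)
{
	switch (ev) {
	case PORT_EVENT_ERR_RECOVERING:
	case PORT_EVENT_RECOVERY_FAILED:
		/* Requirement 2: stop data path API invocation. */
		atomic_store_explicit(&g->datapath_enabled, false,
				      memory_order_release);
		break;
	case PORT_EVENT_RECOVERY_SUCCESS:
		/* Requirement 4: restore configuration first, then re-enable. */
		atomic_store_explicit(&g->datapath_enabled, true,
				      memory_order_release);
		break;
	}
}

/* Worker-side check, done before every burst call on this port. */
static bool
port_gate_may_poll(struct port_gate *g)
{
	return atomic_load_explicit(&g->datapath_enabled,
				    memory_order_acquire);
}
```

A worker loop would call port_gate_may_poll() before each Rx/Tx burst and
skip the port while it returns false.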

