On Thu, 5 Sep 2024 09:24:58 +0000
Chengwen Feng <[email protected]> wrote:

> In the proactive error handling mode, the PMD will set the data path
> pointers to dummy functions and then try recovery. During this period
> the application may still be invoking the data path API. This introduces
> a race condition with the data path which may lead to a crash [1].
> 
> Although the PMD added a delay after setting the data path pointers to
> cover the above race condition, this only reduces the probability; it
> doesn't solve the problem.
> 
> To solve the race-condition problem fundamentally, the following
> requirements are added:
> 1. The PMD should set the data path pointers to dummy functions after
>    reporting the RTE_ETH_EVENT_ERR_RECOVERING event.
> 2. The application should stop data path API invocation when processing
>    the RTE_ETH_EVENT_ERR_RECOVERING event.
> 3. The PMD should set the data path pointers to valid functions before
>    reporting the RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> 4. The application should enable data path API invocation when processing
>    the RTE_ETH_EVENT_RECOVERY_SUCCESS event.
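
For illustration, the application side of requirements 2 and 4 could be
sketched roughly as below. This is plain C11 and does not use the ethdev
API: the demo_* names are hypothetical stand-ins for enum
rte_eth_event_type and for the application's registered event callback.
Note that a flag alone does not wait for a burst call that is already in
progress; a real application would also have to make sure its polling
lcores have left the data path before the handler returns.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the RTE_ETH_EVENT_* values; a real
 * application would use enum rte_eth_event_type from rte_ethdev.h. */
enum demo_event {
	DEMO_EVENT_ERR_RECOVERING,
	DEMO_EVENT_RECOVERY_SUCCESS,
	DEMO_EVENT_RECOVERY_FAILED,
};

/* True while the application is allowed to call the data path API. */
static atomic_bool demo_datapath_enabled = true;

/* Hypothetical application event handler. */
static void demo_event_handler(enum demo_event ev)
{
	switch (ev) {
	case DEMO_EVENT_ERR_RECOVERING:
		/* Requirement 2: stop data path API invocation. */
		atomic_store_explicit(&demo_datapath_enabled, false,
				      memory_order_release);
		break;
	case DEMO_EVENT_RECOVERY_SUCCESS:
		/* Restore any configuration the PMD did not keep,
		 * then re-enable the data path (requirement 4). */
		atomic_store_explicit(&demo_datapath_enabled, true,
				      memory_order_release);
		break;
	case DEMO_EVENT_RECOVERY_FAILED:
		/* Data path stays disabled. */
		break;
	}
}

/* One iteration of a polling loop; returns 1 if it was allowed to
 * touch the data path, 0 if it skipped because recovery is ongoing. */
static int demo_poll_once(void)
{
	if (!atomic_load_explicit(&demo_datapath_enabled,
				  memory_order_acquire))
		return 0;
	/* The rx/tx burst calls would go here. */
	return 1;
}
```

The polling loop checks the flag before every burst, so the cost on the
fast path is a single atomic load.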
> 
> Also, this patch introduces a driver internal function,
> rte_eth_fp_ops_setup, which is used as a helper function for PMDs.
> 
> [1] http://patchwork.dpdk.org/project/dpdk/patch/[email protected]/
> 
> Fixes: eb0d471a8941 ("ethdev: add proactive error handling mode")
> Cc: [email protected]

This is not material for a stable release, because of the impact on PMDs etc.

> 
> Signed-off-by: Chengwen Feng <[email protected]>
> Acked-by: Konstantin Ananyev <[email protected]>
> Acked-by: Huisong Li <[email protected]>

...

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 548fada1c7..0aec5588e5 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -4041,25 +4041,28 @@ enum rte_eth_event_type {
>        */
>       RTE_ETH_EVENT_RX_AVAIL_THRESH,
>       /** Port recovering from a hardware or firmware error.
> -      * If PMD supports proactive error recovery,
> -      * it should trigger this event to notify application
> -      * that it detected an error and the recovery is being started.
> -      * Upon receiving the event, the application should not invoke any control path API
> -      * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
> -      * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
> -      * The PMD will set the data path pointers to dummy functions,
> -      * and re-set the data path pointers to non-dummy functions
> -      * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> -      * It means that the application cannot send or receive any packets
> -      * during this period.
> +      *
> +      * If PMD supports proactive error recovery, it should trigger this
> +      * event to notify application that it detected an error and the
> +      * recovery is about to start.
> +      *
> +      * Upon receiving the event, the application should not invoke any
> +      * control and data path API until receiving
> +      * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> +      * event.
> +      *
> +      * Once this event is reported, the PMD will set the data path pointers
> +      * to dummy functions, and re-set the data path pointers to valid
> +      * functions before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> +      *

Please use the IETF RFC conventions for wording here.
Use "should" only when the behavior is optional. In these cases the word
"must" must be used.

        * If PMD supports proactive error recovery, it must trigger this
...


>        * @note Before the PMD reports the recovery result,
>        * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
>        * because a larger error may occur during the recovery.
>        */
>       RTE_ETH_EVENT_ERR_RECOVERING,
>       /** Port recovers successfully from the error.
> -      * The PMD already re-configured the port,
> -      * and the effect is the same as a restart operation.
> +      *
> +      * The PMD already re-configured the port:
>        * a) The following operation will be retained: (alphabetically)
>        *    - DCB configuration
>        *    - FEC configuration
> @@ -4086,6 +4089,9 @@ enum rte_eth_event_type {
>        *      (@see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP)
>        * c) Any other configuration will not be stored
>        *    and will need to be re-configured.
> +      *
> +      * The application should restore some additional configuration
> +      * (see above case b/c), and then enable data path API invocation.
>        */
>       RTE_ETH_EVENT_RECOVERY_SUCCESS,
>       /** Port recovery failed.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 1669055ca5..da592b63bc 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -346,6 +346,7 @@ INTERNAL {
>       rte_eth_devices;
>       rte_eth_dma_zone_free;
>       rte_eth_dma_zone_reserve;
> +     rte_eth_fp_ops_setup;
>       rte_eth_hairpin_queue_peer_bind;
>       rte_eth_hairpin_queue_peer_unbind;
>       rte_eth_hairpin_queue_peer_update;

My other concern is that changing fp_ops on a running port is not safe.
No part of eth_dev_fp_ops_setup() is atomic.
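
To illustrate the concern, here is a simplified model. It is not the
actual eth_dev_fp_ops_setup() code; the struct and the demo_* names are
hypothetical. The point is only that an entry holding a burst function
and its queue data is written with more than one store, so a reader
running between the stores can see a mismatched pair.

```c
#include <stddef.h>

typedef int (*demo_burst_t)(void *queue_data);

/* Hypothetical model of a per-port fast-path entry: a burst function
 * plus the queue data that function expects. */
struct demo_fp_ops {
	demo_burst_t rx_burst;
	void *rxq_data;
};

/* Dummy burst: ignores its argument, safe with NULL queue data. */
static int demo_dummy_burst(void *q)
{
	(void)q;
	return 0;
}

/* Real burst: dereferences the queue data, so NULL would crash. */
static int demo_real_burst(void *q)
{
	return *(int *)q;
}

/* The update split into its individual stores. Another thread may
 * read the entry after step 1 and before step 2. */
static void demo_setup_step1(struct demo_fp_ops *ops)
{
	ops->rx_burst = demo_real_burst;
}

static void demo_setup_step2(struct demo_fp_ops *ops, void *q)
{
	ops->rxq_data = q;
}

/* Returns 1 if the entry pairs the real burst function with no
 * queue data, i.e. the state a concurrent reader must never use. */
static int demo_is_torn(const struct demo_fp_ops *ops)
{
	return ops->rx_burst == demo_real_burst && ops->rxq_data == NULL;
}
```

Between the two steps demo_is_torn() returns 1; a data path thread that
loaded the entry at that moment would call the real burst function with
NULL queue data.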
