On 23 May 2024, at 12:46, Roi Dayan via dev wrote:

> It is observed in some environments that there are much more ukeys than
> actual DP flows. For example:
>
> $ ovs-appctl upcall/show
> system@ovs-system:
> flows : (current 7) (avg 6) (max 117) (limit 2125)
> offloaded flows : 525
> dump duration : 1063ms
> ufid enabled : true
>
> 23: (keys 3612)
> 24: (keys 3625)
> 25: (keys 3485)
>
> The revalidator threads are busy revalidating the stale ukeys leading to
> high CPU and long dump duration.
>
> There are some possible situations that may result in stale ukeys that
> have no corresponding DP flows.
>
> In revalidator, push_dp_ops() doesn't check error if the op type is not
> DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload
> case, where the old flow is deleted first and then the new one is
> created. If the creation fails, the ukey will be stale (no corresponding
> DP flow). This patch adds a warning in such case.

Not sure I understand, this behavior should be captured by the 
UKEY_INCONSISTENT state.

> Another possible scenario is in handle_upcalls() if a PUT operation did
> not succeed and op->error attribute was not set correctly it can lead to
> stale ukey in operational state.


Guess we need to figure out these cases, as these are the root cause of your 
problem.

>
> This patch adds checks in the sweep phase for such ukeys and move them
> to DELETE so that they can be cleared eventually.
>
> Co-authored-by: Han Zhou <[email protected]>
> Signed-off-by: Han Zhou <[email protected]>
> Signed-off-by: Roi Dayan <[email protected]>
> ---
>  ofproto/ofproto-dpif-upcall.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 83609ec62b63..e9520ebdf910 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -57,6 +57,7 @@ COVERAGE_DEFINE(dumped_inconsistent_flow);
>  COVERAGE_DEFINE(dumped_new_flow);
>  COVERAGE_DEFINE(handler_duplicate_upcall);
>  COVERAGE_DEFINE(revalidate_missed_dp_flow);
> +COVERAGE_DEFINE(revalidate_missed_dp_flow_del);
>  COVERAGE_DEFINE(ukey_dp_change);
>  COVERAGE_DEFINE(ukey_invalid_stat_reset);
>  COVERAGE_DEFINE(ukey_replace_contention);
> @@ -278,6 +279,7 @@ enum flow_del_reason {
>      FDR_BAD_ODP_FIT,        /* Bad ODP flow fit. */
>      FDR_FLOW_IDLE,          /* Flow idle timeout. */
>      FDR_FLOW_LIMIT,         /* Kill all flows condition reached. */
> +    FDR_FLOW_STALE,         /* Flow stale detected. */

Please also update the associated script, see above comment.

>      FDR_FLOW_WILDCARDED,    /* Flow needs a narrower wildcard mask. */
>      FDR_NO_OFPROTO,         /* Bridge not found. */
>      FDR_PURGE,              /* User requested flow deletion. */
> @@ -2557,6 +2559,10 @@ push_dp_ops(struct udpif *udpif, struct ukey_op *ops, 
> size_t n_ops)


The (op->dop.error) condition below will never be reached, as it’s explicitly 
checked above. So the error message will never happen. The condition you 
explain is already covered by UKEY_INCONSISTENT. Not sure if we need a log 
message for this either at warn level, maybe debug, especially as you say it 
can happen often.

>          if (op->dop.type != DPIF_OP_FLOW_DEL) {
>              /* Only deleted flows need their stats pushed. */
> +            if (op->dop.error) {
> +                VLOG_WARN_RL(&rl, "push_dp_ops: error %d in op type %d, ukey 
> "
> +                             "%p", op->dop.error, op->dop.type, op->ukey);
> +            }
>              continue;
>          }
>
> @@ -3027,6 +3033,15 @@ revalidator_sweep__(struct revalidator *revalidator, 
> bool purge)
>                      del_reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL;
>                  } else if (!seq_mismatch) {
>                      result = UKEY_KEEP;
> +                } else if (!ukey->stats.used &&
> +                           udpif_flow_time_delta(udpif, ukey) * 1000 >
> +                           ofproto_max_idle) {

In theory, this is a missed dp flow, i.e. the same as the else {} case below.
Maybe we should change revalidate_ukey() to have a test for this?


Maybe something like this (not tested):

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index e9520ebdf..c34978fc9 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -279,7 +279,7 @@ enum flow_del_reason {
     FDR_BAD_ODP_FIT,        /* Bad ODP flow fit. */
     FDR_FLOW_IDLE,          /* Flow idle timeout. */
     FDR_FLOW_LIMIT,         /* Kill all flows condition reached. */
-    FDR_FLOW_STALE,         /* Flow stale detected. */
+    FDR_FLOW_NO_STATS_IDLE, /* Flow idled out without receiving statistics. */
     FDR_FLOW_WILDCARDED,    /* Flow needs a narrower wildcard mask. */
     FDR_NO_OFPROTO,         /* Bridge not found. */
     FDR_PURGE,              /* User requested flow deletion. */
@@ -2452,7 +2452,14 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key 
*ukey,
         log_unexpected_stats_jump(ukey, stats);
     }

-    if (need_revalidate) {
+    if (!ukey->stats.used
+        && ukey->created < udpif->dpif->current_ms - ofproto_max_idle) {
+        /* If the flow has a 'used' value of 0, meaning no stats were received
+         * for this flow, and the configured idle time has elapsed, it might
+         * indicates a stale flow (i.e., a flow without an installed datapath
+         * rule). In this case, it is safe to mark this ukey for deletion. */
+        *del_reason = FDR_FLOW_NO_STATS_IDLE;
+    } else if (need_revalidate) {
         if (should_revalidate(udpif, ukey, push.n_packets)) {
             if (!ukey->xcache) {
                 ukey->xcache = xlate_cache_new();


Others, thoughts?

> +                    COVERAGE_INC(revalidate_missed_dp_flow_del);
> +                    VLOG_WARN_RL(&rl, "revalidator_sweep__: Remove stale 
> ukey "
> +                                 "%p delta %llus", ukey,
> +                                 udpif_flow_time_delta(udpif, ukey));
> +                    result = UKEY_DELETE;
> +                    del_reason = FDR_FLOW_STALE;
>                  } else {
>                      struct dpif_flow_stats stats;
>                      COVERAGE_INC(revalidate_missed_dp_flow);
> -- 
> 2.21.0
>
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to