On 23 May 2024, at 12:46, Roi Dayan via dev wrote:
> It is observed in some environments that there are much more ukeys than > actual DP flows. For example: > > $ ovs-appctl upcall/show > system@ovs-system: > flows : (current 7) (avg 6) (max 117) (limit 2125) > offloaded flows : 525 > dump duration : 1063ms > ufid enabled : true > > 23: (keys 3612) > 24: (keys 3625) > 25: (keys 3485) > > The revalidator threads are busy revalidating the stale ukeys leading to > high CPU and long dump duration. > > There are some possible situations that may result in stale ukeys that > have no corresponding DP flows. > > In revalidator, push_dp_ops() doesn't check error if the op type is not > DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload > case, where the old flow is deleted first and then the new one is > created. If the creation fails, the ukey will be stale (no corresponding > DP flow). This patch adds a warning in such case. Not sure I understand, this behavior should be captured by the UKEY_INCONSISTENT state. > Another possible scenario is in handle_upcalls() if a PUT operation did > not succeed and op->error attribute was not set correctly it can lead to > stale ukey in operational state. Guess we need to figure out these cases, as these are the root cause of your problem. > > This patch adds checks in the sweep phase for such ukeys and move them > to DELETE so that they can be cleared eventually. > > Co-authored-by: Han Zhou <[email protected]> > Signed-off-by: Han Zhou <[email protected]> > Signed-off-by: Roi Dayan <[email protected]> > --- > ofproto/ofproto-dpif-upcall.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > index 83609ec62b63..e9520ebdf910 100644 > --- a/ofproto/ofproto-dpif-upcall.c > +++ b/ofproto/ofproto-dpif-upcall.c > @@ -57,6 +57,7 @@ COVERAGE_DEFINE(dumped_inconsistent_flow); > COVERAGE_DEFINE(dumped_new_flow); > COVERAGE_DEFINE(handler_duplicate_upcall); > COVERAGE_DEFINE(revalidate_missed_dp_flow); > +COVERAGE_DEFINE(revalidate_missed_dp_flow_del); > COVERAGE_DEFINE(ukey_dp_change); > COVERAGE_DEFINE(ukey_invalid_stat_reset); > COVERAGE_DEFINE(ukey_replace_contention); > @@ -278,6 +279,7 @@ enum flow_del_reason { > FDR_BAD_ODP_FIT, /* Bad ODP flow fit. */ > FDR_FLOW_IDLE, /* Flow idle timeout. */ > FDR_FLOW_LIMIT, /* Kill all flows condition reached. */ > + FDR_FLOW_STALE, /* Flow stale detected. */ Please also update the associated script, see above comment. > FDR_FLOW_WILDCARDED, /* Flow needs a narrower wildcard mask. */ > FDR_NO_OFPROTO, /* Bridge not found. */ > FDR_PURGE, /* User requested flow deletion. */ > @@ -2557,6 +2559,10 @@ push_dp_ops(struct udpif *udpif, struct ukey_op *ops, > size_t n_ops) The (op->dop.error) condition below will never be reached, as it’s explicitly checked above. So the error message will never happen. The condition you explain is already covered by UKEY_INCONSISTENT. Not sure if we need a log message for this either at warn level, maybe debug, especially as you say it can happen often. > if (op->dop.type != DPIF_OP_FLOW_DEL) { > /* Only deleted flows need their stats pushed. */ > + if (op->dop.error) { > + VLOG_WARN_RL(&rl, "push_dp_ops: error %d in op type %d, ukey > " > + "%p", op->dop.error, op->dop.type, op->ukey); > + } > continue; > } > > @@ -3027,6 +3033,15 @@ revalidator_sweep__(struct revalidator *revalidator, > bool purge) > del_reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL; > } else if (!seq_mismatch) { > result = UKEY_KEEP; > + } else if (!ukey->stats.used && > + udpif_flow_time_delta(udpif, ukey) * 1000 > > + ofproto_max_idle) { In theory, this is a missed dp flow, i.e. the same as the else {} case below. Maybe we should change revalidate_ukey() to have a test for this? Maybe something like this (not tested): diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c index e9520ebdf..c34978fc9 100644 --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -279,7 +279,7 @@ enum flow_del_reason { FDR_BAD_ODP_FIT, /* Bad ODP flow fit. */ FDR_FLOW_IDLE, /* Flow idle timeout. */ FDR_FLOW_LIMIT, /* Kill all flows condition reached. */ - FDR_FLOW_STALE, /* Flow stale detected. */ + FDR_FLOW_NO_STATS_IDLE, /* Flow idled out without receiving statistics. */ FDR_FLOW_WILDCARDED, /* Flow needs a narrower wildcard mask. */ FDR_NO_OFPROTO, /* Bridge not found. */ FDR_PURGE, /* User requested flow deletion. */ @@ -2452,7 +2452,14 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey, log_unexpected_stats_jump(ukey, stats); } - if (need_revalidate) { + if (!ukey->stats.used + && ukey->created < udpif->dpif->current_ms - ofproto_max_idle) { + /* If the flow has a 'used' value of 0, meaning no stats were received + * for this flow, and the configured idle time has elapsed, it might + * indicates a stale flow (i.e., a flow without an installed datapath + * rule). In this case, it is safe to mark this ukey for deletion. */ + *del_reason = FDR_FLOW_NO_STATS_IDLE; + } else if (need_revalidate) { if (should_revalidate(udpif, ukey, push.n_packets)) { if (!ukey->xcache) { ukey->xcache = xlate_cache_new(); Others, thoughts? > + COVERAGE_INC(revalidate_missed_dp_flow_del); > + VLOG_WARN_RL(&rl, "revalidator_sweep__: Remove stale > ukey " > + "%p delta %llus", ukey, > + udpif_flow_time_delta(udpif, ukey)); > + result = UKEY_DELETE; > + del_reason = FDR_FLOW_STALE; > } else { > struct dpif_flow_stats stats; > COVERAGE_INC(revalidate_missed_dp_flow); > -- > 2.21.0 > > _______________________________________________ > dev mailing list > [email protected] > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
