On 3 Jun 2024, at 10:07, Roi Dayan wrote:
> On 03/06/2024 10:18, Roi Dayan wrote: >> >> >> On 30/05/2024 18:48, Eelco Chaudron wrote: >>> >>> >>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote: >>> >>>> It is observed in some environments that there are much more ukeys than >>>> actual DP flows. For example: >>>> >>>> $ ovs-appctl upcall/show >>>> system@ovs-system: >>>> flows : (current 7) (avg 6) (max 117) (limit 2125) >>>> offloaded flows : 525 >>>> dump duration : 1063ms >>>> ufid enabled : true >>>> >>>> 23: (keys 3612) >>>> 24: (keys 3625) >>>> 25: (keys 3485) >>>> >>>> The revalidator threads are busy revalidating the stale ukeys leading to >>>> high CPU and long dump duration. >>>> >>>> There are some possible situations that may result in stale ukeys that >>>> have no corresponding DP flows. >>>> >>>> In revalidator, push_dp_ops() doesn't check error if the op type is not >>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload >>>> case, where the old flow is deleted first and then the new one is >>>> created. If the creation fails, the ukey will be stale (no corresponding >>>> DP flow). This patch adds a warning in such case. >>> >>> Not sure I understand, this behavior should be captured by the >>> UKEY_INCONSISTENT state. >> >> Hi Eelco, >> >> thanks for reviewing. >> We started the patch on older branch as we didn't rebase for some time >> and got a little later on sending it. >> I see the addition now for UKEY_INCOSISTENT in the following patch: >> >> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors. >> >> Looks like it handles the same situation we tried to handle in this patch. >> >>> >>>> Another possible scenario is in handle_upcalls() if a PUT operation did >>>> not succeed and op->error attribute was not set correctly it can lead to >>>> stale ukey in operational state. >>> >>> >>> Guess we need to figure out these cases, as these are the root cause of >>> your problem. >> >> right. maybe. but this could help keep the system alive/clean while logging >> the >> bad situation instead of having increase in those stale/inconsistent ukeys. >> I understand if it's not nice to handle it like this. >> > > Hi Eelco, > > I remember now one of the reproduction scenarios we did is do some traffic > to get rules added using TC and then delete those from tc command line > and it got to stale ukeys. > The revalidator dump thread not seeing the rules so not updating anything > and sweep over the ukeys skipping them as well. > This is why we checked against the timing stats of the ukey. > > I remember I tested on the upstream code and not our development branch > and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT > existed but it handles errors from adding flows so I dont think it matters. > > I'll try to check and verify again but I think it's still here. > So while cases getting dop.error handled with UKEY_INCONSISTENT, > this case I think is not. I think you are right this case is not covered by the UKEY_INCONSISTENT test below. See my suggestion on fixing this in revalidate_ukey(). Maybe you could also add a test case for this in the offload test suite. //Eelco >>>> This patch adds checks in the sweep phase for such ukeys and move them >>>> to DELETE so that they can be cleared eventually. >>>> >>>> Co-authored-by: Han Zhou <[email protected]> >>>> Signed-off-by: Han Zhou <[email protected]> >>>> Signed-off-by: Roi Dayan <[email protected]> >>>> --- >>>> ofproto/ofproto-dpif-upcall.c | 15 +++++++++++++++ >>>> 1 file changed, 15 insertions(+) >>>> >>>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c >>>> index 83609ec62b63..e9520ebdf910 100644 >>>> --- a/ofproto/ofproto-dpif-upcall.c >>>> +++ b/ofproto/ofproto-dpif-upcall.c >>>> @@ -57,6 +57,7 @@ COVERAGE_DEFINE(dumped_inconsistent_flow); >>>> COVERAGE_DEFINE(dumped_new_flow); >>>> COVERAGE_DEFINE(handler_duplicate_upcall); >>>> COVERAGE_DEFINE(revalidate_missed_dp_flow); >>>> +COVERAGE_DEFINE(revalidate_missed_dp_flow_del); >>>> COVERAGE_DEFINE(ukey_dp_change); >>>> COVERAGE_DEFINE(ukey_invalid_stat_reset); >>>> COVERAGE_DEFINE(ukey_replace_contention); >>>> @@ -278,6 +279,7 @@ enum flow_del_reason { >>>> FDR_BAD_ODP_FIT, /* Bad ODP flow fit. */ >>>> FDR_FLOW_IDLE, /* Flow idle timeout. */ >>>> FDR_FLOW_LIMIT, /* Kill all flows condition reached. */ >>>> + FDR_FLOW_STALE, /* Flow stale detected. */ >>> >>> Please also update the associated script, see above comment. >>> >>>> FDR_FLOW_WILDCARDED, /* Flow needs a narrower wildcard mask. */ >>>> FDR_NO_OFPROTO, /* Bridge not found. */ >>>> FDR_PURGE, /* User requested flow deletion. */ >>>> @@ -2557,6 +2559,10 @@ push_dp_ops(struct udpif *udpif, struct ukey_op >>>> *ops, size_t n_ops) >>> >>> >>> The (op->dop.error) condition below will never be reached, as it’s >>> explicitly checked above. So the error message will never happen. The >>> condition you explain is already covered by UKEY_INCONSISTENT. Not sure if >>> we need a log message for this either at warn level, maybe debug, >>> especially as you say it can happen often. >>> >> >> right. as replied above. I didn't notice this when I did the cherry-pick as >> git didn't >> show me a conflict in this area at all. >> This commit was done before the change so we did hit this spot. we got >> little off >> checking the commit and sending it on time for review. sorry about that. >> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors. >> >> I think we should redo our checks with the commit adding the >> UKEY_INCONSISTENT >> state and see our result again. >> >> thanks for the review. >> >> >>>> if (op->dop.type != DPIF_OP_FLOW_DEL) { >>>> /* Only deleted flows need their stats pushed. */ >>>> + if (op->dop.error) { >>>> + VLOG_WARN_RL(&rl, "push_dp_ops: error %d in op type %d, >>>> ukey " >>>> + "%p", op->dop.error, op->dop.type, op->ukey); >>>> + } >>>> continue; >>>> } >>>> >>>> @@ -3027,6 +3033,15 @@ revalidator_sweep__(struct revalidator >>>> *revalidator, bool purge) >>>> del_reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL; >>>> } else if (!seq_mismatch) { >>>> result = UKEY_KEEP; >>>> + } else if (!ukey->stats.used && >>>> + udpif_flow_time_delta(udpif, ukey) * 1000 > >>>> + ofproto_max_idle) { >>> >>> In theory, this is a missed dp flow, i.e. the same as the else {} case >>> below. >>> Maybe we should change revalidate_ukey() to have a test for this? >>> >>> >>> Maybe something like this (not tested): >>> >>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c >>> index e9520ebdf..c34978fc9 100644 >>> --- a/ofproto/ofproto-dpif-upcall.c >>> +++ b/ofproto/ofproto-dpif-upcall.c >>> @@ -279,7 +279,7 @@ enum flow_del_reason { >>> FDR_BAD_ODP_FIT, /* Bad ODP flow fit. */ >>> FDR_FLOW_IDLE, /* Flow idle timeout. */ >>> FDR_FLOW_LIMIT, /* Kill all flows condition reached. */ >>> - FDR_FLOW_STALE, /* Flow stale detected. */ >>> + FDR_FLOW_NO_STATS_IDLE, /* Flow idled out without receiving >>> statistics. */ >>> FDR_FLOW_WILDCARDED, /* Flow needs a narrower wildcard mask. */ >>> FDR_NO_OFPROTO, /* Bridge not found. */ >>> FDR_PURGE, /* User requested flow deletion. */ >>> @@ -2452,7 +2452,14 @@ revalidate_ukey(struct udpif *udpif, struct >>> udpif_key *ukey, >>> log_unexpected_stats_jump(ukey, stats); >>> } >>> >>> - if (need_revalidate) { >>> + if (!ukey->stats.used >>> + && ukey->created < udpif->dpif->current_ms - ofproto_max_idle) { >>> + /* If the flow has a 'used' value of 0, meaning no stats were >>> received >>> + * for this flow, and the configured idle time has elapsed, it >>> might >>> + * indicates a stale flow (i.e., a flow without an installed >>> datapath >>> + * rule). In this case, it is safe to mark this ukey for deletion. >>> */ >>> + *del_reason = FDR_FLOW_NO_STATS_IDLE; >>> + } else if (need_revalidate) { >>> if (should_revalidate(udpif, ukey, push.n_packets)) { >>> if (!ukey->xcache) { >>> ukey->xcache = xlate_cache_new(); >>> >>> >>> Others, thoughts? >>> >>>> + COVERAGE_INC(revalidate_missed_dp_flow_del); >>>> + VLOG_WARN_RL(&rl, "revalidator_sweep__: Remove stale >>>> ukey " >>>> + "%p delta %llus", ukey, >>>> + udpif_flow_time_delta(udpif, ukey)); >>>> + result = UKEY_DELETE; >>>> + del_reason = FDR_FLOW_STALE; >>>> } else { >>>> struct dpif_flow_stats stats; >>>> COVERAGE_INC(revalidate_missed_dp_flow); >>>> -- >>>> 2.21.0 >>>> >>>> _______________________________________________ >>>> dev mailing list >>>> [email protected] >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> >> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
