On 2 Jan 2024, at 23:10, Ilya Maximets wrote:
> On 12/21/23 09:20, Eelco Chaudron wrote: >> Add new coverage counters that might help debugging flow >> dump length related issues. >> >> Signed-off-by: Eelco Chaudron <[email protected]> >> --- >> ofproto/ofproto-dpif-upcall.c | 13 ++++++++++--- >> 1 file changed, 10 insertions(+), 3 deletions(-) >> >> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c >> index cd71e3ee3..85f78675a 100644 >> --- a/ofproto/ofproto-dpif-upcall.c >> +++ b/ofproto/ofproto-dpif-upcall.c >> @@ -52,6 +52,9 @@ >> >> VLOG_DEFINE_THIS_MODULE(ofproto_dpif_upcall); >> >> +COVERAGE_DEFINE(dump_duration_low); >> +COVERAGE_DEFINE(dump_duration_medium); >> +COVERAGE_DEFINE(dump_duration_high); >> COVERAGE_DEFINE(dumped_duplicate_flow); >> COVERAGE_DEFINE(dumped_inconsistent_flow); >> COVERAGE_DEFINE(dumped_new_flow); >> @@ -1039,11 +1042,15 @@ udpif_revalidator(void *arg) >> udpif->dump_duration = duration; >> if (duration > 2000) { >> flow_limit /= duration / 1000; >> + COVERAGE_INC(dump_duration_high); >> } else if (duration > 1300) { >> flow_limit = flow_limit * 3 / 4; >> - } else if (duration < 1000 && >> - flow_limit < n_flows * 1000 / duration) { >> - flow_limit += 1000; >> + COVERAGE_INC(dump_duration_medium); >> + } else if (duration < 1000) { >> + if (flow_limit < n_flows * 1000 / duration) { >> + flow_limit += 1000; >> + } >> + COVERAGE_INC(dump_duration_low); >> } > > Hi, Eelco. Thanks for the change! > Though I'm not sure we actually need all three of these. > > The main concern being a low counter. In my experience, counters that > just continuously go up create a lot of confusion and panic among users, > who do not fully understand what they mean. For example the memory > allocation counter; nobody really knows what is the good or bad value > or a grow rate for it, it just always goes up at a fairly high rate. Guess it’s time to add documentation around coverage counters. I can add an initial framework mentioning the existing ones, and add a clear description for the new ones :) Of course, stating that these are debug counters and are subject to change without notice ;) > And the 'dump_duration_low' counter name doesn't help much here, because > something being low can mean either a bad or a good thing. > E.g. low battery or something like that. In practice, I'm not sure if > there is a lot of value of having this counter as it just tells how many > revalidation cycles we had, which is not a very meaningful number either > since every cycle can be different and can take different amount of time. > I'd suggest not having this counter. > > For the high and medium, I'm also not sure why we need to distinguish > them. Both mean that something is not good, and both can't actually > tell what exactly is wrong. For the emphasis we have a warning in the > log now that will differentiate the high from the medium, if necessary. > > Maybe we can go with a single counter with a more understandable name > like 'upcall_flow_limit_reduced' ? Adding to a family of other flow > limit counters - upcall_flow_limit_hit and upcall_flow_limit_kill. > > What do you think? I guess my goal here was to use the counters to characterize the revalidation process in case of problems. With all three of these counters, we can see how often we have medium and long compared to the short ones. We also can differentiate between moderate and extreme flow reductions. This will give us a counter for the log message you refer to, which might have been lost by the time we need to investigate issues. Although my preference was to keep three individual coverage counters, with good documentation, and maybe rename them to be more self-explanatory, i.e., dump_duration_lt_1000, dump_duration_gt_2000, dump_duration_gt_1300 (_lt_2000). I realized there is the [1000..1300] range missing a count so getting the percentage will not work :( I still want to differentiate the bad from the really bad, as the log message might get lost. So what about the following diff: --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -60,7 +60,10 @@ COVERAGE_DEFINE(revalidate_missed_dp_flow); COVERAGE_DEFINE(ukey_dp_change); COVERAGE_DEFINE(ukey_invalid_stat_reset); COVERAGE_DEFINE(upcall_flow_limit_hit); +COVERAGE_DEFINE(upcall_flow_limit_increased_1k); COVERAGE_DEFINE(upcall_flow_limit_kill); +COVERAGE_DEFINE(upcall_flow_limit_reduced_div_1k); +COVERAGE_DEFINE(upcall_flow_limit_reduced_min_1qtr); COVERAGE_DEFINE(upcall_ukey_contention); COVERAGE_DEFINE(upcall_ukey_replace); @@ -1039,11 +1042,14 @@ udpif_revalidator(void *arg) udpif->dump_duration = duration; if (duration > 2000) { flow_limit /= duration / 1000; + COVERAGE_INC(upcall_flow_limit_reduced_div_1k); } else if (duration > 1300) { flow_limit = flow_limit * 3 / 4; + COVERAGE_INC(upcall_flow_limit_reduced_min_1qtr); } else if (duration < 1000 && flow_limit < n_flows * 1000 / duration) { flow_limit += 1000; + COVERAGE_INC(upcall_flow_limit_increased_1k); } flow_limit = MIN(ofproto_flow_limit, MAX(flow_limit, 1000)); atomic_store_relaxed(&udpif->flow_limit, flow_limit); //Eelco _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
