On 1/11/24 09:16, Eelco Chaudron wrote: > > > On 11 Jan 2024, at 0:34, Eelco Chaudron wrote: > >> On 10 Jan 2024, at 20:34, Ilya Maximets wrote: >> >>> On 1/10/24 12:25, Eelco Chaudron wrote: >>>> Martin Kennelly observes that even though this data is available to >>>> humans via the journal/log files, these aren't exactly easy for a >>>> developer to make any kind of behavioral inferences. This kind of >>>> log and counter would be useful when checking on system health to >>>> let us know that an Open vSwitch component is noticing some kind of >>>> system level hiccup. >>>> >>>> Add a new coverage counter to track information on these events, and >>>> let a developer or system engineer know how long these events have >>>> occurred with some historical context. >>>> >>>> Reported-at: >>>> https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html >>>> Suggested-by: Martin Kennelly <[email protected]> >>>> Co-Authored-By: Aaron Conole <[email protected]> >>>> Signed-off-by: Aaron Conole <[email protected]> >>>> Signed-off-by: Eelco Chaudron <[email protected]> >>>> Acked-by: Simon Horman <[email protected]> >>>> --- >>>> v2: Updated commit message, based on similar patch upstream. >>>> >>>> lib/timeval.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/lib/timeval.c b/lib/timeval.c >>>> index 193c7bab1..0abe7e555 100644 >>>> --- a/lib/timeval.c >>>> +++ b/lib/timeval.c >>>> @@ -41,6 +41,8 @@ >>>> >>>> VLOG_DEFINE_THIS_MODULE(timeval); >>>> >>>> +COVERAGE_DEFINE(long_poll_interval); >>>> + >>>> #if !defined(HAVE_CLOCK_GETTIME) >>>> typedef unsigned int clockid_t; >>>> static int clock_gettime(clock_t id, struct timespec *ts); >>>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup) >>>> const struct rusage *last_rusage = get_recent_rusage(); >>>> struct rusage rusage; >>>> >>>> + COVERAGE_INC(long_poll_interval); >>>> + >>>> if (!getrusage_thread(&rusage)) { >>>> VLOG_WARN("Unreasonably long %lldms poll interval" >>>> " (%lldms user, %lldms system)", >>> >>> Potentially interesting side effect of this change would be that >>> coverage counters will likely be logged on every long poll interval, >>> since every long poll interval will change them. Before this change >>> many coverage dumps are skipped, because they are the same (see the >>> coverage_log() function). Did you notice such a behavior change? >> >> I did not notice this, as I only forced a single instance, but this is a >> good catch. I could exclude this new counter from the coverage_hash() >> function. However, there will probably be a lot of malloc ones anyway. What >> do you think? > > Ok, after adding some code to exclude this new event in the hash. I did not > see any difference in behavior. The dump will happen each time the total > values are updated (around every 5 seconds) due to the fact seq_change and > util_xalloc are constantly being increased. The only thing that stops this > from happening all the time in my setup is the rate-limiting in > coverage_hash(). > > So I guess this change does not change anything to the current behavior, so > adding the exclude in the hash makes no sense. Thoughts?
Yeah, I don't think exclusion is a good thing. We should just keep as is. I think, the main thing that prevents logging is also the fact that hashing doesn't hash the actual counters (I missed that yesterday), but the ranks of each counter, i.e. if the relative order of all the counters sorted by the total count didn't change, there will be no log. And I agree that xalloc counter is basically increased all the time anyway. Should be fine as-is. Thanks for checking! Acked-by: Ilya Maximets <[email protected]> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
