On 11 Jan 2024, at 0:34, Eelco Chaudron wrote:
> On 10 Jan 2024, at 20:34, Ilya Maximets wrote:
>
>> On 1/10/24 12:25, Eelco Chaudron wrote:
>>> Martin Kennelly observes that even though this data is available to
>>> humans via the journal/log files, it is not easy for a developer to
>>> make any kind of behavioral inference from them. This kind of log
>>> and counter is useful when checking on system health, letting us
>>> know that an Open vSwitch component is noticing some kind of
>>> system-level hiccup.
>>>
>>> Add a new coverage counter to track these events, and let a
>>> developer or system engineer know how often they have occurred,
>>> with some historical context.
>>>
>>> Reported-at:
>>> https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
>>> Suggested-by: Martin Kennelly <[email protected]>
>>> Co-Authored-By: Aaron Conole <[email protected]>
>>> Signed-off-by: Aaron Conole <[email protected]>
>>> Signed-off-by: Eelco Chaudron <[email protected]>
>>> Acked-by: Simon Horman <[email protected]>
>>> ---
>>> v2: Updated commit message, based on similar patch upstream.
>>>
>>> lib/timeval.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/lib/timeval.c b/lib/timeval.c
>>> index 193c7bab1..0abe7e555 100644
>>> --- a/lib/timeval.c
>>> +++ b/lib/timeval.c
>>> @@ -41,6 +41,8 @@
>>>
>>> VLOG_DEFINE_THIS_MODULE(timeval);
>>>
>>> +COVERAGE_DEFINE(long_poll_interval);
>>> +
>>> #if !defined(HAVE_CLOCK_GETTIME)
>>> typedef unsigned int clockid_t;
>>> static int clock_gettime(clock_t id, struct timespec *ts);
>>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>>> const struct rusage *last_rusage = get_recent_rusage();
>>> struct rusage rusage;
>>>
>>> + COVERAGE_INC(long_poll_interval);
>>> +
>>> if (!getrusage_thread(&rusage)) {
>>> VLOG_WARN("Unreasonably long %lldms poll interval"
>>> " (%lldms user, %lldms system)",
>>
>> Potentially interesting side effect of this change would be that
>> coverage counters will likely be logged on every long poll interval,
>> since every long poll interval will change them. Before this change
>> many coverage dumps are skipped, because they are the same (see the
>> coverage_log() function). Did you notice such a behavior change?
>
> I did not notice this, as I only forced a single instance, but this is a good
> catch. I could exclude this new counter from the coverage_hash() function.
> However, the malloc counters will probably keep changing the hash anyway.
> What do you think?
Ok, after adding some code to exclude this new event from the hash (roughly
sketched below), I did not see any difference in behavior. The dump happens
each time the total values are updated (around every 5 seconds), because
seq_change and util_xalloc are constantly increasing anyway. The only thing
that stops this from happening all the time in my setup is the rate-limiting
in coverage_hash().

So I guess this patch does not change the current behavior, and adding the
exclusion to the hash makes no sense. Thoughts?
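
To illustrate what I mean by excluding the counter from the hash, here is a
stand-alone sketch (the struct layout, hash_counters(), and the FNV hash are
made up for illustration only, not the actual lib/coverage.c code or the diff
I tested; only the counter names come from the real code). It just shows the
idea of leaving one named counter out of the duplicate-detection hash:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct counter {
    const char *name;          /* Counter name, e.g. "long_poll_interval". */
    unsigned long long total;  /* Running total of events. */
};

/* FNV-1a hash over all name/total pairs, skipping the counter named 'skip'
 * so that changes to it do not influence duplicate detection. */
static uint32_t
hash_counters(const struct counter *counters, size_t n, const char *skip)
{
    uint32_t hash = 2166136261u;
    size_t i;

    for (i = 0; i < n; i++) {
        const unsigned char *p;

        if (skip && !strcmp(counters[i].name, skip)) {
            continue;          /* Leave this counter out of the hash. */
        }
        for (p = (const unsigned char *) counters[i].name; *p; p++) {
            hash = (hash ^ *p) * 16777619u;
        }
        for (p = (const unsigned char *) &counters[i].total;
             p < (const unsigned char *) (&counters[i].total + 1); p++) {
            hash = (hash ^ *p) * 16777619u;
        }
    }
    return hash;
}

int
main(void)
{
    struct counter counters[] = {
        { "seq_change", 1024 },
        { "util_xalloc", 4096 },
        { "long_poll_interval", 3 },
    };
    size_t n = sizeof counters / sizeof counters[0];
    uint32_t before, after;

    before = hash_counters(counters, n, "long_poll_interval");
    counters[2].total++;                   /* Another long poll happened. */
    after = hash_counters(counters, n, "long_poll_interval");

    /* Same hash, so a dump where only this counter changed would be
     * considered a duplicate and skipped. */
    printf("hash unchanged: %s\n", before == after ? "yes" : "no");
    return 0;
}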
//Eelco