Let me get back to you on this in a few days - we have structured logs in 
Elasticsearch so it should be possible to answer all your questions, but 
I'll need to find a few hours to figure out how to use the tools.

Kernel is
Linux imgr1 5.3.0-62-generic #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 
2020 x86_64 x86_64 x86_64 GNU/Linux


On Tuesday, July 21, 2020 at 9:53:41 AM UTC+2 [email protected] wrote:

> On Mon, Jul 20, 2020 at 8:46 PM Bruce Merry <[email protected]> wrote:
>
>> On Monday, July 20, 2020 at 8:24:59 PM UTC+2 [email protected] wrote:
>>
>>> Thanks for posting this. I've been looking for more cases where people 
>>> see these issues.
>>>
>>> Linux does not do any mutex locking of the CPU metric counters. So we 
>>> added some to the node_exporter in order to detect and mitigate spurious 
>>> counter resets.
>>>
>>> In all of my testing and evidence, I've only seen this happen on iowait 
>>> data, so it's interesting and useful that you see it on other events. I 
>>> don't think your docker use has any impact. I'd suspect it has more to do 
>>> with the underlying server environment.
>>>
>>
>> Looking at the last week of logs, I see 41% for user, 31% for idle, 26% for 
>> system, 1.4% for iowait, 0.3% for softirq.
>>
>
> That's extremely interesting. I have only ever seen these for iowait, 
> usually when CPU use is moderate and there is a small background rate 
> of iowait.
>
> What is the typical CPU utilization for these nodes? Do you notice any 
> correlation between CPUs that jump backwards and the load on that CPU at 
> that time? In other words, when a CPU jumps backwards, is it under high or 
> low utilization?
>
>>> Is this bare metal? VMs? What hypervisor?
>>>
>>
>> It's bare metal. What's also odd is that it's restricted to one batch of 
>> machines that have the same hardware and run the same workloads. Other 
>> machines (with different hardware and workloads) don't exhibit these 
>> warnings despite having been deployed at the same time with the same kernel 
>> and OS. The affected machines have 12 cores and hyperthreading enabled (so 
>> 24 virtual cores) while the other machines generally have up to 8 cores and 
>> no HT, so possibly the affected machines have a higher chance of running 
>> into race conditions for the kernel data structures.
>>
>> Let me know if there are other details you'd like to investigate.
>>
>
> It would be interesting to get some more details on what kernel versions 
> are there. From our other investigations, I don't recall there being much 
> change in the kernel code here, but it might be helpful.
>
> When idle jumps backwards, how much does it jump back by? What are the 
> absolute values for the counter before and after the jump back? Right now 
> we reset the counters if idle jumps back any amount, assuming this happens 
> when the kernel hotplugs a CPU. But this was a very big assumption based on 
> some limited testing. We might want to change things to only reset 
> everything if there's a jump back of more than X%.
>
>> Cheers
>> Bruce
>>
>
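
For reference, the thresholded reset idea quoted above (reset everything only if idle jumps back by more than X%) could look roughly like the sketch below. This is not the node_exporter code. The function name `update`, the dict layout, and the 5% default are hypothetical, since the thread leaves X unspecified.

```python
from typing import Dict


def update(prev: Dict[str, float],
           new: Dict[str, float],
           threshold: float = 0.05) -> Dict[str, float]:
    """Return the per-mode CPU counters to export for one CPU.

    prev      -- counters exported on the previous scrape
    new       -- counters just read from the kernel
    threshold -- hypothetical fraction ("X%") by which idle must drop
                 before we treat the change as a genuine reset
    """
    idle_prev = prev.get("idle", 0.0)
    idle_new = new.get("idle", 0.0)

    # A large backwards jump in idle is assumed to mean the CPU was
    # hotplugged, so accept every new value as-is (a real counter reset).
    if idle_new < idle_prev * (1.0 - threshold):
        return dict(new)

    # Otherwise treat any backwards movement as a spurious glitch and
    # hold each counter at its previous value so it stays monotonic.
    return {mode: max(value, prev.get(mode, value))
            for mode, value in new.items()}


if __name__ == "__main__":
    before = {"idle": 1000.0, "iowait": 50.0, "user": 200.0}
    # Small glitch: idle and iowait dip slightly, user advances.
    print(update(before, {"idle": 999.5, "iowait": 49.9, "user": 201.0}))
    # Large drop in idle: treated as a reset.
    print(update(before, {"idle": 3.0, "iowait": 0.1, "user": 1.0}))
```

The absolute counter values before and after a jump, as asked about above, are what would be needed to choose a sensible threshold.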

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/579bc9c0-98bc-4b66-9082-2c9224968fb7n%40googlegroups.com.
