[prometheus-users] Re: Weird node_exporter network metrics behaviour - NIC problem?

'Brian Candler' via Prometheus Users Tue, 16 Jan 2024 06:20:29 -0800

I would suspect it's due to how the counters are incremented and how the new 
values are published.


Suppose in the NIC's API new counter values are published at some odd 
interval like every 0.9 seconds. Your 15 second scrape will sometimes see 
the results of 16 increments from the previous counter, and sometimes 17 
increments.

It's just a guess, but it's the sort of thing that can cause such artefacts.
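
To make the guess concrete, here is a rough Python sketch of the effect. It 
is not how Prometheus or node_exporter are implemented: the 0.9 second 
publication interval is the made-up figure from above, BYTES_PER_UPDATE is a 
hypothetical constant traffic level, and simple_irate/simple_rate are 
simplified stand-ins that ignore extrapolation and counter resets.

```python
# Toy model: a NIC publishes its byte counter every 0.9 s (assumed figure),
# while Prometheus scrapes every 15 s. Times are integer milliseconds to
# avoid floating-point rounding at the update boundaries.
UPDATE_MS = 900               # assumed counter publication interval
SCRAPE_MS = 15_000            # scrape interval from the thread
BYTES_PER_UPDATE = 1_000_000  # hypothetical: perfectly steady traffic


def counter_at(t_ms):
    """Counter value visible at t_ms: only completed updates are published."""
    return (t_ms // UPDATE_MS) * BYTES_PER_UPDATE


# (timestamp, value) pairs as a scraper would record them.
samples = [(t, counter_at(t)) for t in range(0, 20 * SCRAPE_MS, SCRAPE_MS)]


def simple_irate(window):
    """Per-second rate from the last two samples only (irate-like)."""
    (t0, v0), (t1, v1) = window[-2], window[-1]
    return (v1 - v0) / ((t1 - t0) / 1000)


def simple_rate(window):
    """Per-second rate from first to last sample in the window (rate-like)."""
    (t0, v0), (t1, v1) = window[0], window[-1]
    return (v1 - v0) / ((t1 - t0) / 1000)


# Number of counter updates captured between consecutive scrapes.
updates_per_scrape = [
    (b[1] - a[1]) // BYTES_PER_UPDATE for a, b in zip(samples, samples[1:])
]

# A [1m] range at a 15 s scrape interval holds 4 samples.
windows = [samples[i:i + 4] for i in range(len(samples) - 3)]
irates = [simple_irate(w) for w in windows]
rates = [simple_rate(w) for w in windows]

print("updates seen per scrape:", sorted(set(updates_per_scrape)))
print("irate-like min/max:", min(irates), max(irates))
print("rate-like  min/max:", min(rates), max(rates))
```

In this toy model the traffic is perfectly steady, yet the irate-like value 
flips between two levels, because each scrape captures either 16 or 17 
updates. The rate-like value stays flat, but only because a 45 second span 
happens to contain a whole number of 0.9 second updates; with other numbers 
it would be smoother than irate rather than perfectly constant.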

On Tuesday 16 January 2024 at 21:55:06 UTC+8 Dito Windyaksa wrote:

> You're right - it's related to our irate query. We tried switching to 
> rate() and it gives us a straight linear line during iperf tests.
>
> We've been using irate for years across dozens of servers, but we've only 
> noticed 'weird drops'/instability samples on this single server.
>
> We don't see any drops during iperf tests using irate query on other 
> servers.
>
> Any clues why? NIC related?
>
>
> On Monday, January 15, 2024 at 7:24:46 PM UTC+8 Bryan Boreham wrote:
>
>> I would recommend you stop using irate().
>> With 4 samples per minute, irate(...[1m]) discards half your 
>> information.  This can lead to artefacts.
>>
>> There is probably some instability in the underlying samples, which is 
>> worth investigating. 
>> An *instant* query like 
>> node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[10m] 
>> will give the real, un-sampled, counts.
>>
>>
>> On Monday 15 January 2024 at 01:02:59 UTC mor...@gmail.com wrote:
>>
>>> Yup - both are running under the same scrape interval (15s) and using 
>>> the same irate query:
>>> irate(node_network_transmit_bytes_total{instance="xxx:9100", 
>>> device="eno1"}[1m])*8
>>>
>>> It's an iperf test between the two machines, and no interval argument is 
>>> set (default zero).
>>>
>>> I wonder if it has something to do with how Broadcom reports network 
>>> stats to /proc/net/dev?
>>>
>>> On Monday, January 15, 2024 at 7:49:35 AM UTC+8 Alexander Wilke wrote:
>>>
>>>> Do you have the same scrape_interval for both machines?
>>>> Are you running irate on both queries, or "rate" on one and "irate" 
>>>> on the other?
>>>> Are the iperf intervals the same for both tests?
>>>>
>>>> Dito Windyaksa schrieb am Montag, 15. Januar 2024 um 00:02:26 UTC+1:
>>>>
>>>>> Hi,
>>>>>
>>>>> We're migrating to a new bare metal provider and noticed that the 
>>>>> network metrics don't add up.
>>>>>
>>>>> We conducted an iperf test between A and B, and noticed there are 
>>>>> "drops" on the new machine during an ongoing iperf test.
>>>>>
>>>>> We also did not see any bandwidth drops on either the iperf server or 
>>>>> client side.
>>>>>
>>>>> [image: Screenshot 2024-01-13 at 06.27.43.png]
>>>>>
>>>>> Both are running similar queries:
>>>>> irate(node_network_receive_bytes_total{instance="xxx", 
>>>>> device="eno1"}[1m])*8
>>>>>
>>>>> One thing is certain: the green-line machine is running an Intel 10G 
>>>>> NIC, while the blue-line machine is running a Broadcom 10G NIC.
>>>>>
>>>>> Any ideas?
>>>>> Dito
>>>>>
>>>>>
