We run Prometheus deployed via prometheus-operator in a Kubernetes cluster. 
Among other things, a node_exporter pod is deployed on each node.

Once in a while, but frequently enough to be damaging, the node_exporter pod 
on one node starts using up huge amounts of bandwidth - ten times or so 
more than any other node_exporter pod in the cluster, accounting for most of 
the bandwidth (both in and out) of the node, even though more than 100 other 
pods are running on the same node. Deleting the pod doesn't help. The only 
thing that helps is draining the node and replacing it with a new one.

The node_exporter pod itself behaves nicely, even with that high network 
bandwidth. However, with the network on that node under heavy load, health 
checks start failing due to timeouts, and Kubernetes continuously restarts 
pods that are in fact healthy. The node never recovers.

In order to diagnose the problem, I'd need to determine what that huge 
amount of traffic in the node_exporter pod consists of. How can I do that? 
My go-to approach would be tcpdump, but that pod runs node_exporter as 
nobody, so any shell into the pod also runs as nobody, and that user can't 
perform any of the operations required to allow tcpdump to run as non-root. 
I'm not aware of any other method to monitor network traffic at the pod 
level.
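
For reference, the kind of thing I've been considering is capturing from 
the node itself, inside the pod's network namespace - but this is only a 
sketch, assuming I can get a root shell on the node, that the runtime is 
containerd with crictl available, and that the container name matches 
"node-exporter" (all of which may not hold for our setup):

  # On the affected node, as root: find the node_exporter container ID
  CID=$(crictl ps -q --name node-exporter | head -n1)
  # Get the container's PID on the host (JSON layout depends on the runtime)
  PID=$(crictl inspect "$CID" | jq -r '.info.pid')
  # Run tcpdump inside that container's network namespace
  nsenter -t "$PID" -n tcpdump -i any -nn -c 2000 -w /tmp/node_exporter.pcap

(If the pod uses hostNetwork, as node_exporter deployments often do, that 
namespace would just be the host's anyway.) I don't know whether this is a 
sensible approach, or whether there's a cleaner way to do it at the pod 
level.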
