We run Prometheus deployed via prometheus-operator in a Kubernetes cluster. 
Among other things, a node_exporter pod is deployed on each node.

Once in a while, but frequently enough to be damaging, the node_exporter pod 
on one node starts using up huge amounts of bandwidth - ten times or so 
more than any other node_exporter pod in the cluster, accounting for most of 
the bandwidth (both in and out) of the node, even though more than 100 other 
pods are running on the same node. Deleting the pod doesn't help. The only 
thing that helps is draining the node and replacing it with a new one.

The node_exporter pod itself behaves nicely, even with that high network 
bandwidth. However, with the network on that node under heavy load, health 
checks start failing due to timeouts, and Kubernetes continuously restarts 
pods that are in fact healthy. The node never recovers.

In order to diagnose the problem, I'd need to determine what that huge 
amount of traffic in the node_exporter pod consists of. How can I do that? 
My go-to approach would be tcpdump, but that pod runs node_exporter as 
nobody, so any shell into the pod also runs as nobody, and that user can't 
perform any of the operations required to allow tcpdump to run as non-root. 
I'm not aware of any other method to monitor network traffic at the pod 
level.
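
For reference, the kind of thing I've been considering is capturing from 
the node itself, inside the pod's network namespace - but this is only a 
sketch, assuming I can get a root shell on the node, that the runtime is 
containerd with crictl available, and that the container name matches 
"node-exporter" (all of which may not hold for our setup):

  # On the affected node, as root: find the node_exporter container ID
  CID=$(crictl ps -q --name node-exporter | head -n1)
  # Get the container's PID on the host (JSON layout depends on the runtime)
  PID=$(crictl inspect "$CID" | jq -r '.info.pid')
  # Run tcpdump inside that container's network namespace
  nsenter -t "$PID" -n tcpdump -i any -nn -c 2000 -w /tmp/node_exporter.pcap

(If the pod uses hostNetwork, as node_exporter deployments often do, that 
namespace would just be the host's anyway.) I don't know whether this is a 
sensible approach, or whether there's a cleaner way to do it at the pod 
level.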
