We run Prometheus deployed via prometheus-operator in a Kubernetes cluster. Among other things, a node_exporter pod is deployed on each node.
Once in a while, but frequently enough to be damaging, the node_exporter pod on one node starts using a huge amount of bandwidth - ten times or so more than any other node_exporter pod in the cluster, accounting for most of the bandwidth (both in and out) of the node, even when more than 100 other pods are running on that node. Deleting the pod doesn't help. The only thing that helps is draining the node and adding a new node in its place. The node_exporter pod itself behaves nicely even with that high network throughput. However, with the network on that node under heavy load, health checks start failing due to timeouts, and Kubernetes starts continuously restarting pods which are in fact healthy. The node never recovers.

In order to diagnose the problem, I'd need to be able to determine what that huge amount of traffic in the node_exporter pod consists of. How can I do that? My go-to approach would be tcpdump, but that pod runs node_exporter as nobody, so any shell into the pod also runs as nobody, and that user can't perform the operations required to run tcpdump as non-root. I'm not aware of any other method to monitor network traffic at the pod level.
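For reference, this is roughly the kind of capture I'd like to be able to run against that pod. The sketch below assumes the cluster supports kubectl debug with ephemeral containers; the "monitoring" namespace, the pod name, and the nicolaka/netshoot image are just placeholders (any image with tcpdump would do), and the second variant assumes a CRI runtime reachable with crictl on the node:

  # Attach a root ephemeral container that shares the pod's network
  # namespace and run tcpdump from there:
  $ kubectl -n monitoring debug -it node-exporter-xxxxx \
      --image=nicolaka/netshoot --target=node-exporter \
      -- tcpdump -i any -w /tmp/node-exporter.pcap

  # Or, from the node itself: find the container's PID and enter its
  # network namespace with nsenter before capturing:
  $ crictl ps --name node-exporter -q
  $ crictl inspect <container-id> | grep -i pid
  $ nsenter -t <pid> -n tcpdump -i any -w /tmp/node-exporter.pcap

If either of these (or something better) is the accepted way to do this for a pod that runs as nobody, I'd appreciate confirmation or corrections.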

