my favorite pinging application (in-house app, uses
POE::Component::Client::Ping) intermittently reports a rash of missed replies.
this happens at roughly the same time each night; the condition lasts for
several minutes.
i sat up and watched it last night -- put my code into debug mode (so that it
logs each hit & missed ping) and ran a sniffer:
(1) i can see from debug output the rash of "no response" messages
(2) i can see from the packet trace that the box did not emit any ICMP Echoes
across the window during which my code is complaining about "no response"
(3) this box runs a bunch of pinging apps ... i can see ICMP Destination
Unreachable responses trickling back to another application (which uses fping)
during the relevant window (a handful per second)
does anyone have pointers on how one might debug such a condition? i suspect that the
root cause is OS-related, rather than POE-related ... what condition would
interfere with Linux's ability to emit pings?
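
one way to narrow this down, sketched here in Python purely for illustration
(the application itself is Perl): snapshot the kernel's ICMP counters in
/proc/net/snmp before and after the bad window and diff them. the helper names
below (parse_icmp_counters, counter_delta, watch) are hypothetical, and whether
echoes sent through a raw socket bump OutEchos / OutMsgs on this particular
kernel is an assumption to verify, not something established in this post.

```python
import time


def parse_icmp_counters(snmp_text):
    """Return the 'Icmp:' counters from /proc/net/snmp text as a dict of ints.

    The file holds a header line and a value line per protocol, both
    prefixed with the protocol name ('Icmp:' here).
    """
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Icmp:")]
    if len(rows) < 2:
        raise ValueError("no Icmp: header/value pair found")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def counter_delta(before, after):
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


def watch(path="/proc/net/snmp", interval=10.0):
    """Take two snapshots 'interval' seconds apart and return the delta."""
    with open(path) as fh:
        before = parse_icmp_counters(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_icmp_counters(fh.read())
    return counter_delta(before, after)
```

calling watch() during the window and looking at the OutEchos, OutMsgs and
OutErrors entries would show whether the kernel thinks it sent (or failed to
send) anything. if the counters stay flat, the echoes may never have reached
the kernel at all.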
Details:
- perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14
- CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux)
- the application is pinging ~120 hosts every 30 seconds
- the box gets busy during this time, in terms of disk I/O (average latency of
~100ms; the fifteen-minute rolling load average spikes to ~18). this likely
stems from several disk-intensive jobs which run during this period. the
busyness lasts for ~5 hours
- the box hosts a number of pinging applications (Nagios, a bunch of in-house
apps), typically employing fping
- i've been running the application for a handful of years now, as a handful of
instances, each monitoring ~100 - 500 hosts. this is novel behavior
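
given the disk I/O load described above, one thing that might be worth ruling
out is the pinging process itself being starved of CPU time during the window.
that is only a guess, not something the trace shows. a minimal probe, again in
Python for illustration, with a hypothetical function name: sleep for a fixed
interval repeatedly and record how late each wakeup is.

```python
import time


def sleep_overshoot(interval=0.1, samples=20,
                    sleep=time.sleep, clock=time.monotonic):
    """Return, for each sample, how many seconds late the sleep woke up.

    'sleep' and 'clock' are injectable so the function can be exercised
    without real delays.
    """
    overshoots = []
    for _ in range(samples):
        start = clock()
        sleep(interval)
        # Elapsed time beyond the requested interval is scheduling delay.
        overshoots.append(clock() - start - interval)
    return overshoots
```

run alongside the application during the nightly window, large or growing
overshoot values would suggest that user-space processes are being delayed;
small values would point back toward something else.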
--sk
stuart kendrick
fred hutchinson cancer research center
seattle, wa usa