my favorite pinging application (in-house app, uses POE::Component::Client::Ping) intermittently reports a rash of missed replies. this happens at roughly the same time each night, and the condition lasts for several minutes

i sat up and watched it last night -- put my code into debug mode (so that it logs each hit & missed ping) and ran a sniffer

(1) i can see from debug output the rash of "no response" messages

(2) i can see from the packet trace that the box did not emit any ICMP Echoes across the window during which my code is complaining about "no response"

(3) this box runs a bunch of pinging apps ... i can see ICMP Destination Unreachable responses trickling back to another application (which uses fping) during the relevant window (a handful per second)


does anyone have pointers on how one might debug such a condition? i suspect that the root cause is OS-related rather than POE-related ... what condition would interfere with Linux's ability to emit pings?
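
a possible way to cross-check (2) without leaving a sniffer running: sample the kernel's ICMP counters in /proc/net/snmp every few seconds and log the deltas next to the application's debug output. what follows is only a rough sketch. the function names and the --watch switch are made up for illustration, and whether echoes sent through a raw socket (as opposed to ICMP generated by the kernel itself) are counted in OutEchos may depend on the kernel version, so that assumption should be checked against a normal period first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse the text of /proc/net/snmp and return a hash ref of the Icmp counters.
# The file holds two lines per protocol: a header line naming the fields,
# then a line with the matching values.
sub parse_icmp_counters {
    my ($text) = @_;
    # /^Icmp:/ deliberately does not match the "IcmpMsg:" lines of newer kernels
    my @rows = grep { /^Icmp:/ } split /\n/, $text;
    return {} unless @rows >= 2;
    my @names  = split ' ', $rows[0];
    my @values = split ' ', $rows[1];
    shift @names;
    shift @values;                      # drop the leading "Icmp:" label
    my %counters;
    @counters{@names} = @values;
    return \%counters;
}

# Read the live counters from the kernel.
sub read_icmp_counters {
    open my $fh, '<', '/proc/net/snmp'
        or die "cannot open /proc/net/snmp: $!";
    local $/;                           # slurp the whole file
    my $text = <$fh>;
    close $fh;
    return parse_icmp_counters($text);
}

# Print per-interval deltas for a few counters (hypothetical helper).
sub watch_icmp {
    my ($interval, $samples) = @_;
    my @fields = qw(OutEchos OutErrors InEchoReps InDestUnreachs);
    my $prev = read_icmp_counters();
    for (1 .. $samples) {
        sleep $interval;
        my $cur = read_icmp_counters();
        printf "%s  %s\n", scalar(localtime),
            join '  ', map {
                sprintf '%s=+%d', $_, ($cur->{$_} || 0) - ($prev->{$_} || 0)
            } @fields;
        $prev = $cur;
    }
}

# Only sample when asked to, e.g.: perl icmpwatch.pl --watch
watch_icmp(10, 90) if @ARGV && $ARGV[0] eq '--watch';
```

if the OutEchos delta tracks the application's pings during a normal period but drops to zero during the bad window, the echoes are not reaching the kernel's ICMP output path at all. if it keeps climbing while the sniffer sees nothing on the wire, the loss is further down the stack.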


Details:
-  perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14

-  CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux)

-  the application is pinging ~120 hosts every 30 seconds

-  the box gets busy during this time in terms of disk I/O: average latency of
   ~100ms, and the fifteen-minute rolling load average spikes to ~18.  this
   likely stems from several disk-intensive jobs which run during this period.
   the busyness lasts for ~5 hours

-  the box hosts a number of pinging applications (Nagios, a bunch of in-house
   apps), typically employing fping

-  i've been running the application for a handful of years now, a handful of
   instances, monitoring ~100 - 500 hosts per instance.  this is novel behavior


--sk

stuart kendrick
fred hutchinson cancer research center
seattle, wa usa
