On Apr 15, 2009, at 07:59, Stuart Kendrick wrote:

my favorite pinging application (in-house app, uses POE::Component::Client::Ping) intermittently reports a rash of missed replies. happens roughly the same time each night; the condition lasts for several minutes

Time-based failures implicate periodic jobs. What's cron doing around that time?

Details:
-  perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14

-  CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux)

-  the application is pinging ~120 hosts every 30 seconds

Does the application ping them all in a burst? This could cause lost packets. As I mentioned, ICMP isn't reliable, and congestion on the network or interface could cause silent packet drops.
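If it does send them in a burst, spreading the requests across the interval is one way around it. Here is a rough sketch of the arithmetic, in Python rather than POE code to keep it short. The function name `stagger_offsets` is made up for illustration, and I'm assuming each host is pinged exactly once per interval.

```python
def stagger_offsets(host_count, interval_seconds):
    """Return evenly spaced start offsets (in seconds) within one interval.

    Hypothetical helper: each host gets its own offset, so the pings
    are spread across the interval instead of all leaving at once.
    """
    if host_count <= 0:
        return []
    gap = interval_seconds / host_count
    return [i * gap for i in range(host_count)]


# The figures from this thread: ~120 hosts every 30 seconds.
offsets = stagger_offsets(120, 30)
print(offsets[:4])
```

With those numbers the gap works out to 0.25 seconds, so the pinger would send about four requests per second rather than 120 at once.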

- the box gets busy during this time, in terms of disk I/O (average latency of ~100ms ... fifteen minute rolling load average spikes to ~18: likely stems from several disk-intensive jobs which run during this period). this busyness lasts for ~5 hours

Heavy load could be causing your monitor program to lag, triggering timeouts---possibly before the ICMP requests are sent. Reducing the pinger's niceness slightly may help, but you may just be overloading the system. See what the machine needs to handle the work you've given it. Consider distributing the work across multiple machines.

- the box hosts a number of pinging applications (Nagios, a bunch of in-house apps), typically employing fping

Probably not an issue, unless they stream an awful lot of requests. Load spikes may push the pinging activity together, which could result in lost packets. For example, 120 pings over 30 seconds is fine, but a lag spike might delay them. The lagged ones will try to catch up in a burst when the lag spike ends---or when the OS says your pingers have been CPU starved long enough.
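One possible way to avoid the catch-up burst is to drop the slots that were missed during the stall and resume on the original schedule. The sketch below is in Python and is only an illustration. It is not how POE::Component::Client::Ping or fping behave, and `next_slot` is a hypothetical name.

```python
import math


def next_slot(scheduled, now, interval):
    """Return the next send time on the original grid that is not in the past.

    Hypothetical helper: if the process was starved and 'now' is several
    intervals beyond 'scheduled', the missed slots are skipped rather
    than replayed all at once.
    """
    if now <= scheduled:
        return scheduled
    missed = math.ceil((now - scheduled) / interval)
    return scheduled + missed * interval


# A ping was due at t=0 on a 2-second grid, but the process
# did not get to run again until t=7.
print(next_slot(0, 7, 2))
```

The cost is that the skipped pings are never sent, so the monitor has to treat them as "not attempted" rather than "no reply".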

- i've been running the application for a handful of years now, a handful of instances, monitoring ~100 - 500 hosts per instance. this is novel behavior

Novel behavior suggests the question "What has changed?" The variable could be something new and significant or something that has gradually approached a capacity threshold. Some things to think about: Has the number of hosts to ping increased, either sharply in recent days or steadily over time? Has the disk-intensive jobs' workload increased, either sharply or steadily?

--
Rocco Caputo - [email protected]
