On Apr 15, 2009, at 07:59, Stuart Kendrick wrote:
my favorite pinging application (in-house app, uses
POE::Component::Client::Ping) intermittently reports a rash of
missed replies. happens roughly the same time each night; the
condition lasts for several minutes
Time-based failures implicate periodic jobs. What's cron doing around
that time?
Details:
- perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14
- CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux)
- the application is pinging ~120 hosts every 30 seconds
Does the application ping them all in a burst? This could cause lost
packets. As I mentioned, ICMP isn't reliable, and congestion on the
network or interface could cause silent packet drops.
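
If the pings do go out in a burst, one option is to spread them evenly across the polling interval. A minimal sketch of the arithmetic, in Python rather than Perl for brevity; the function name is hypothetical and this is not how POE::Component::Client::Ping schedules anything:

```python
def staggered_offsets(host_count, interval_seconds):
    """Return per-host send offsets (seconds) spread evenly over one interval."""
    if host_count <= 0:
        return []
    spacing = interval_seconds / host_count
    return [i * spacing for i in range(host_count)]


# Figures from the report: ~120 hosts polled every 30 seconds.
offsets = staggered_offsets(120, 30.0)
print("spacing between consecutive pings:", offsets[1] - offsets[0])
```

With those figures each host gets its own slot a quarter of a second after the previous one, instead of 120 requests hitting the interface at once.
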
- the box gets busy during this time, in terms of disk I/O (average
  latency of ~100ms ... fifteen minute rolling load average spikes to
  ~18: likely stems from several disk-intensive jobs which run during
  this period). this busyness lasts for ~5 hours
Heavy load could be causing your monitor program to lag, triggering
timeouts---possibly before the ICMP requests are sent. Lowering the
pinger's nice value slightly (which raises its scheduling priority)
may help, but you may just be overloading the system. See what the
machine needs to handle the work you've given it. Consider
distributing the work across multiple machines.
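
One way to tell monitor lag apart from real packet loss is to record how late a periodic timer fires on the loaded box. The following is a rough sketch in Python, not part of the pinger; `measure_timer_lag` and its parameters are hypothetical names. Lag that approaches the ping timeout during the nightly window would point at the host rather than the network.

```python
import time


def measure_timer_lag(period_seconds, ticks, sleep=time.sleep, clock=time.monotonic):
    """Sleep in fixed periods and record how late each wakeup is, in seconds."""
    lags = []
    next_due = clock() + period_seconds
    for _ in range(ticks):
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
        # How far past the intended wakeup time did we actually resume?
        lags.append(max(0.0, clock() - next_due))
        next_due += period_seconds
    return lags


# Short demonstration run; a real check would run for the whole busy period.
print("worst lag (s):", max(measure_timer_lag(0.01, 3)))
```

The `sleep` and `clock` arguments exist only so the function can be exercised with a fake clock; in normal use the defaults apply.
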
- the box hosts a number of pinging applications (Nagios, a bunch of
  in-house apps), typically employing fping
Probably not an issue, unless they stream an awful lot of requests.
Load spikes may push the pinging activity together, which could result
in lost packets. For example, 120 pings over 30 seconds is fine, but
a lag spike might delay them. The lagged ones will try to catch up in
a burst when the lag spike ends---or when the OS says your pingers
have been CPU starved long enough.
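
A scheduler can avoid the catch-up burst by skipping cycles it has already missed instead of replaying them. This is a sketch of one possible policy, written in Python with hypothetical names; it is an assumption for illustration, not a description of what POE or the in-house application actually does.

```python
import math


def next_send_after(last_slot, now, interval):
    """Return the first slot on the original schedule strictly after `now`.

    Slots that passed while the process was stalled are skipped, so a
    delayed pinger resumes its normal cadence instead of firing every
    missed request at once.
    """
    if now < last_slot:
        return last_slot + interval
    missed = math.floor((now - last_slot) / interval)
    return last_slot + (missed + 1) * interval


# A ping scheduled for t=0 that is finally handled at t=95 with a 30 s
# interval resumes at the next regular slot rather than sending the
# three overdue pings back to back.
print(next_send_after(0.0, 95.0, 30.0))
```

The trade-off is that skipped cycles produce gaps in the data, which may be preferable to a burst that causes drops and false alarms.
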
- i've been running the application for a handful of years now, a
  handful of instances, monitoring ~100 - 500 hosts per instance. this
  is novel behavior
Novel behavior suggests the question "What has changed?" The variable
could be something new and significant or something that has gradually
approached a capacity threshold. Some things to think about: Has the
number of hosts to ping increased by a lot recently or steadily over
time? Has the disk-intensive jobs' workload increased by a lot
recently or steadily over time?
--
Rocco Caputo - [email protected]