Re: PoCo::Client::Ping intermittently fails to emit pings

Rocco Caputo Mon, 20 Apr 2009 12:36:28 -0700

On Apr 15, 2009, at 07:59, Stuart Kendrick wrote:

my favorite pinging application (in-house app, uses POE::Component::Client::Ping) intermittently reports a rash of missed replies. happens roughly the same time each night; the condition lasts for several minutes

Time-based failures implicate periodic jobs. What's cron doing around that time?

Details:
-  perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14

-  CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux)

-  the application is pinging ~120 hosts every 30 seconds

Does the application ping them all in a burst? This could cause lost packets. As I mentioned, ICMP isn't reliable, and congestion on the network or interface could cause silent packet drops.
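One way to avoid a burst is to spread the sends evenly across the polling interval. The following is a minimal sketch in Python (not Perl/POE, and not how POE::Component::Client::Ping schedules anything); the function name `stagger_offsets` and the host names are hypothetical, and only the figures of 120 hosts and 30 seconds come from the thread.

```python
def stagger_offsets(hosts, interval_seconds):
    """Return (host, offset) pairs that spread sends evenly over one interval."""
    if not hosts:
        return []
    # Gap between consecutive sends so the whole list fits in one interval.
    gap = interval_seconds / len(hosts)
    return [(host, index * gap) for index, host in enumerate(hosts)]


# Hypothetical host names; 120 hosts and a 30 second interval as in the thread.
hosts = [f"host{n}" for n in range(120)]
schedule = stagger_offsets(hosts, 30.0)

print(schedule[1])   # ('host1', 0.25)
print(schedule[-1])  # ('host119', 29.75)
```

With these numbers the pinger would emit one request every 0.25 seconds instead of 120 requests at once.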

- the box gets busy during this time, in terms of disk I/O (average latency of ~100ms ... fifteen minute rolling load average spikes to ~18: likely stems from several disk-intensive jobs which run during this period). this busyness lasts for ~5 hours

Heavy load could be causing your monitor program to lag, triggering timeouts---possibly before the ICMP requests are sent. Reducing the pinger's niceness slightly may help, but you may just be overloading the system. See what the machine needs to handle the work you've given it. Consider distributing the work across multiple machines.

- the box hosts a number of pinging applications (Nagios, a bunch of in-house apps), typically employing fping

Probably not an issue, unless they stream an awful lot of requests. Load spikes may push the pinging activity together, which could result in lost packets. For example, 120 pings over 30 seconds is fine, but a lag spike might delay them. The lagged ones will try to catch up in a burst when the lag spike ends---or when the OS says your pingers have been CPU starved long enough.
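A scheduler can avoid the catch-up burst by dropping the slots it missed and resuming on the original timing grid. This is a minimal sketch in Python of one such policy, under the assumption that skipping a delayed ping is acceptable for the monitor; `next_slot` is a hypothetical helper, not part of POE or POE::Component::Client::Ping.

```python
import math


def next_slot(scheduled, now, gap):
    """Return the next send time on the original grid, dropping missed slots."""
    if now <= scheduled:
        # Not late: send at the planned time.
        return scheduled
    # Count the slots that have already passed and skip all of them.
    missed = math.ceil((now - scheduled) / gap)
    return scheduled + missed * gap


# A send planned for t=10.0 is only reached at t=12.6 after a lag spike.
print(next_slot(10.0, 12.6, 0.25))  # 12.75
```

The trade-off is that some hosts go unpinged for a cycle; whether that beats a burst of late pings depends on how the monitor treats a skipped sample versus a lost reply.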

- i've been running the application for a handful of years now, a handful of instances, monitoring ~100 - 500 hosts per instance. this is novel behavior

Novel behavior suggests the question "What has changed?" The variable could be something new and significant or something that has gradually approached a capacity threshold. Some things to think about: Has the number of hosts to ping increased by a lot recently or steadily over time? Has the disk-intensive jobs' workload increased by a lot recently or steadily over time?


--
Rocco Caputo - [email protected]
