On 09/12/09 06:06 PM, Jonathan Call wrote:
> I recently added two new slaves to a distributed Nagios system. The
> central server now passively processes 17,000+ service checks on 3000+
> servers.
>
> It's been over an hour and a half since I brought those new slaves
> online and I have about 150 hosts still stuck in 'Pending' and about
> 1300 services in the same state. In addition to that it seems that the
> service check results from the other slaves that were working normally
> are now arbitrarily disappearing. For example, on one host three of the
> service checks have been updated relatively recently (i.e. 5-30 minutes
> ago) but three other service checks haven't been updated for almost an
> hour. The slaves all appear operational and the hosts are being checked
> on time. Is it possible I've overwhelmed Nagios' ability to process data
> from the NSCA daemon or struck some internal Nagios bottleneck? Any
> suggestions would be appreciated.

Hmm, very interesting. Which Nagios version are you using?

This sounds a lot like a problem I encountered a few years ago with
passive checks. I had about 50-60 servers returning cron-scheduled check
results to the Nagios server. 120 results ain't that much, but it seemed
that with all the servers fully time-synced (using NTP), some of these
~120 results often went missing, which would eventually cause false
alarms due to stale services.

I could easily reproduce the problem by feeding lots of results to
Nagios right when I was expecting a batch of passive results; this would
cause random results to be dropped. I spent some time trying to debug
this, but I couldn't figure out where the commands were dropped. My
primary suspect was the ring buffer used by the command reaper. As far
as I can remember I tested with versions of Nagios ranging from 2.3 to
2.5; I never tried with a recent version.

If you're running a recent version of Nagios, what do you get for
"Used/High/Total Command Buffers" in the "nagiostats" command output?
(You can also get these numbers from the web interface, "Performance
Info" in the left bar.) If it seems to be maxed out, you may try setting
"command_check_interval" to "-1" and raising the
"external_command_buffer_slots" option in nagios.cfg.

If you're still having this problem with Nagios v3 and up, I might try
to reproduce it as well, and maybe I'll be able to figure out what's
wrong this time.

--
Thomas

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
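
The command-buffer check suggested in the reply above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the "Used/High/Total Command Buffers" line in the nagiostats text output has the form "label: used / high / total", and the helper names (parse_command_buffers, buffers_maxed_out) and the sample line are hypothetical, not taken from a real server.

```python
import re

# Pattern for the "Used/High/Total Command Buffers" line. The exact spacing
# is an assumption, so whitespace around the numbers and slashes is optional.
BUFFER_LINE = re.compile(
    r"Used/High/Total Command Buffers:\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)"
)


def parse_command_buffers(stats_text):
    """Return (used, high, total) from nagiostats text, or None if absent."""
    match = BUFFER_LINE.search(stats_text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def buffers_maxed_out(stats_text):
    """True when the high-water mark has reached the total slot count."""
    counts = parse_command_buffers(stats_text)
    if counts is None:
        return False
    _used, high, total = counts
    return total > 0 and high >= total


# Illustrative sample only (hypothetical values, not real output).
SAMPLE = "Used/High/Total Command Buffers:   12 / 4096 / 4096"

if __name__ == "__main__":
    print(parse_command_buffers(SAMPLE))
    print(buffers_maxed_out(SAMPLE))
```

To use it against a live system, the text printed by the nagiostats command would be passed in as stats_text. If the high-water mark equals the total, that points toward raising "external_command_buffer_slots" as described above.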