I've got an interesting problem with a particular setup. I'm monitoring a number of servers that the main Nagios installation doesn't have direct network access to, so I pass all of the host and service checks through an NRPE installation that can communicate with both Nagios and the servers being monitored. With a little tweaking of check timeouts and whatnot, this setup works pretty nicely.

I've run into a problem where, for some reason, the NRPE server periodically stops responding to NRPE requests; they fail with "Connection refused". I haven't gotten to the bottom of that yet. Service checks handle the problem fine, as the NRPE outage is much shorter than the time it takes for the services to go into a hard critical state. The problem is that once the first service check fails and goes into a soft critical state, it triggers the host checks, which also fail (host checks go through NRPE as well) and immediately generate a notification. I'd like to find a way to make the host checks a little more forgiving as well.
A few things I've thought of or tried:

1. I tried bumping up the host check retries to 30, but since the checks immediately fail with "connection refused", Nagios runs through all 30 tries within just a few seconds. I also worry about this putting unneeded load on the Nagios server, since it generally causes check_nrpe to be run 30 times for each of the ~20 servers in this setup.
2. Extending the timeout on the check_nrpe commands doesn't help, because "connection refused" is returned immediately.
3. Switching to a passive setup is probably the way to go, but for now I'm trying to avoid all the reconfiguration needed to move in that direction.

Ideally, what I'd like to be able to do is have the host checks retry on a particular interval (e.g. once per second) rather than instantly after the previous one has executed. Is there a way to do this?

Incidentally, while typing up this email I was actually able to find the root problem with the NRPE setup. NRPE was being called via Xinetd, which wasn't configured to allow enough simultaneous connections for a single service. Thus, when it started getting hammered with NRPE requests as a result of the host check configuration, it would stop allowing NRPE connections for 30 seconds. A quick change to the Xinetd config file seems to have solved the problem.

I'm still interested to know how anyone handles the situation where a host may be unresponsive to host checks for a while, yet you only wish to fire off a notification after a specific period of time. Would a wrapper around the host check be the only way to handle it?

Andrew
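P.S. To make the wrapper idea concrete, here is a rough sketch in Python of what I have in mind: run the real check, and if it fails, pause for a fixed delay before trying again instead of retrying instantly. This is only an illustration, not something I'm running. The script name retry_check.py and the functions run_with_retries and main are made up for this example, and it assumes the wrapped plugin follows the usual plugin convention of exit code 0 for OK and 3 for UNKNOWN.

```python
#!/usr/bin/env python3
"""Hypothetical retry_check.py: rerun a Nagios plugin with a pause between tries."""
import subprocess
import sys
import time

OK = 0       # conventional plugin exit code for OK / host UP
UNKNOWN = 3  # conventional plugin exit code for UNKNOWN


def run_with_retries(command, attempts=5, delay=1.0, sleep=time.sleep):
    """Run `command` until it exits 0 or `attempts` runs have been used.

    Returns (exit_code, output) from the last attempt. `sleep` is injectable
    so the pause can be replaced when testing.
    """
    code, output = UNKNOWN, "UNKNOWN: no attempts made"
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
            )
            code, output = result.returncode, result.stdout.strip()
        except OSError as exc:
            # The plugin itself could not be started (bad path, permissions).
            code, output = UNKNOWN, "UNKNOWN: cannot run plugin: %s" % exc
        if code == OK:
            break
        if attempt < attempts:
            sleep(delay)  # wait before the next try instead of retrying at once
    return code, output


def main(argv):
    """argv: [ATTEMPTS, DELAY_SECONDS, PLUGIN, PLUGIN_ARGS...]"""
    if len(argv) < 3:
        print("UNKNOWN: usage: retry_check.py ATTEMPTS DELAY PLUGIN [ARGS...]")
        return UNKNOWN
    code, output = run_with_retries(argv[2:], int(argv[0]), float(argv[1]))
    print(output)
    return code

# When installed as a script, end the file with: sys.exit(main(sys.argv[1:]))
```

The host check command would then call the wrapper with the real check_nrpe command line as its trailing arguments, and the host's retry count in Nagios could go back down to 1. One obvious catch: the attempts multiplied by the delay (plus the time each check takes) would have to stay under whatever timeout Nagios enforces on host checks, or Nagios would give up on the wrapper before it finishes.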