I had a similar problem to this. I only wanted to know if a not-so-important device had been down for an hour or more.
Here's what I ended up doing: I disabled the host check by having it call an "always-ok" check command that always returns 0. I then added a 'PING' service to the host with a max_check_attempts of 7 and a retry_check_interval of 10 minutes. The pitfall is that I no longer receive 'HOST DOWN' alerts for that host; I instead receive alerts for a failing 'PING' service.

-Aaron

Andrew Cruse wrote:
> I've got an interesting problem with a particular setup. I'm monitoring a
> number of servers that the main Nagios installation doesn't have direct
> network access to, so I pass all of the host and service checks through an
> NRPE installation that can communicate with both Nagios and the servers
> being monitored. A little tweaking with check timeouts and whatnot and this
> setup works pretty nicely. I've run into a problem where for some reason,
> the NRPE server periodically stops responding to NRPE requests. Haven't
> gotten to the bottom of that (Connection refused) yet. Service checks are
> able to handle the problem fine as the duration of the NRPE outage is much
> shorter than the time it takes for the services to go into a hard critical
> state. The problem is, once the first service check goes through and goes
> into a soft critical state, that triggers the host checks which also fail
> (host checks go through NRPE as well) and immediately generate a
> notification. I'd like to find a way to make the host checks a little more
> forgiving as well.
>
> A few things I've thought of or tried:
>
> 1. I tried bumping up the host check retries to 30, but since the checks
> immediately fail with "connection refused" it runs through all 30 tries
> within just a few seconds. I also worry about this leading to unneeded load
> on the Nagios server since this is generally going to cause check_nrpe to be
> run 30 times, for each of the ~20 servers in this setup.
>
> 2. Extending the timeout on the check_nrpe commands doesn't help because
> "connection refused" is returned immediately.
>
> 3. Switching to a passive setup is probably the way to go, but for now am
> trying to avoid all the reconfiguration needed to move in that direction.
>
>
> Ideally what I'd like to be able to do is have the host checks retry on a
> particular interval (i.e. once per second) rather than instantly after the
> previous executed. Is there a way to do this?
>
> Incidentally, while typing up this email I was actually able to find the
> root problem with the NRPE setup. NRPE was being called via Xinetd which
> wasn't configured to allow enough simultaneous connections for a single
> service. Thus when it started getting hammered with NRPE requests as a
> result of the host check configuration it would stop allowing NRPE
> connections for 30 seconds. A quick change to the Xinetd config file seems
> to have solved the problem.
>
> I'm still interested to know how anyone handles the situation where a host
> may be unresponsive to host checks for a period of time yet you only wish to
> fire off a notification after a specific period of time. Would a wrapper
> around the host check be the only way to handle it?
>
> Andrew
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> Don't miss this year's exciting event. There's still time to save $100.
> Use priority code J8TL2D2.
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> Nagios-users mailing list
> Nagios-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
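
P.S. For anyone who wants to try the workaround described at the top of this message, here is a rough sketch of what the object definitions might look like. Only the "always-ok" command, the 'PING' service, max_check_attempts of 7, and the 10-minute retry_check_interval come from my setup as described; everything else is an assumption for illustration: the host name and address are made up, the generic-host and generic-service templates and the check_ping command are assumed to exist as in the sample configs, the ping thresholds and the normal_check_interval are placeholders, and I'm assuming check_dummy from the standard plugins as one way to get a check that always returns 0.

```
; Hypothetical "always-ok" command: check_dummy exits with the state given
; as its first argument (0 = OK), so the host never goes DOWN.
define command{
        command_name    always-ok
        command_line    $USER1$/check_dummy 0 "Host check disabled"
        }

; Host name and address are made up for this example.
define host{
        use             generic-host      ; assumed template
        host_name       lowprio-device
        address         192.0.2.10
        check_command   always-ok
        }

; The PING service takes over the job of the host check.
define service{
        use                     generic-service   ; assumed template
        host_name               lowprio-device
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%   ; placeholder thresholds
        max_check_attempts      7
        normal_check_interval   10    ; assumption, not part of the original setup
        retry_check_interval    10    ; minutes, with the default interval_length of 60
        }
```

With these numbers the service goes into a hard state on the 7th consecutive failed check, i.e. after 6 retries spaced 10 minutes apart, so the notification should arrive roughly an hour after the first failure. As far as I know, newer Nagios releases spell the two interval directives check_interval and retry_interval, so adjust to whatever your version expects.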