So is this just something I'll have to live with? I don't seem to be getting much feedback on the subject. :(

-----------------------------------------------
Israel Brewster
Computer Support Technician II
Frontier Flying Service Inc.
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7250 x293
-----------------------------------------------

On Apr 6, 2009, at 10:55 AM, Israel Brewster wrote:

> On Apr 6, 2009, at 9:03 AM, Giorgio Zarrelli wrote:
>
>> Hi,
>>
>> It's not quite clear to me what is happening to you,
>
> Thanks for the response. For clarification, the exact sequence of
> events is as follows:
>
> 1) The link between the Nagios box and one of our routers, which we
> will refer to as the parent host, glitches for 30 seconds or so. Due
> to the nature of the link (satellite connection) this is
> semi-expected, and happens a couple of times a day.
>
> 2) Nagios catches this glitch in one of its regularly scheduled host
> checks, and puts the parent host into a soft down state. Again, normal
> and expected - even good.
>
> 3) At the same time, Nagios puts the children of the parent host into
> an "unreachable" state. Makes sense, at least, but leads to the issue.
>
> 4) The parent host is now in recheck mode (as it is only in a soft
> down state and has three rechecks set), so it checks again a minute
> later. This check succeeds, as the outage was transitory. The parent
> host is put back into an "UP" state. As it never was in a hard "down"
> state, no notification is sent. This is good.
>
> 5) Since the parent is now up, the child host is now changed to a
> (soft, I think) "down" state.
>
> 6) Checks continue on a normal schedule. As the link does not glitch
> again for several hours, the parent remains up and the child remains
> (correctly) down. Three checks later, the child enters a hard "down"
> state (since it was unreachable and only just switched back to down).
> A down notification is sent for the child.
>
> 7) Everything remains good for the next several hours until the link
> glitches again. Repeat from step one.
>
> The notification in step 6 is the problem here - the child host was
> down before the glitch, the child host is still down after. But
> because the child host was temporarily put in an unreachable state, we
> get notified again that it is down, resulting in a string of "DOWN"
> messages with no up or real change in status.
>
>> but one thing I have in mind is try
>>
>> soft_state_dependencies=0
>>
>> Besides that, the problem seems to be in the roots of the check.
>> It's not healthy to have a ping check failing every 2 strikes. Try
>> to change the host alive check, using an ssh check instead.
>
> The check is not failing every 2 strikes. It's failing once, briefly,
> every few hours - just barely long enough to make one check fail and
> throw the parent host into a soft down state. The first recheck (one
> minute later) works fine, bringing the parent back to an up state. The
> next several hundred or more checks also work fine (as the problem was
> transitory and brief). For this reason, changing the check wouldn't
> help - for the duration of that single check, the host really is down
> (or more precisely, unreachable, as it is a link issue), and any check
> I used would say so.
>
>> Another approach, not so useful, would be to increase the timeout
>> for the ping (-W) so it will have less chance to fail.
>
> Except that it's not a timeout issue. It is a very real, albeit brief
> (around 30 seconds or so), outage. Not long enough or frequent enough
> to really impact productivity or anything, but long enough for Nagios
> to catch it (for a single check).
>
>> Giorgio
>>
>> Israel Brewster (isr...@frontierflying.com) wrote:
>>>
>>> So does anyone have any ideas as to how I can resolve this
>>> situation? It continues to be an annoyance. Thanks.
>>>
>>> On Mar 31, 2009, at 8:17 AM, Israel Brewster wrote:
>>>
>>>> On Mar 31, 2009, at 1:09 AM, Andreas Ericsson wrote:
>>>>
>>>>> Israel Brewster wrote:
>>>>>> Does Nagios (3.0.3) mark a child host as unreachable when its
>>>>>> parent enters a soft down state? I am finding myself getting
>>>>>> repeated down messages for a host (which is, in fact, down),
>>>>>> even though I have notifications set to only send a single
>>>>>> message. Looking at the logs, it would appear that what is
>>>>>> happening is that the host is flipping between "down" (which
>>>>>> notifies me) and "unreachable" (which does not). The parent
>>>>>> host, however, never enters a hard down state. Looking at the
>>>>>> logs, what I see is that one ICMP check fails, throwing the
>>>>>> host into a soft down state, but the next one works just fine,
>>>>>> bringing it back to an up state. The logic works fine for the
>>>>>> parent host - since it never hits a hard down state, it doesn't
>>>>>> alert, and everyone is happy. But apparently with the child
>>>>>> host every time this happens, it switches from critical to
>>>>>> unreachable and back again, triggering a notification. Is there
>>>>>> any way to keep this from happening? Thanks.
>>>>>
>>>>> Doesn't flapping detection do what you want? You'd get a few
>>>>> notifications, but they'd stop after the 3rd flip or something, I
>>>>> think.
>>>>
>>>> Flapping detection helps, but doesn't solve the problem. For one
>>>> thing, as you mentioned, you still get at least a couple of
>>>> notifications before it kicks in. For another thing, this happens
>>>> with a frequency of something like once an hour or so (not
>>>> consistently), so the host will flip from down to unreachable and
>>>> back again, triggering an e-mail, perhaps do it a second time, and
>>>> then it will sit in the correct "down" state for the next 50 checks
>>>> or so (thus canceling any flapping detection) before repeating the
>>>> process. It's not like I'm getting messages every five minutes or
>>>> anything, it's just that I'm getting repeated down messages every
>>>> hour or two for hosts that have been down and haven't actually
>>>> changed state.
>>>>
>>>> I could, of course, schedule down time, except that I want to be
>>>> notified if/when the people in the remote station get their act
>>>> together and get the machine(s) in question back online. Also, that
>>>> is only partially effective for machines that have been sent in for
>>>> repair, because I don't really know when the scheduled down time
>>>> will be over. They are down, I know they are down, I just don't
>>>> want to be told about it every few hours :-)
>>>>
>>>>> --
>>>>> Andreas Ericsson andreas.erics...@op5.se
>>>>> OP5 AB www.op5.se
>>>>> Tel: +46 8-230225 Fax: +46 8-230231
>>>>>
>>>>> Considering the successes of the wars on alcohol, poverty, drugs
>>>>> and terror, I think we should give some serious thought to
>>>>> declaring war on peace.
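
To make the sequence in steps 1 through 7 of the quoted message concrete, here is a minimal Python sketch of it. It is a toy model built only from the behaviour described in this thread, not Nagios's actual host-state logic. The names ChildHost, simulate and MAX_CHECK_ATTEMPTS are hypothetical, and the value 3 is assumed from the "three rechecks" mentioned in the thread. It also assumes that the child never answers a check, that the parent's status only decides whether the failure is labelled DOWN or UNREACHABLE, and that a notification goes out only when the child enters a hard DOWN state.

```python
from dataclasses import dataclass, field

MAX_CHECK_ATTEMPTS = 3  # assumed value, taken from the "three rechecks" in the thread


@dataclass
class ChildHost:
    """Toy model of a child host that is genuinely down the whole time."""
    state: str = "DOWN"
    hard: bool = True                 # starts in hard DOWN, already notified once
    attempt: int = MAX_CHECK_ATTEMPTS
    notifications: list = field(default_factory=list)

    def check(self, tick: int, parent_up: bool) -> None:
        # The child never answers; the parent's status only decides
        # whether the failure is classified DOWN or UNREACHABLE.
        new_state = "DOWN" if parent_up else "UNREACHABLE"
        if new_state != self.state:
            # A change of state restarts the soft-state attempt counter.
            self.state, self.attempt, self.hard = new_state, 1, False
        elif not self.hard:
            self.attempt += 1
        if not self.hard and self.attempt >= MAX_CHECK_ATTEMPTS:
            self.hard = True
            if self.state == "DOWN":  # UNREACHABLE notifications are switched off
                self.notifications.append(tick)


def simulate(parent_results):
    """Return the ticks at which a DOWN notification is sent for the child.

    parent_results is a sequence of booleans, one per check interval:
    True if the parent answered its check, False if it failed.
    """
    child = ChildHost()
    for tick, parent_up in enumerate(parent_results):
        child.check(tick, parent_up)
    return child.notifications
```

With a single failed parent check at tick 2, simulate([True, True, False, True, True, True, True]) reports a notification at tick 5, even though the child was down throughout. With no glitch at all, the model sends nothing.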

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null