Here's the situation: I'm running Nagios 3.2.0 with two services, which we'll call A and B. Both have event handlers such that if they register a hard critical state, Nagios attempts to restart them. Service B depends on service A, such that when service A goes down, service B does as well, so both need to be restarted, with A needing to be restarted first. I have a servicedependency set up in Nagios specifying service B's dependency on service A.
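
For reference, the dependency definition looks something like this (the host and service names here are placeholders rather than my actual config):

define servicedependency{
        host_name                       somehost
        service_description             Service A
        dependent_host_name             somehost
        dependent_service_description   Service B
        execution_failure_criteria      w,u,c
        notification_failure_criteria   w,u,c
        }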

My understanding is that this works as follows: when Nagios goes to check service B, it first looks at the "current" state of service A (as defined by the last Nagios check of A), and, if the execution_failure_criteria matches (i.e. if service A is down), Nagios does not run the check on service B, and thus does not run the event handler to attempt to restart B until A is back up. This is good. But what happens in the following scenario?

Service A is scheduled to be checked every 5 minutes.
1) Nagios does a normally scheduled check of service A, finding it to be OK.
2) One minute later, service A crashes.
3) One minute after that (three minutes before the next regular check of service A), thanks to Nagios staggering checks, Nagios does a normal check of service B.

Now, as I understand this scenario, the check on service B would run normally, since the last check of A was OK and Nagios uses cached results for dependency checks. Since service A is actually critical, service B will be critical as well. The problem is that Nagios will respond by attempting to restart service B, which will invariably fail since service A is still down. Once the next regular check time for service A is reached, Nagios will detect service A as down and restart it, but service B will never be restarted successfully, since Nagios has already tried and failed.
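
For what it's worth, I haven't touched the nagios.cfg directives that (as I understand it) govern how cached results and dependency checks interact, so they should still be at roughly the stock values:

# how old a cached result may be and still satisfy dependency logic (seconds)
cached_service_check_horizon=15
# whether Nagios proactively checks services involved in dependencies
enable_predictive_service_dependency_checks=1
# whether dependency logic considers soft states or only hard states
soft_state_dependencies=0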

Is this correct? If so, what can be done about it? Or is Nagios smart enough to schedule its service checks to avoid this scenario? It seems the most logical solution (if possible) would be to mirror the service/host check logic: when a check of service B comes back as critical, immediately check service A. If service A is critical, don't declare service B critical until service A is OK, at which point B would enter a hard down state and run the event handler. Alternatively, if I could say something like "always check service A immediately before checking service B" to make sure the data is current, that would work as well, although I could see it resulting in excessive checking of service A, which may be less desirable. What do you guys think?
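
To make the "check A first" idea concrete, here is roughly what I'm picturing for service B's event handler: have the handler itself run service A's check plugin and skip the restart if A is down, so we don't burn the one restart attempt while A is unavailable. The plugin path and init script are placeholders and this is untested; it also doesn't solve the problem of Nagios never re-running the handler later, it just illustrates the idea:

#!/bin/sh
# Hypothetical event handler for service B.
# Expects: $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$

CHECK_A="/usr/local/nagios/libexec/check_service_a"   # placeholder plugin for service A

case "$1" in
OK)
        # Service B just recovered; nothing to do.
        ;;
WARNING|UNKNOWN)
        # Don't try to restart on warnings/unknowns.
        ;;
CRITICAL)
        # Only act once B has reached a hard critical state.
        if [ "$2" = "HARD" ]; then
                # Check service A right now instead of trusting the cached result.
                if "$CHECK_A" > /dev/null 2>&1; then
                        /etc/init.d/service_b restart   # placeholder init script
                else
                        echo "Service A is down; not restarting service B"
                fi
        fi
        ;;
esac
exit 0
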
-----------------------------------------------
Israel Brewster
Computer Support Technician II
Frontier Flying Service Inc.
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7250 x293
-----------------------------------------------
