I don't know if it helps, but I check my services every 5 minutes. If a check fails, Nagios retries 3 times at 30-second intervals (so 1.5 minutes at most), then switches the service to a HARD state and sends me an e-mail/SMS.
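
As a rough sketch of the timing arithmetic (not anything Nagios ships; the function names are made up for illustration), the delay before a HARD state can be estimated as below. It assumes the first failed check counts as attempt 1 of max_check_attempts and that intervals are counted in units of interval_length, which defaults to 60 seconds.

```python
def minutes_to_hard(max_check_attempts, retry_interval, interval_length=60):
    """Minutes from the first failed check until the HARD state.

    Assumption: the first failed check is attempt 1, and every further
    attempt is scheduled retry_interval time units after the previous one.
    """
    retries = max(max_check_attempts - 1, 0)
    return retries * retry_interval * interval_length / 60


def worst_case_minutes(max_check_attempts, retry_interval, normal_interval,
                       interval_length=60):
    """Worst case from the start of the outage until the HARD state.

    Assumption: the service can go down right after a successful check,
    so up to one full normal interval passes before the first failed check.
    """
    wait_for_first_failure = normal_interval * interval_length / 60
    return wait_for_first_failure + minutes_to_hard(
        max_check_attempts, retry_interval, interval_length)


# Service definition quoted further down: max_check_attempts 5, retry 3
print(minutes_to_hard(5, 3))        # 12.0
# max_check_attempts 2, retry_check_interval 1
print(minutes_to_hard(2, 1))        # 1.0
# max_check_attempts 3, retry 5, normal 5, counted from the start of the outage
print(worst_case_minutes(3, 5, 5))  # 15.0
```

The 12-minute figure for the quoted service definition is consistent with the 180-second spacing of the SOFT alerts in the log excerpt further down.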
What this will do...

    max_check_attempts 3
    retry_check_interval 5
    normal_check_interval 5

...is check your service every 5 minutes. If it goes off-line, Nagios sets it to a SOFT fail, waits 5 minutes and checks again. If it still fails (2nd SOFT fail), it waits another 5 minutes and checks again, and if it fails a third time, you get a HARD fail. So in theory, if the service is down, you won't find out for up to 15 minutes (up to 5 minutes until the first failed check, plus two 5-minute retries).

What might be better is:

    max_check_attempts 3
    retry_check_interval 1
    normal_check_interval 5

This will check your service every 5 minutes. If a check fails, it re-checks at 1-minute intervals until it has made 3 attempts in total, so you'll get notified if it's still down 2 minutes after the first failed check.

What you can also do is set retry_check_interval to a seconds interval, like:

    max_check_attempts 3
    retry_check_interval 5s
    normal_check_interval 5

This tells Nagios to wait 5 seconds between checks while the service is in a non-OK state, and 5 minutes between regular checks. Intervals are normally counted in units of interval_length (60 seconds by default), so check that your Nagios version accepts the "s" suffix before relying on this.

You could of course also set "max_check_attempts" to 2 and "retry_check_interval" to 1. The first time it fails, it waits a minute then checks again, and if it still fails you get a notification, so in theory you only get a minute's lag.

Hope this random rambling works for you :)

Andy.

wnorth wrote:
> That is actually interesting, when the host goes down I see a HARD service
> alert as follows:
>
> HOST ALERT: ebro;DOWN;HARD;5;CRITICAL - Host Unreachable (10.0.33.8)
>
> But for the check_http I only see the following:
>
> SERVICE ALERT: ebro;Website App Server MS2;CRITICAL;SOFT;3;Connection refused
>
> Once I changed the retry interval to 1 and the max attempts to 1 I saw the
> email, so I just wasn't waiting long enough...makes sense. In theory I would
> want it to try 3 times in a row, if it fails send an email, then wait 5
> minutes and retry again.
>
> For that to work I tried the following:
>
>     max_check_attempts 3
>     retry_check_interval 5
>     normal_check_interval 5
>
> This should force it to try 3 times before setting a HARD alert and wait 5
> minutes between normal intervals, however that is not what it does, it
> appears it sets the retry_check_interval to 5 minutes between non-OK service
> alerts, so if I tell it to try 3 times, it will try 3 times and wait
> in-between tries for 5 minutes, if I set it to 2 on the retry it will wait 2
> minutes in between tries, which comes out to a total of 6 minutes. I'd
> rather it fail after a minute or so, so if I set it to 0 it will inherit a
> standard minute...the only way to solve this is to set it at a 1 minute
> interval and just wait.
>
> Sound about right?
>
> -----Original Message-----
> From: Josh Yost [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 05, 2007 3:56 PM
> To: [EMAIL PROTECTED]
> Cc: [email protected]
> Subject: Re: [Nagios-users] Service Alerts and Notifications
>
> Hi,
> This is kind of stupid/obvious, but
>
> a) I don't see a HARD service alert in your log snip for the service.
> Did it actually get to that state? Your retry interval is 3 min, so it
> would take you 15 min or so to get an alert.
>
> b) If it did get to HARD, what was the cmd it tried to run & is that a
> valid cmd?
>
> c) Did you kill all the old processes and restart Nagios w/ the new config?
>
> I don't see anything obvious in your cfgs that wouldn't be working.
>
> - Josh
>
> [EMAIL PROTECTED] wrote:
>> I have set up a few host and HTTP service checks and alerts. When a host
>> goes down I receive an email, but when the check_http service fails (e.g.
>> the TCP port is shut down on the web server) I see the service alert in the
>> nagios.log as follows:
>>
>> [1168038639] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;ebro;Website App Server MS2;1168038636
>> [1168038644] SERVICE ALERT: ebro;Website App Server MS2;CRITICAL;SOFT;1;Connection refused
>> [1168038824] SERVICE ALERT: ebro;Website App Server MS2;CRITICAL;SOFT;2;Connection refused
>> [1168039004] SERVICE ALERT: ebro;Website App Server MS2;CRITICAL;SOFT;3;Connection refused
>>
>> But I do not receive an email. The following service is defined:
>>
>> define service{
>>     host_name               ebro
>>     service_description     Website App Server MS2
>>     check_command           check_http_fitness_app
>>     max_check_attempts      5
>>     normal_check_interval   5
>>     retry_check_interval    3
>>     check_period            24x7
>>     contact_groups          jboss-admins
>>     notification_interval   30
>>     notification_period     24x7
>>     notification_options    w,u,c,r,f
>> }
>>
>> The following contact group is set up for jboss-admins:
>>
>> define contactgroup{
>>     contactgroup_name       jboss-admins
>>     alias                   JBoss Administrators
>>     members                 wnorth
>> }
>>
>> The following contact is set up for wnorth:
>>
>> define contact{
>>     contact_name                    wnorth
>>     alias                           Wes North
>>     service_notification_period     24x7
>>     host_notification_period        24x7
>>     service_notification_options    w,u,c,r,f
>>     host_notification_options       d,u,r,f
>>     service_notification_commands   notify-by-email
>>     host_notification_commands      host-notify-by-email
>>     email                           [EMAIL PROTECTED]
>> }
>>
>> If I bring a host offline I see the following alert in the nagios.log:
>>
>> [1168037707] HOST NOTIFICATION: wnorth;ebro;DOWN;host-notify-by-email;CRITICAL - Host Unreachable (10.0.33.8)
>> [1168037767] HOST ALERT: ebro;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.40 ms
>> [1168037767] HOST NOTIFICATION: wnorth;ebro;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 0.40 ms
>>
>> But if I bring a web service offline it fails to email me.
>> I don't know why, I have specified everything correctly. Any insight would
>> be much appreciated.
>>
>> -Wes

-- 
Andy Shellam
NetServe Support Team
the Mail Network
"an alternative in a standardised world"
p: +44 (0) 121 288 0832/0839
m: +44 (0) 7818 000834

_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
