> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:nagios-users-
> [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> Sent: Wednesday, February 13, 2008 10:48 AM
> To: nagios-users@lists.sourceforge.net
> Subject: [Nagios-users] How to explain active host checks to boss
>
>
> Background: Due to management requirements we are using NagiosQL as a
> configuration manager for our Nagios install. NagiosQL defaults to active
> checks enabled for hosts so this is how it's been done until now. We have
> the alerts coming as we want them. We are adding more hosts and services
> weekly. I know that active host checks are not a good thing to have going
> forward as they are unnecessary. Please advise on the best way to explain
> this to the boss who is, at this moment, convinced that if we turn off the
> option in the config file then the host will never be checked even if a
> service is down. I can't find a good place in the documentation to point
> this out and would like to get these turned off in the near future so we
> don't run into issues later on down the road. Any help in pointing me in
> the right direction would be appreciated. Here is a sample host cfg from
> our environment:
Assuming you're using 2.x: the main issue with host checks in 2.x and prior is that they are performed serially, not in parallel. While a host check is being run, nagios stops absolutely everything else (other host/service checks, notifications, etc.) until that single host check is complete.

To put this in perspective, assume that you have 100 hosts, each checked with 10 pings on a 15 minute check_interval with a max_check_attempts of 3. When every host is up, each host check will take approximately 10 seconds to complete, during which nagios isn't doing anything else except obsessing over that host:

100 hosts X 10 seconds = 1000 seconds

As you can see, you've already exceeded your normal check interval of 900 seconds. Nagios cannot complete the host checks in the time interval you've specified, and you haven't even done any service checks yet. Nagios will attempt to interleave service checks between host checks to compensate, but you've just introduced latency for both check types.

Now imagine that you have a simple outage: 5 hosts are down that aren't related via parenting. Each down host is tried max_check_attempts (3) times at about 10 seconds per attempt, so your timing now looks like:

(95 hosts X 10 seconds) + (5 hosts X 30 seconds) = 1100 seconds

dedicated to host checks only. Because the host checks aren't related, nagios is able to interleave some service checks between them, so the latency isn't as bad as it could be.

Take the calculation above and determine the effects of a large outage. Factor in parenting, where nagios will only be checking hosts up the tree without interleaving service checks, and you start seeing big problems at exactly the time your monitoring system is most critical and useful. You could easily end up in a situation where hosts and services aren't being checked for loooooooong intervals.

Nagios is smart. You don't need to schedule regular host checks, because nagios knows that if there is a problem with a service, it may be caused by an outage of the host or of a parent of the host.
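To double-check the timing arithmetic above, or to try other outage sizes, here is a rough Python model of serial 2.x host checking. It is only a sketch under the same assumptions as the example: host checks run strictly one after another, an UP host answers on its first attempt in about 10 seconds, a DOWN host burns through all max_check_attempts retries, and service-check interleaving and parent-tree walks are ignored. The helper name is hypothetical; it is not part of Nagios.

```python
# Rough model of Nagios 2.x serial host-check timing (illustrative only).
# Assumptions: checks run strictly one after another, an UP host answers on
# its first attempt, and a DOWN host is retried max_check_attempts times.
# Service-check interleaving and parent-tree walks are ignored.

# 15-minute check_interval with the default interval_length of 60 seconds.
CHECK_INTERVAL_SECONDS = 15 * 60


def serial_host_check_seconds(total_hosts: int, down_hosts: int,
                              seconds_per_attempt: int, max_check_attempts: int) -> int:
    """Approximate seconds spent on host checks alone (hypothetical helper)."""
    up_hosts = total_hosts - down_hosts
    up_time = up_hosts * seconds_per_attempt  # one attempt per UP host
    down_time = down_hosts * seconds_per_attempt * max_check_attempts  # every retry for DOWN hosts
    return up_time + down_time


if __name__ == "__main__":
    for down in (0, 5, 50):
        total = serial_host_check_seconds(100, down, seconds_per_attempt=10, max_check_attempts=3)
        verdict = "exceeds" if total > CHECK_INTERVAL_SECONDS else "fits within"
        print(f"{down} hosts down: {total} s of host checks, {verdict} the {CHECK_INTERVAL_SECONDS} s interval")
```

With 0 and 5 hosts down this reproduces the 1000 and 1100 seconds worked out above; the 50-down case comes to 2000 seconds, more than twice the 900 second interval, and that is before parenting makes things worse.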
Nagios will automagically run the host check_command any time there is a non-OK result from a service check, assuming only that active_checks_enabled is on for the host and a valid check_command is specified. If the host check returns a non-OK result, nagios will also follow the parents tree until it finds an OK parent or reaches the top of the tree. Even so, you want your host checks to finish as quickly as possible; a single ping with a max_check_attempts of 3 is usually sufficient to determine status.

Nagios 3 introduces parallel host check execution, and there are some benefits to running scheduled host checks there, for example caching results for possible use by the on-demand checks, or collecting host performance data for trending. But they aren't necessary.

Some documentation to help:

http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#host

"check_interval: NOTE: Do NOT enable regularly scheduled checks of a host unless you absolutely need to! Host checks are already performed on-demand when necessary, so there are few times when regularly scheduled checks would be needed. Regularly scheduled host checks can negatively impact performance - see the performance tuning tips for more information. This directive is used to define the number of "time units" between regularly scheduled checks of the host. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation."

http://nagios.sourceforge.net/docs/2_0/networkreachability.html

"The main purpose of Nagios is to monitor services that run on or are provided by physical hosts or devices on your network. It should be obvious that if a host or device on your network goes down, all services that it offers will also go down with it. Similarly, if a host becomes unreachable, Nagios will not be able to monitor the services associated with that host.
Nagios recognizes this fact and attempts to check for such a scenario when there are problems with a service. Whenever a service check results in a non-OK status level, Nagios will attempt to check and see if the host that the service is running on is "alive". Typically this is done by pinging the host and seeing if any response is received. If the host check command returns a non-OK state, Nagios assumes that there is a problem with the host. In this situation Nagios will "silence" all potential alerts for services running on the host and just notify the appropriate contacts that the host is down or unreachable. If the host check command returns an OK state, Nagios will recognize that the host is alive and will send out an alert for the service that is misbehaving."

--
Marc

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users