On 2013-05-23 17:43, C. Bensend wrote:
>
> Hey folks,
>
> I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
>    results to one central console server.
>
> Now, this new distributed system was in place for several months
> while I tested, and it worked fine. HOWEVER, since this was running
> in parallel with my production system, notifications were disabled.
> Hence, I didn't see this problem until I cut over for real and
> enabled notifications.
>
> (please excuse any cut-n-paste ugliness, had to send this info from
> my work account via Outlook and then try to cleanse and reformat
> via Squirrelmail)
>
> As a test and to capture information, I rebooted 'hostname'. This
> log is from the nagios-console host, which is the host that accepts
> the passive check results and sends notifications. Here is the
> console host receiving a service check failure while the host is
> restarting:
>
> May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host
>
> So, the distributed poller system checks the host and sends its
> results to the console server:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
>
> And then the centralized server IMMEDIATELY goes into a hard state,
> which triggers a notification:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
> May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)
>
> Um. Wat? Why would the console immediately trigger a hard
> state? The config files don't support this decision. And this
> IS a problem with the console server - the distributed monitors
> continue checking the host six times like they should.
> But for some reason, the centralized console just immediately
> calls it a hard state.
>
> Definitions on the distributed monitoring host (the one running
> the actual host and service checks for the host 'hostname'):
>
> define host {
>     host_name               hostname
>     alias                   Old production Nagios server
>     address                 a.b.c.d
>     action_url              /pnp4nagios/graph?host=$HOSTNAME$
>     icon_image_alt          Red Hat Linux
>     icon_image              redhat.png
>     statusmap_image         redhat.gd2
>     check_command           check-host-alive
>     check_period            24x7
>     notification_period     24x7
>     contact_groups          linux-infrastructure-admins
>     use                     linux-host-template
> }
>
> The linux-host-template on that same system:
>
> define host {
>     name                    linux-host-template
>     register                0
>     max_check_attempts      6
>     check_interval          5
>     retry_interval          1
>     notification_interval   360
>     notification_options    d,r
>     active_checks_enabled   1
>     passive_checks_enabled  1
>     notifications_enabled   1
>     check_freshness         0
>     check_period            24x7
>     notification_period     24x7
>     check_command           check-host-alive
>     contact_groups          linux-infrastructure-admins
> }
>
> And said command to determine up or down:
>
> define command {
>     command_name    check-host-alive
>     command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
> }
>
> Definitions on the centralized console host (the one that notifies):
>
> define host {
>     host_name               hostname
>     alias                   Old production Nagios server
>     address                 a.b.c.d
>     action_url              /pnp4nagios/graph?host=$HOSTNAME$
>     icon_image_alt          Red Hat Linux
>     icon_image              redhat.png
>     statusmap_image         redhat.gd2
>     check_command           check-host-alive
>     check_period            24x7
>     notification_period     24x7
>     contact_groups          linux-infrastructure-admins
>     use                     linux-host-template,Default_monitor_server
> }
>
> The "Default_monitor_server" template on the centralized server:
>
> define host {
>     name                    Default_monitor_server
>     register                0
>     active_checks_enabled   0
>     passive_checks_enabled  1
>     notifications_enabled   1
>     check_freshness         0
>     freshness_threshold     86400
> }
>
> And the
> linux-host-template template on that same centralized host:
>
> define host {
>     name                    linux-host-template
>     register                0
>     max_check_attempts      6
>     check_interval          5
>     retry_interval          1
>     notification_interval   360
>     notification_options    d,r
>     active_checks_enabled   1
>     passive_checks_enabled  1
>     notifications_enabled   1
>     check_freshness         0
>     check_period            24x7
>     notification_period     24x7
>     check_command           check-host-alive
>     contact_groups          linux-infrastructure-admins
> }
>
> This is causing some real problems:
>
> 1) If a single host polling cycle has a blip, it notifies
>    IMMEDIATELY.
> 2) Because it notifies immediately, it ignores host dependencies.
>    So, when a WAN link goes down, for example, it fires off
>    notifications for *all* hosts at that site as fast as it can,
>    when it should be retrying and then walking the dependency tree.
>
> I do have translate_passive_host_checks=1 on the centralized
> monitor, but the way I understand it, that shouldn't affect a
> state going from SOFT to HARD. Am I misinterpreting this?
>
> Another variable - I'm using NConf for the configuration management,
> and it does some templating tricks to help with the distributed
> monitoring setup. But all it does is generate config files, and I
> don't see any evidence in the configs as to why this would be
> happening.
>
> Any help would be greatly appreciated!
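The behavior described above matches a Nagios 3.x default: passive host check results are applied as HARD state changes on the server that receives them, bypassing the SOFT retry cycle that max_check_attempts would otherwise provide. A minimal nagios.cfg fragment for the central console is sketched below; the option name is from the Nagios 3 documentation, and the comments describing its effect on this particular setup are my reading of it, not something verified against this configuration:

```
# nagios.cfg on the central console server
#
# Default is 0: a passive host check result is applied as a HARD
# state immediately, so max_check_attempts=6 in the host template
# is never consulted and host dependencies are not walked first.
#
# With 1, passive host results go through the normal SOFT -> HARD
# progression, so retries and dependency suppression apply.
passive_host_checks_are_soft=1
```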
Set passive_host_checks_are_soft=1 in nagios.cfg on your master server
and things should start working as intended.

--
Andreas Ericsson                   andreas.erics...@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null