Hey folks, I recently made two major changes to my Nagios environment:
1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive results to one central console server.

This new distributed system was in place for several months while I tested, and it worked fine. HOWEVER, since it was running in parallel with my production system, notifications were disabled. Hence, I didn't see this problem until I cut over for real and enabled notifications. (Please excuse any cut-n-paste ugliness; I had to send this info from my work account via Outlook and then try to cleanse and reformat it via SquirrelMail.)

As a test, and to capture information, I rebooted 'hostname'. The log below is from the nagios-console host, which is the host that accepts the passive check results and sends notifications.

First, the console host receives a service check failure while the host is restarting:

    May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host

Then the distributed poller checks the host and sends its result to the console server:

    May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)

And then the centralized server IMMEDIATELY goes into a hard state, which triggers a notification:

    May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
    May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)

Um. Wat? Why would the console immediately trigger a hard state? The config files don't support this decision. And this IS a problem with the console server - the distributed pollers keep retrying the host check 6 times like they should. But for some reason, the centralized console just immediately calls it a hard state.
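For reference, the pollers forward results to the console using the standard obsess-over/NSCA pattern from the distributed monitoring docs. Roughly like this (a sketch only - the wrapper script name and paths are illustrative, not my exact files):

    # nagios.cfg on each poller
    obsess_over_hosts=1
    ochp_command=submit_host_check_result

    # command definition on each poller
    define command {
        command_name    submit_host_check_result
        command_line    $USER1$/eventhandlers/submit_host_check_result $HOSTNAME$ '$HOSTSTATE$' '$HOSTOUTPUT$'
    }

    #!/bin/sh
    # submit_host_check_result (sketch) - forward one passive host result.
    # $1 = host name, $2 = host state (UP/DOWN/UNREACHABLE), $3 = plugin output
    case "$2" in
        UP)          code=0 ;;
        DOWN)        code=1 ;;
        *)           code=2 ;;   # UNREACHABLE (or anything unexpected)
    esac
    # send_nsca reads tab-delimited "host<TAB>state<TAB>output" for host checks
    printf '%s\t%s\t%s\n' "$1" "$code" "$3" \
        | /usr/local/nagios/bin/send_nsca -H nagios-console \
            -c /usr/local/nagios/etc/send_nsca.cfg

On the console side, the NSCA daemon then injects each result into the external command file as something like:

    [1369256250] PROCESS_HOST_CHECK_RESULT;hostname;1;CRITICAL - Host Unreachable (a.b.c.d)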
Definitions on the distributed monitoring host (the one running the actual host and service checks for 'hostname'):

    define host {
        host_name               hostname
        alias                   Old production Nagios server
        address                 a.b.c.d
        action_url              /pnp4nagios/graph?host=$HOSTNAME$
        icon_image_alt          Red Hat Linux
        icon_image              redhat.png
        statusmap_image         redhat.gd2
        check_command           check-host-alive
        check_period            24x7
        notification_period     24x7
        contact_groups          linux-infrastructure-admins
        use                     linux-host-template
    }

The linux-host-template on that same system:

    define host {
        name                    linux-host-template
        register                0
        max_check_attempts      6
        check_interval          5
        retry_interval          1
        notification_interval   360
        notification_options    d,r
        active_checks_enabled   1
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        check_period            24x7
        notification_period     24x7
        check_command           check-host-alive
        contact_groups          linux-infrastructure-admins
    }

And the command used to determine up or down:

    define command {
        command_name            check-host-alive
        command_line            $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
    }

Definitions on the centralized console host (the one that notifies):

    define host {
        host_name               hostname
        alias                   Old production Nagios server
        address                 a.b.c.d
        action_url              /pnp4nagios/graph?host=$HOSTNAME$
        icon_image_alt          Red Hat Linux
        icon_image              redhat.png
        statusmap_image         redhat.gd2
        check_command           check-host-alive
        check_period            24x7
        notification_period     24x7
        contact_groups          linux-infrastructure-admins
        use                     linux-host-template,Default_monitor_server
    }

The "Default_monitor_server" template on the centralized server:

    define host {
        name                    Default_monitor_server
        register                0
        active_checks_enabled   0
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        freshness_threshold     86400
    }

And the linux-host-template on that same centralized host:

    define host {
        name                    linux-host-template
        register                0
        max_check_attempts      6
        check_interval          5
        retry_interval          1
        notification_interval   360
        notification_options    d,r
        active_checks_enabled   1
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        check_period            24x7
        notification_period     24x7
        check_command           check-host-alive
        contact_groups          linux-infrastructure-admins
    }

This is causing some real problems:

1) If a single host polling cycle has a blip, it notifies IMMEDIATELY.
2) Because it notifies immediately, it ignores host dependencies. So when a WAN link goes down, for example, it fires off notifications for *all* hosts at that site as fast as it can, when it should be retrying and then walking the dependency tree.

I do have translate_passive_host_checks=1 on the centralized monitor, but the way I understand it, that shouldn't affect a state going from SOFT to HARD. Am I misinterpreting this? (See the PS below for the relevant nagios.cfg directives.)

Another variable: I'm using NConf for configuration management, and it does some templating tricks to help with the distributed monitoring setup. But all it does is generate config files, and I don't see any evidence in the configs as to why this would be happening.

Any help would be greatly appreciated!

Benny
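PS: For completeness, here's roughly what the passive-check-related directives in nagios.cfg on the console side look like (a sketch from memory - apart from translate_passive_host_checks I haven't touched these, so they should be sitting at the 3.x defaults):

    # nagios.cfg on nagios-console (relevant directives only)
    check_external_commands=1
    accept_passive_host_checks=1
    accept_passive_service_checks=1
    translate_passive_host_checks=1

    # Default is 0. Per the nagios.cfg docs, 0 means a passive host check
    # result puts the host directly into a HARD state type; it has to be
    # set to 1 for passive host results to go through SOFT retries - which
    # looks suspiciously close to the behavior above.
    passive_host_checks_are_soft=0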