Short version: I have an ever-growing Nagios install for monitoring a bunch of Linux hosts (currently 99 hosts & 2322 services; I plan on adding 115 more hosts & 1500+ services). I've noticed something odd with my escalation rules: they're being repeated multiple times in my objects.cache file. This has started to affect performance for parts of my Nagios install, to the point where the web interface is painfully slow to use.
My google-fu is weak today, so I was hoping someone here could point me in the right direction.

Longer version: I have 4 escalation rules:

- Our helpdesk gets notification #1 for critical issues.
- Our on-call person gets notifications 1 -> 12 @ 5 minute intervals 24x7.
- The relevant IT-group leader(s) get notifications 5 -> 12 @ 5 minute intervals during on call periods.
- Our CIO gets notification 12 -> infinity at 60 minute intervals during on call periods.

We use puppet to control our environment, and it's amazing for deploying servers and adding them to nagios. Once I'm able to bring in other aspects of our environment under puppet control (firewall, sudo, yum repos), it will be trivial to set up a server from scratch and monitor it.

In order to create a new set of escalation rules, we use a custom class on the puppet server and a small bit of code to be executed from the client-side (of puppet) to make this work. An example:

# Escalate to the_boss. He, in turn, will call people. I imagine this
# to be along the lines of Hulk nudging Thor playfully in The
# Avengers. And sending him flying through a few bulkheads.
nagios::server::escalations { "Boss-critical":
    contact_groups        => "the_boss",
    escalation_options    => "c,r",
    escalation_period     => "oncall_hours",
    first_notification    => "12",
    last_notification     => "0",
    notification_interval => "60",
    servicegroup_name     => "Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie",
}

I know this portion works correctly - it's producing my desired result, which is 1 file per (set) of escalation rules specified. I have 1722 escalation cfg files.
The cfg files look something like this:

define serviceescalation{
    contact_groups the_boss
    escalation_options c,r
    escalation_period oncall_hours
    first_notification 12
    host_name my.hostname.xyz
    last_notification 0
    notification_interval 60
    #service_description Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie
    servicegroup_name Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie
}

The rules themselves live in the following directory structure: /etc/nagios/escalations/$hostname/$rulename.cfg, and nagios.cfg has an entry to read /etc/nagios/escalations/ as a whole.

The rules are written to objects.cache as:

define serviceescalation {
    host_name my.hostname.xyz
    service_description Zombie
    first_notification 12
    last_notification 0
    notification_interval 60.000000
    escalation_period oncall_hours
    escalation_options c,r
    contacts jabberbot-con
    contact_groups the_boss
}

In case you're wondering, the reason we don't wildcard stuff is so we can control it on a per-host basis. It could be that host uvw doesn't require us to monitor MySQL processes, as MySQL isn't installed there. Having an escalation for a non-existing service would mean the nagios config check fails, etc.

Now, when I look at my objects.cache file, I see this:

Rule #1
Rule #2
Rule #3
Rule #4
(repeat 98 more times)

I see the same if I look at a different host - that is, 99 copies of a rule that is particular to that host. Instead of having 9288 escalation rules, I have over 900000 (900 thousand).

I looked at my test nagios install (which has a smaller pool of hosts, completely unrelated to my live environment), and it exhibits the same issue. The pool is just small enough that the size of objects.cache didn't matter.

My questions to you guys:

- Am I crazy to think that it's reading every rule once for *each* server? I thought it was a coincidence, but it's happening in my test setup as well, which is in a completely separate VDC.
- Have you seen this before? If so, how did you fix it?
- What else should I look at? I'm stumped. I can't find anything tell-tale in logs, strace produces a mountain of gibberish, and I haven't turned up anything online.

-Chris B.

Some more info, as I'm sure you'll ask for this:

I tried using the precache, it didn't help. Both files were created by my nagios install.

#ls -la | grep objects
-rw-r--r-- 1 nagios nagios 251616779 Oct 1 14:17 objects.cache
-rw-r--r-- 1 nagios nagios 251616779 Oct 1 14:16 objects.precache
(that's 251mb)

# nagios -v
Nagios Core 3.3.1

# yum list nagios
Installed Packages
nagios.x86_64 3.3.1-3.el6 @epel

# uname -a
Linux nagios.hostname.xyz 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux
(Centos 6.3)

And lastly, my config. I'll be the first to admit it needs some more tweaking, however it's working reasonably well now.

# more nagios.cfg
##############################################################################
#
# NAGIOS.CFG - Sample Main Config File for Nagios 3.3.1
#
# Read the documentation for more information on this configuration
# file. I've provided some comments here, but things may not be so
# clear without further explanation.
#
# Last Modified: 12-14-2008
#
##############################################################################
log_file=/var/log/nagios/nagios.log
# You can specify individual object config files as shown below:
cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg
cfg_file=/etc/nagios/objects/templates.cfg
cfg_file=/etc/nagios/objects/hostgroups.cfg
cfg_file=/etc/nagios/objects/oncall.cfg
# You can also tell Nagios to process all config files (with a .cfg
# extension) in a particular directory by using the cfg_dir
# directive as shown below:
cfg_dir=/etc/nagios/escalations
cfg_dir=/etc/nagios/servers
cfg_dir=/etc/nagios/services
cfg_dir=/etc/nagios/hostgroups
cfg_dir=/etc/nagios/servicegroups
object_cache_file=/var/log/nagios/objects.cache
precached_object_file=/var/log/nagios/objects.precache
resource_file=/etc/nagios/private/resource.cfg
status_file=/var/log/nagios/status.dat
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
#command_check_interval=15s
command_check_interval=-1
command_file=/var/spool/nagios/cmd/nagios.cmd
external_command_buffer_slots=4096
lock_file=/var/run/nagios.pid
temp_file=/var/log/nagios/nagios.tmp
temp_path=/tmp
event_broker_options=-1
broker_module=/usr/lib64/nagios/brokers/npcdmod.o config_file=/etc/pnp4nagios/npcd.cfg
log_rotation_method=d
log_archive_path=/var/log/nagios/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
global_service_event_handler=remove_service_ack
service_inter_check_delay_method=0.01
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=0.02
max_host_check_spread=30
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/var/log/nagios/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
#time_change_threshold=900
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
sleep_time=0.25
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=1
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
#host_perfdata_command=process-host-perfdata
#service_perfdata_command=process-service-perfdata
#host_perfdata_file=/tmp/host-perfdata
#service_perfdata_file=/tmp/service-perfdata
#host_perfdata_file_template=[HOSTPERFDATA]\t$TIMET$\t$HOSTNAME$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$
#service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$
#host_perfdata_file_mode=a
#service_perfdata_file_mode=a
#host_perfdata_file_processing_interval=0
#service_perfdata_file_processing_interval=0
#host_perfdata_file_processing_command=process-host-perfdata-file
#service_perfdata_file_processing_command=process-service-perfdata-file
obsess_over_services=0
#ocsp_command=somecommand
obsess_over_hosts=0
#ochp_command=somecommand
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
#use_timezone=US/Mountain
#use_timezone=Australia/Brisbane
p1_file=/usr/sbin/p1.pl
enable_embedded_perl=1
use_embedded_perl_implicitly=1
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=1
enable_environment_macros=1
#free_child_process_memory=1
#child_processes_fork_twice=1
debug_level=0
debug_verbosity=1
debug_file=/var/log/nagios/nagios.debug
max_debug_file_size=1000000

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
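
For anyone who wants to check their own install for the duplication described above, here is a minimal sketch in Python that counts identical serviceescalation entries in an objects.cache file. It assumes the block layout shown earlier in this message (one "define serviceescalation {" line, one key/value pair per line, a closing "}"). The function names and the choice of fields that identify a rule are my own assumptions for illustration, not anything Nagios provides.

```python
from collections import Counter

# Fields assumed to identify "the same rule"; adjust as needed.
KEY_FIELDS = ("host_name", "service_description", "first_notification",
              "last_notification", "contact_groups")


def parse_escalations(text):
    """Yield one dict per 'define serviceescalation { ... }' block."""
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("define serviceescalation"):
            block = {}                      # start collecting a new block
        elif line == "}" and block is not None:
            yield block                     # block finished
            block = None
        elif block is not None and line:
            parts = line.split(None, 1)     # key, then the rest as value
            block[parts[0]] = parts[1] if len(parts) > 1 else ""


def duplicate_counts(text, fields=KEY_FIELDS):
    """Return {rule key: count} for every rule that appears more than once."""
    counts = Counter(
        tuple(block.get(name, "") for name in fields)
        for block in parse_escalations(text)
    )
    return {key: n for key, n in counts.items() if n > 1}


def report(path):
    """Print each duplicated rule in the given cache file with its count."""
    with open(path) as handle:
        for key, n in sorted(duplicate_counts(handle.read()).items()):
            print(n, *key)
```

Calling report("/var/log/nagios/objects.cache") (the object_cache_file path from the config above) prints one line per duplicated rule with its count in front. On a 251 MB file this reads everything into memory, so treat it as a one-off diagnostic rather than something to run regularly.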