Hi Mathieu (and anyone else who might want to throw their hat into the ring):
After identifying that my check latency is around 500-600 seconds, I tried the tuning tips from the Nagios docs. Whatever I change, latency drops briefly after the restart and then climbs back up to the same high levels. The settings I have been adjusting so far are check_result_reaper_frequency and max_check_result_reaper_time, doubling and halving them from their default values. max_concurrent_checks remains at 0.

Load on the server is very low, a measly 1.5 on average. The machine has 8 cores, so I really wish I could make better use of it.

Finally, I tried enable_environment_macros=0, which actually made latency worse once things quiesced after startup. use_large_installation_tweaks=1 did improve latency by maybe 30%, and I did start seeing solid RRD data come in for about 15 minutes, but then it went back to being sparse. So while it is a modest improvement, the RRDs still don't fill well enough to give useful data.

Any other tuning suggestions? I think I have done everything in the performance tweaks section that seems relevant, including everything that has been suggested here. In summary, I am looking for some way to make Nagios "do more" with the system resources, as the host is barely working at all. I really wish there were some way to make Nagios do things more in parallel when a system has plenty of horsepower and RAM. If I have to resort to compiling things with different settings I am open to trying it, but I feel like I am grasping at straws now.
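One rough way to quantify the problem, as a minimal Python sketch: compare the check rate the schedule requires with the rate Nagios is actually achieving. The constants are copied from the nagiostats and nagios -s output further down in this message. The function names are made up for illustration and are not part of any Nagios tooling.

```python
# Figures copied from the nagiostats / nagios -s output in this message.
SCHEDULED_SERVICES = 4970         # "Services Scheduled"
AVG_CHECK_INTERVAL_SEC = 179.98   # "Average service check interval"
CHECKS_LAST_15_MIN = 5436         # "Active Service Checks Last 1/5/15 min"
WINDOW_SEC = 15 * 60              # length of the 15-minute sampling window


def required_rate(scheduled: int, interval_sec: float) -> float:
    """Checks per second needed to run every service once per interval."""
    return scheduled / interval_sec


def observed_rate(checks: int, window_sec: float) -> float:
    """Checks per second actually executed over the sampling window."""
    return checks / window_sec


needed = required_rate(SCHEDULED_SERVICES, AVG_CHECK_INTERVAL_SEC)
actual = observed_rate(CHECKS_LAST_15_MIN, WINDOW_SEC)

print(f"needed:   {needed:.1f} checks/sec")
print(f"observed: {actual:.1f} checks/sec")
print(f"shortfall factor: {needed / actual:.1f}x")
```

With these figures the schedule needs roughly 27.6 service checks per second, while only about 6.0 per second ran over the last 15 minutes. That is less than a quarter of the required rate, which would be consistent with latency climbing back up after every restart.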
Here is a typical nagiostats:

srwp01mon001:bin$ date; nagiostats
Sun Oct 24 17:22:41 UTC 2010

Nagios Stats 3.2.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 03-09-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 16s
Status File Version:                    3.2.1

Program Running Time:                   0d 0h 21m 54s
Nagios PID:                             9792
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         4987
Services Checked:                       4987
Services Scheduled:                     4970
Services Actively Checked:              4987
Services Passively Checked:             0
Total Service State Change:             0.000 / 15.990 / 0.006 %
Active Service Latency:                 0.236 / 683.782 / 536.494 sec
Active Service Execution Time:          0.013 / 11.525 / 0.378 sec
Active Service State Change:            0.000 / 15.990 / 0.006 %
Active Services Last 1/5/15/60 min:     0 / 1565 / 4970 / 4970
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              4972 / 10 / 1 / 4
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            241
Hosts Checked:                          241
Hosts Scheduled:                        241
Hosts Actively Checked:                 241
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    362.793 / 679.309 / 523.157 sec
Active Host Execution Time:             0.172 / 4.065 / 3.780 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        0 / 97 / 241 / 241
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  241 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     22 / 100 / 257
   Scheduled:                           22 / 97 / 242
   On-demand:                           0 / 3 / 15
   Parallel:                            22 / 97 / 242
   Serial:                              0 / 0 / 0
   Cached:                              0 / 3 / 15
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  262 / 1779 / 5436
   Scheduled:                           262 / 1779 / 5436
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0

Here is nagios -s:

# /usr/local/nagios/bin/nagios -s /usr/local/nagios/etc/nagios.cfg

Nagios Core 3.2.1
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 03-09-2010
License: GPL

Website: http://www.nagios.org

Timing information on object configuration processing is listed below.
You can use this information to see if precaching your object configuration would be useful.

Object Config Source: Config files (uncached)

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 0.008987 sec
Resolve:              0.000533 sec  *
Recomb Contactgroups: 0.000075 sec  *
Recomb Hostgroups:    0.003513 sec  *
Dup Services:         0.025789 sec  *
Recomb Servicegroups: 0.048340 sec  *
Duplicate:            0.037513 sec  *
Inherit:              0.003420 sec  *
Recomb Contacts:      0.000000 sec  *
Sort:                 0.000000 sec  *
Register:             0.038780 sec
Free:                 0.003135 sec
                      ============
TOTAL:                0.170086 sec  * = 0.119184 sec (70.07%) estimated savings

RETENTION DATA TIMES
----------------------------------
Read and Process:     0.352939 sec
                      ============
TOTAL:                0.352939 sec

Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.063209 sec
Circular Paths:       5.735947 sec  *
Misc:                 0.003824 sec
                      ============
TOTAL:                5.802980 sec  * = 5.735947 sec (98.8%) estimated savings

EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.007308 sec
Get host info info:      0.000356 sec
Get service params:      0.000011 sec
Schedule service times:  0.016611 sec
Schedule service events: 0.053224 sec
Get host params:         0.000002 sec
Schedule host times:     0.000752 sec
Schedule host events:    0.009029 sec
                         ============
TOTAL:                   0.087293 sec

Projected scheduling information for host and service checks is listed below.
This information assumes that you are going to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                        241
Total scheduled hosts:              241
Host inter-check delay method:      SMART
Average host check interval:        199.92 sec
Host inter-check delay:             0.83 sec
Max host check spread:              30 min
First scheduled check:              Sun Oct 24 17:26:17 2010
Last scheduled check:               Sun Oct 24 17:28:46 2010

SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     4987
Total scheduled services:           4970
Service inter-check delay method:   SMART
Average service check interval:     179.98 sec
Inter-check delay:                  0.04 sec
Interleave factor method:           SMART
Average services per host:          20.69
Service interleave factor:          21
Max service check spread:           30 min
First scheduled check:              Sun Oct 24 17:26:25 2010
Last scheduled check:               Sun Oct 24 17:29:24 2010

CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval:       30 sec
Max concurrent service checks:      Unlimited

PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.

Well, I hate to say it, but I think not!

On Oct 24, 2010, at 10:58 AM, Mathieu Gagné wrote:

> On 2010-10-24 03:54, Litwin, Matthew wrote:
>> You hit the nail on the head.
>> Changing MaxBytes to a very large number made latency totally dwarf execution time. So now what do I do?
>
> Try disabling environment variables in nagios.cfg:
>
> enable_environment_macros = 0

This didn't help at all, and may have made latency increase!

> Our latency dropped from 20 minutes to 10 seconds after this change.
>
> This guy had a similar issue back then:
> http://marc.info/?l=nagios-devel&m=120393376922635
>
> You should also try to enable large installation tweaks:
>
> use_large_installation_tweaks=1
>
> Documentation here:
> http://nagios.sourceforge.net/docs/3_0/largeinstalltweaks.html
>
> And adjust those configurations based on your installation:
>
> check_result_reaper_frequency
> max_concurrent_checks

As I mentioned, I have tried all sorts of permutations of these to no real effect. I have max_concurrent_checks=0 (no limit), which is the default.

> max_host_check_spread
> max_service_check_spread

What exactly do these do that might affect latency? They seem relevant only to behavior right after Nagios starts up, correct?

> --
> Mathieu

Thanks again for your advice and everyone else's up to this point,

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null