Bryan,

You're on the right track in understanding check_load. There are three values for the warning level and three for the critical level, one each for the 1-minute, 5-minute, and 15-minute load averages. The plugin reports a warning or critical state if any one (not all three) of the corresponding load average thresholds is exceeded.

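The "any one, not all three" rule can be sketched in a few lines of Python. This only illustrates the logic described above and is not the plugin's actual source; `load_state` is a hypothetical helper, and the sample numbers are the thresholds and load averages from the original question.

```python
def load_state(loads, warn, crit):
    """Classify (1, 5, 15)-minute load averages against two threshold triples.

    Hypothetical helper: a state is raised as soon as any one average
    exceeds its own threshold; the three need not all be exceeded.
    """
    if any(load > limit for load, limit in zip(loads, crit)):
        return "CRITICAL"
    if any(load > limit for load, limit in zip(loads, warn)):
        return "WARNING"
    return "OK"


warn = (20, 18, 16)  # warning thresholds from the question
crit = (22, 19, 18)  # critical thresholds from the question

print(load_state((1.88, 2.08, 2.16), warn, crit))  # OK
print(load_state((21.0, 2.08, 2.16), warn, crit))  # WARNING: only the 1-minute value is over
print(load_state((21.0, 2.08, 18.5), warn, crit))  # CRITICAL: only the 15-minute value is over
```
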
Depending on what you're trying to measure, you may want to change your thresholds. Since the load is the number of processes ready to run (including those running), the ideal situation is one process ready to run on each core at all times. In other words, on a 24-core box, if your 1-, 5-, and 15-minute load averages are all 24, you're perfectly utilizing all of your CPU capacity.

Assuming you're monitoring for excessive load, you'll probably want to set your thresholds higher than the number of cores. Based on experience, I've set the warning thresholds for systems I monitor to 9n, 6n, and 3n for the 1-, 5-, and 15-minute load averages respectively, and the critical thresholds to 15n, 10n, and 5n, where n is the number of cores. These may seem like very high thresholds, especially for the shorter-duration averages, but I can tolerate short spikes in load. It's long-term excessive load that concerns me. Again, this is based on experience; prior to implementing these settings, I was getting a lot of alerts and much less sleep. :-)

Hope that helps.

Eric

On 7/26/2012 3:08 PM, bryan hunt wrote:
> I've got a 24 core box over here, obviously I need to tweak the
> configuration of the check_load plugin as it seems designed for a single
> core machine by default.
>
> define service{
>         use                     generic-service
>         host_name               localhost
>         service_description     Current Load
>         check_command           check_load!20!18!16!22!19!18
> }
>
> My understanding is that this breaks down as follows
>
> 1, 5, 15 minute load average.
>
> I've set it to the following.
>
> Warning thresholds. (17 is 70% of 24)
> 20!18!16
>
> So warn if it is currently 20, or averaging 17.
>
> Critical thresholds.
> 22!19!18
>
> Only one core, not maxed out, bad. Average above 22, bad.
>
> Anyhow, my question is. Is this a sane configuration. It's pretty
> generous with load.
> My usual load average is actually:
>
> 1.88 2.08 2.16
>
> Any advice appreciated,
>
> Bryan Hunt
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Nagios-users mailing list
> Nagios-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting
> any issue.
> ::: Messages without supporting info will risk being sent to /dev/null

--
Eric Stanley
___
Developer
Nagios Enterprises, LLC
Email: estan...@nagios.com
Web: www.nagios.com
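
For reference, the per-core multipliers given in the reply (9n, 6n, 3n for warning and 15n, 10n, 5n for critical) work out as follows on the 24-core box from the question. This is a minimal sketch: `scaled_thresholds` and `check_load_args` are hypothetical helpers, and the output uses the comma-separated `-w`/`-c` form that check_load accepts on the command line. How those values map onto `check_command` arguments depends on the local command definition, which the thread does not show.

```python
WARN_MULTIPLIERS = (9, 6, 3)    # 1-, 5-, 15-minute warning multipliers from the reply
CRIT_MULTIPLIERS = (15, 10, 5)  # 1-, 5-, 15-minute critical multipliers from the reply


def scaled_thresholds(cores):
    """Return (warn, crit) threshold triples scaled by the number of cores."""
    warn = tuple(m * cores for m in WARN_MULTIPLIERS)
    crit = tuple(m * cores for m in CRIT_MULTIPLIERS)
    return warn, crit


def check_load_args(cores):
    """Format the scaled thresholds as check_load -w/-c arguments."""
    warn, crit = scaled_thresholds(cores)
    return "-w {} -c {}".format(
        ",".join(str(v) for v in warn),
        ",".join(str(v) for v in crit),
    )


print(check_load_args(24))  # -w 216,144,72 -c 360,240,120
```

With a usual load of around 2 on that machine, thresholds in this range would only fire on sustained, severe overload, which matches the intent stated in the reply.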