On 2019-11-27 08:45, Csaba Dobo wrote:
> I am investigating this plugin and would like to know the calculating
> method.
> https://www.monitoring-plugins.org/doc/man/check_load.html
>
> -w, --warning=WLOAD1,WLOAD5,WLOAD15
>     Exit with WARNING status if load average exceeds WLOADn
> -c, --critical=CLOAD1,CLOAD5,CLOAD15
>     Exit with CRITICAL status if load average exceed CLOADn
> the load average format is the same used by "uptime" and "w"
>
> So when the system reports 3 values from ie. the uptime it would be
> red by the plugin. And what is the evaluation logic?

Hi Csaba,

This plugin simply returns WARNING or CRITICAL when the load is above the specified WARNING or CRITICAL threshold. The load is expressed as a floating point number, and each of the three values (1, 5 and 15 minutes) is compared with its own threshold.

The plugin is very lax about missing thresholds and behaves as follows:

1. Missing LOAD5 and/or LOAD15 value (for either threshold): back-fill from the last given threshold value (LOAD1 or LOAD5)
2. Missing warning or critical value: assume 0 (probably not desired)

The load average is the average number of processes on the run queue over the last 1, 5 and 15 minutes. That number includes currently running processes as well as those waiting to run (which usually happens when the load is greater than the number of CPUs/cores) and, most importantly, processes blocked in uninterruptible sleep (e.g. blocked on I/O).

For a purely CPU-bound load, a number equal to the number of cores simply means you're fully utilizing your system resources. Below that you are under-utilizing them, and above it you have contention for CPU time. For I/O load it depends on your I/O capacity, and load average isn't the best way to monitor specific block device usage (especially if you have multiple devices, as it doesn't tell you which one the processes are blocked on).

Since load often consists of a mix of the two, you have to determine the right value for your specific workload, and it's usually best combined with other monitoring methods (like user/system CPU cycles, context switch rate and per-device I/O count/average service time). On most systems those metrics need a running daemon like sadc (sysstat) or snmpd to collect them, because unlike load average they cannot just be read in an instant. Plugins that offer this will often just poll for a very short time, between 500 ms and 2 seconds, which isn't a representative value and doesn't scale when you need to poll many thousands of machines.

Regards,
Thomas
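
P.S. In case it helps, here is a rough Python sketch of the evaluation logic described above: back-fill missing threshold values, then compare each of the three load values with its threshold. It is only an illustration of the rules in this mail, not the plugin's actual C code; the names parse_thresholds and evaluate are made up for the example, and the exit codes 0/1/2 are the usual plugin convention for OK/WARNING/CRITICAL.

```python
from typing import List, Optional, Sequence

# Conventional monitoring plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_thresholds(arg: Optional[str]) -> List[float]:
    """Turn "L1[,L5[,L15]]" into three floats (hypothetical helper).

    Missing LOAD5/LOAD15 values are back-filled from the last value given.
    A threshold that is missing entirely is assumed to be 0.
    """
    if not arg:
        return [0.0, 0.0, 0.0]
    values = [float(part) for part in arg.split(",") if part.strip()]
    if not values or len(values) > 3:
        raise ValueError("threshold must be a float or a float triplet")
    while len(values) < 3:
        values.append(values[-1])  # back-fill from the last given value
    return values


def evaluate(load: Sequence[float],
             warn: Sequence[float],
             crit: Sequence[float]) -> int:
    """Compare the 1, 5 and 15 minute loads with their thresholds."""
    state = OK
    for current, w, c in zip(load, warn, crit):
        if current > c:
            return CRITICAL  # any value above critical wins immediately
        if current > w:
            state = WARNING
    return state
```

For example, evaluate((1.0, 2.0, 3.5), parse_thresholds("3"), parse_thresholds("6,5,4")) gives WARNING, because only the 15 minute value is above its warning threshold of 3. On Unix, Python's os.getloadavg() returns the same three values that uptime prints, so it can supply the first argument.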