As I have reported in the past, I have two slave servers and a master server; all checks should be run from the slave servers and passed back to the master server. I have recently been trying to understand why the master server still has kernel "Out of memory" problems, such that the kernel starts killing active processes and, in some cases, panics because there are no more processes to kill (this happens perhaps once or twice per week, usually around 4:50 - 5:10 in the morning).

As part of my investigations I have noticed that, for a typical host, 40% of tests are reported from the slave and 60% are run by the master. I can tell this because 40% of the messages for this typical host in /var/log/nagios on the master server begin "EXTERNAL COMMAND" and 60% begin "Warning:". My question is: why should this be?

Here is a copy of nagios.log from the master server for one test of one host for today (so far):
[1177369200] CURRENT SERVICE STATE: csflnx119;SPACE_TMP;OK;HARD;1;DISK OK - free space: /tmp 672 MB (70% inode=99%):
[1177369894] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 41 seconds (threshold=1817 seconds). I'm forcing an immediate check of the service.
[1177370925] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 672 MB (70% inode=99%):
[1177373014] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=2052 seconds). I'm forcing an immediate check of the service.
[1177374874] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=1816 seconds). I'm forcing an immediate check of the service.
[1177376734] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 41 seconds (threshold=1817 seconds). I'm forcing an immediate check of the service.
[1177377158] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 672 MB (70% inode=99%):
[1177379494] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 33 seconds (threshold=2305 seconds). I'm forcing an immediate check of the service.
[1177381354] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 39 seconds (threshold=1818 seconds). I'm forcing an immediate check of the service.
[1177383214] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=1816 seconds). I'm forcing an immediate check of the service.
[1177387073] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177389102] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 13 seconds (threshold=5089 seconds). I'm forcing an immediate check of the service.
[1177390507] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177392635] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 11 seconds (threshold=2118 seconds). I'm forcing an immediate check of the service.
[1177394495] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 39 seconds (threshold=1818 seconds). I'm forcing an immediate check of the service.
[1177396362] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 36 seconds (threshold=1823 seconds). I'm forcing an immediate check of the service.
[1177397210] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177399813] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 47 seconds (threshold=2562 seconds). I'm forcing an immediate check of the service.
[1177401674] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 40 seconds (threshold=1818 seconds). I'm forcing an immediate check of the service.
[1177403749] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 28 seconds (threshold=1931 seconds). I'm forcing an immediate check of the service.
[1177404093] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177406037] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 42 seconds (threshold=1902 seconds). I'm forcing an immediate check of the service.
[1177410112] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 184 seconds (threshold=2853 seconds). I'm forcing an immediate check of the service.
[1177410863] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177413485] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 30 seconds (threshold=2579 seconds). I'm forcing an immediate check of the service.
[1177415948] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 40 seconds (threshold=2119 seconds). I'm forcing an immediate check of the service.
[1177417738] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177420390] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 29 seconds (threshold=2631 seconds). I'm forcing an immediate check of the service.
[1177423551] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 14 seconds (threshold=2481 seconds). I'm forcing an immediate check of the service.
[1177424385] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177426431] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 56 seconds (threshold=1990 seconds). I'm forcing an immediate check of the service.
[1177428291] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=1816 seconds). I'm forcing an immediate check of the service.

The nagios.log file on the slave server contains only the "CURRENT SERVICE STATE:" entries for this server and test combination. Why would this be? Is it because the slave server is configured to "obsess_over_services"? There are a few entries in the nagios.log file for this host, but they refer only to Warnings (there were no critical problems on this host). I have compared the retention data file entries for this service and they are not significantly different.
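For anyone wanting to reproduce the 40%/60% split mentioned above, here is a minimal Python sketch of how such a tally could be made. It is not part of Nagios: the function names classify and count_sources and the labels "passive" and "forced" are my own illustrative choices, and it assumes each log entry sits on its own line in the format shown in the excerpt.

```python
import re
from collections import Counter

# Leading "[epoch] " stamp of a nagios.log entry, followed by the message body.
ENTRY = re.compile(r"^\[(\d+)\]\s+(.*)$")


def classify(line, host, service):
    """Return 'passive', 'forced' or None for a single log line.

    'passive' = result submitted via PROCESS_SERVICE_CHECK_RESULT,
    'forced'  = master forced a check because the result was stale.
    Both labels are illustrative, not Nagios terminology for log fields.
    """
    match = ENTRY.match(line.strip())
    if not match:
        return None
    body = match.group(2)
    passive_prefix = "EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;%s;%s;" % (host, service)
    forced_prefix = "Warning: The results of service '%s' on host '%s' are stale" % (service, host)
    if body.startswith(passive_prefix):
        return "passive"
    if body.startswith(forced_prefix):
        return "forced"
    return None


def count_sources(lines, host, service):
    """Count passive results versus forced freshness checks for one service."""
    counts = Counter()
    for line in lines:
        kind = classify(line, host, service)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Passing the open master nagios.log file together with "csflnx119" and "SPACE_TMP" to count_sources gives the two counts, from which the percentage split follows.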
I have also run nagios -s /etc/nagios/nagios.cfg on the master and the slave servers; the output on both systems says "I have no suggestions - things look okay".

So, has the list any suggestions?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users