In the past I have reported problems where our master server failed with "Out of memory" errors caused by all server memory and swap space being used up. I have largely (but not completely) solved these by increasing the number of "Command" and "Check result" buffers. However, I would like some explanations of the following problems (note that I run 1 master and 5 slave servers, shortly to become 6 slaves; the master server runs the nagios, nsca and ndo2db daemons):

1. When I arrived this morning, there were 27000+ nsca processes waiting to run. Counting the processes showed that the number was increasing by at least 10 per second.

2. Recently a restart of the nagios daemon (on the master server) has hung after 27 seconds and does not reach completion.

3. For some restarts of the nagios daemon (for example, after a configuration change), the command pipe cannot be created because there is a normal file in its place. Is this regular file created by an nsca process? Can I stop this happening?

4. After a reboot of the master server to try to fix problems 1 and 2 above (I have tried restarting nsca and nagios, and killing many of the nsca processes), the nagios daemon did not update any of its log files (see the outputs below).

Output from command "nagiostats -c /etc/nagios/nagios.cfg":

Nagios Stats 2.10
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 10-21-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File: /var/log/nagios/tmpfs/status.dat
Status File Age: 0d 0h 56m 56s
Status File Version: 2.10

Program Running Time: 0d 0h 57m 34s
Nagios PID: 3229
Used/High/Total Command Buffers: 0 / 0 / 40960
Used/High/Total Check Result Buffers: 0 / 0 / 61440

Total Services: 18688
Services Checked: 18688
Services Scheduled: 26
Active Service Checks: 4882
Passive Service Checks: 13806
Total Service State Change: 0.000 / 94.540 / 0.082 %
Active Service Latency: 0.207 / 94495564.236 / 19643.884 sec
Active Service Execution Time: 0.116 / 31.104 / 0.612 sec
Active Service State Change: 0.000 / 94.540 / 0.105 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Service State Change: 0.000 / 76.250 / 0.074 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 17257 / 210 / 174 / 1047
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 907
Hosts Checked: 901
Hosts Scheduled: 0
Active Host Checks: 907
Passive Host Checks: 0
Total Host State Change: 0.000 / 20.000 / 0.162 %
Active Host Latency: 0.000 / 235.096 / 4.491 sec
Active Host Execution Time: 0.000 / 10.127 / 0.358 sec
Active Host State Change: 0.000 / 20.000 / 0.162 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 859 / 48 / 0
Hosts Flapping: 0
Hosts In Downtime: 0

Output from command "nagios -s /etc/nagios/nagios.cfg":

Nagios 2.10
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 10-21-2007
License: GPL

Warning: Host 'Dont know 1 on 184' has no services associated with it!
Warning: Host 'Dont know 2 on 184' has no services associated with it!
Warning: Host 'babarams1' has no services associated with it!
Warning: Host 'babarams1-2' has no services associated with it!
Warning: Host 'babarams1-3' has no services associated with it!
Warning: Host 'babarams1-4' has no services associated with it!
Warning: Host 'babarams2' has no services associated with it!
Warning: Host 'babarams2-2' has no services associated with it!
Warning: Host 'babarams2-3' has no services associated with it!
Warning: Host 'babarams2-4' has no services associated with it!
Warning: Host 'c2certdb' has no services associated with it!
Warning: Host 'c2certdlf' has no services associated with it!
Warning: Host 'c2certlsf' has no services associated with it!
Warning: Host 'c2certns' has no services associated with it!
Warning: Host 'c2certstager' has no services associated with it!
Warning: Host 'ctsc18' has no services associated with it!
Warning: Host 'jra1dch01' has no services associated with it!
Warning: Host 'jra1dcp01' has no services associated with it!
Warning: Host 'swt-4400-1' has no services associated with it!
Warning: Host 'swt-5510-1' has no services associated with it!
Warning: Host 'swt-5510-2' has no services associated with it!
Warning: Host 'swt-5510-3' has no services associated with it!
Warning: Host 'swt-5530-0' has no services associated with it!
Warning: Host 'swt-55xx-ads' has no services associated with it!
Warning: Host 'swt001' has no services associated with it!
Warning: Host 'swt002' has no services associated with it!
Warning: Host 'swt003' has no services associated with it!
Warning: Host 'swt004' has no services associated with it!
Warning: Host 'swt005' has no services associated with it!
Warning: Host 'swt006' has no services associated with it!
Warning: Host 'swt007' has no services associated with it!
Warning: Host 'swt008' has no services associated with it!
Warning: Host 'swt010' has no services associated with it!
Warning: Contact 'guyDaytime' is not a member of any contact groups!
Warning: Contact group 'aix-ads-contacts-callout' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-build' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-preprod' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-srmV2' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'corew' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'tape-robot-contacts-callout' is not used in any host/service definitions or host/service escalations!

Projected scheduling information for host and service checks is listed below. This information assumes that you are going to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 907
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A

SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 18688
Total scheduled services: 21
Service inter-check delay method: SMART
Average service check interval: 11742.86 sec
Inter-check delay: 85.71 sec
Interleave factor method: SMART
Average services per host: 20.60
Service interleave factor: 1
Max service check spread: 30 min
First scheduled check: Wed Mar 12 10:10:11 2008
Last scheduled check: Thu Mar 13 04:00:00 2008

CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval: 4 sec
Max concurrent service checks: Unlimited

PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.

Output from command "cd /var/log/nagios; ls -ltr . rw tmpfs":

rw:
total 0
prw-rw---- 1 nagios apache 0 Mar 12 09:41 nagios.cmd

.:
total 59264
-rw-rw-r-- 1 nagios nagios 2483 Mar 5 08:39 downtime.log
drwxr-xr-x 2 nagios nagios 12288 Mar 12 00:00 archives
-rw------- 1 nagios nagios 22832729 Mar 12 08:41 retention.dat
-rw-r--r-- 1 nagios nagios 15081485 Mar 12 08:42 objects.cache
drwxr-sr-x 2 nagios apache 4096 Mar 12 08:42 rw
-rw-rw-r-- 1 nagios nagios 96471 Mar 12 08:42 comment.log
drwxrwxrwt 2 root root 60 Mar 12 08:42 tmpfs
-rw-rw-r-- 1 nagios nagios 22564667 Mar 12 08:42 nagios.log

tmpfs/:
total 20864
-rw-r--r-- 1 nagios nagios 21333927 Mar 12 08:42 status.dat

Any comments, advice etc would be most appreciated as it is getting rather frustrating when nagios does not perform reliably.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
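
For problem 3, a check along these lines could be run before each restart to see whether the command file is still a named pipe or has been replaced by a regular file. This is only a sketch: the path is assumed from the "ls -ltr" listing above (rw/nagios.cmd under /var/log/nagios), and command_file_kind is a hypothetical helper, not part of Nagios or NSCA.

```python
import os
import stat

# Assumed from the "ls -ltr" listing: rw/nagios.cmd under /var/log/nagios.
COMMAND_FILE = "/var/log/nagios/rw/nagios.cmd"


def command_file_kind(path):
    """Classify path as 'fifo', 'regular', 'missing', 'unreadable' or 'other'.

    Hypothetical helper for illustration; not part of Nagios or NSCA.
    """
    try:
        # lstat so that a symlink is reported as 'other' rather than followed.
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "unreadable"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISREG(mode):
        return "regular"
    return "other"


if __name__ == "__main__":
    kind = command_file_kind(COMMAND_FILE)
    print(f"{COMMAND_FILE}: {kind}")
    if kind == "regular":
        # Matches the symptom in problem 3: a normal file where the pipe should be.
        print("A regular file is in the way of the command pipe.")
```

Run as the nagios user (or root) so that the rw directory is readable. A result of "fifo" corresponds to the "prw-rw----" entry in the listing; "regular" corresponds to the failure described in problem 3.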
