Hi all, I have a weird problem that I'd like to share and hopefully someone has some insight.
Sorry for the lengthy dump below, but I find that a detailed question up front makes the email dialog more efficient.

Environment: FC2 / Nagios 2.6

With the following stats of Nagios (from Tactical):

Service Check Execution Time: 0.00 / 1.92 / 0.505 sec
Service Check Latency: 0.00 / 3.29 / 0.432 sec
Host Check Execution Time: 1.01 / 40.16 / 3.049 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec
# Active Host / Service Checks: 316 / 753
# Passive Host / Service Checks: 0 / 185

Most of the passive service checks are SNMP traps added via SNMPTT/SEC.

The 316 hosts (each with at least 2 services - check_ping & check_snmp) are broken into 6 dependency trees. Each tree has the following number of hosts - Ahl 161, Hbs 45, Ebs 14, Imc 55, Ntl 35, Nut 6. These represent customers' hosts, and each customer's hosts are accessed via a single link to their network.

Each dependency tree starts off with a 'root' host (our end of the link to the customer) and a single dependent host (the next hop). After that the dependencies follow the routed path to each host. All great so far.

The problem is that when the link to the customer fails, the behaviour we have experienced repeatedly is the ultimate death of the server due to a high process count and low RAM. The server has 2 GB RAM and uses only about 1 GB in normal operation.

The chronology of events is thus:

1. The link fails.
2. A leaf host's service is reported as SOFT down.
3. The host is checked until 'max_check_attempts' is reached.
4. Then, before the host is reported in the log as HARD down, the parent host in the dependency hierarchy is checked.
5. This repeats until the path is traced up to the "network outage" root, 3 to 5 levels.
6. This process then seems to repeat for each and every service until they are ultimately marked as unreachable due to the network outage.

All the while this is occurring, the "Scheduling Queue" does not move. The server processes show a single Nagios process.
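To put rough numbers on the chronology above, here is a back-of-envelope sketch in Python. It assumes the host checks really do run one at a time, as the frozen queue and single Nagios process suggest, and that the walk up the tree is repeated for every service behind the failed link. The 10 seconds per failing check, the depth of 4, and the choice of the Ahl tree (161 hosts, 2 services each) are illustrative assumptions, not measured values. The function names are made up for this example.

```python
# Back-of-envelope model of serialized host checks during a link outage.
# All inputs below are illustrative assumptions, not measurements.


def serial_walk_seconds(attempts, seconds_per_attempt, depth):
    """Time for one walk from a leaf host up to the root when every host
    on the path is retried `attempts` times, strictly one check at a time."""
    return attempts * seconds_per_attempt * depth


def total_blocked_seconds(services, attempts, seconds_per_attempt, depth):
    """Worst case: the same walk is repeated once per failing service."""
    return services * serial_walk_seconds(attempts, seconds_per_attempt, depth)


if __name__ == "__main__":
    seconds_per_attempt = 10  # assumed duration of one failing host check
    depth = 4                 # assumed; the message reports 3 to 5 levels
    services = 161 * 2        # assumed: Ahl tree, 2 services per host

    for attempts in (12, 2):  # candidate max_check_attempts values
        walk = serial_walk_seconds(attempts, seconds_per_attempt, depth)
        total = total_blocked_seconds(
            services, attempts, seconds_per_attempt, depth
        )
        print(
            f"max_check_attempts={attempts}: "
            f"{walk} s per walk, {total / 3600:.1f} h in total"
        )
```

Under these assumptions a single walk already takes minutes, and repeating it per service takes hours, which would be consistent with a scheduling queue that appears not to move at all. Lowering 'max_check_attempts' shrinks the total proportionally but does not remove the serialization.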
What seems to happen is that the whole Nagios system has become single threaded and fixated on checking all services one elongated step at a time. Not even the hosts in *other* dependency trees are being processed. The nagios.log shows SNMP traps entering via the passive cmd interface, but from within the GUI, the "alert history" does not show them.

We've had 'max_check_attempts' set to 12 and found that the above scenario ultimately (after 15-40 minutes) turns into this message in nagios.log:

"Warning: A system time change of 1116 seconds (forwards in time) has been detected. Compensating..."

This is followed by many of these:

"Warning: The check of service 'NTL-PING' on host 'ntl*****' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service..."

At this point the number of processes reaches in excess of 600 and it is just a matter of time (30 mins) before the only option is to power off the server.

This has happened 5 or 6 times before. Today we tested this in a controlled environment and reproduced it easily.

Next we reduced 'max_check_attempts' to 2 and found that the chronology is the same. We did not get the time adjustment message, even after 50 minutes, but did see more of the service/host/parent checks mentioned above.

As soon as the link was re-established, all the "Scheduling Queue" tasks were released and normal operation resumed (provided the server didn't die first).

Has anyone seen this sort of thing before? I've looked at the change-log for 2.7/2.8/2.9 to see if there are fixes for this sort of thing, but no luck.

Rob :-)

________________________________
Robert Arends, Systems Engineer.
Direct 03 9863 1334 * Mobile 0412 412 345 * Email [EMAIL PROTECTED]
Web www.imc.net.au * Helpdesk 1300 555 IMC * Managed Services 02 9006 8282 (24hrs)
________________________________
_______________________________________________
Nagios-users mailing list
https://lists.sourceforge.net/lists/listinfo/nagios-users
