Preface - I don't use dependencies, but...

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:nagios-users-
> [EMAIL PROTECTED] On Behalf Of Robert Arends
> Sent: Monday, August 20, 2007 8:58 AM
> To: [email protected]
> Subject: [Nagios-users] Dependency processing during network outage
> causing eventual server hang.
> Each dependency starts off with a 'root' host (our end of the link to
> the customer) and a single dependent host (the next hop).
> After that the dependencies follow the routed path to each host.
> All great so far.

I'd suggest that using the 'parents' directive in your host definitions
is probably a better way to accomplish the above than dependencies. Your
use appears to be exactly what it was meant for.

> The problem is that when the link to the customer fails, the behaviour
> we have experienced repeatedly is the ultimate death of the server due
> to high process count and low RAM. The server has 2 GB RAM and uses
> only about 1 GB in normal operation.

What's using the extra RAM?

> The chronology of events is thus:
> 1. The link fails.
> 2. A leaf host's service is reported as SOFT down.
> 3. The host is checked until 'max_check_attempts' is reached.
> 4. Then, before the host is reported in the log as HARD down, the
>    parent host in the dependency hierarchy is checked.
> 5. This repeats until the path is traced up to the "network outage"
>    root, 3 to 5 levels.
> 6. Then this process seems to repeat for each and every service until
>    they are ultimately marked as unreachable due to the network outage.

Steps 1-5 appear normal. Step 6 is probably because you're using
dependencies, though I would expect Nagios to use the last host check
state instead of re-checking. I'm a bit more familiar with the parents
logic, though.

> All the while this is occurring, the "Scheduling Queue" does not move.
> The server processes show a single Nagios process.
> What seems to happen is that the whole Nagios system has become single
> threaded and fixated on checking all services one elongated step at a
> time.

Yup. That's well-known behavior. Host check processing is a serial
process: Nagios 2 and prior stops _all_ other processing while hosts are
being checked, up to max_check_attempts. You want your host checks to
complete as quickly as possible.
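The 'parents' setup suggested above might be sketched roughly as follows. This is illustrative only: host names and addresses are made up, and the chain simply mirrors the routed path from the root host out to the leaf:

```
# Hypothetical sketch: modeling the routed path with 'parents'
# instead of host dependencies. Names and addresses are invented.

define host{
    host_name       our-router          ; our end of the link to the customer
    address         192.0.2.1
    use             generic-host
    }

define host{
    host_name       customer-hop1       ; the next hop
    address         192.0.2.2
    parents         our-router
    use             generic-host
    }

define host{
    host_name       customer-leaf
    address         192.0.2.3
    parents         customer-hop1       ; follows the routed path
    use             generic-host
    }
```

With this in place, when the link fails Nagios can mark everything behind our-router as UNREACHABLE rather than DOWN, without needing dependency objects for the reachability logic.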
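Since host checks are serialized, the check command itself is worth tuning so each attempt returns quickly. A hedged sketch, assuming check_ping from the standard plugins; the command name and thresholds here are purely illustrative:

```
# Illustrative only: a fast host-check command that sends a single
# ping packet (-p 1) with a short timeout (-t 5), so serialized host
# checks return quickly during an outage.
define command{
    command_name    check-host-alive-fast
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 1000.0,80% -c 2000.0,100% -p 1 -t 5
    }
```

Pairing a command like this with a low max_check_attempts on the host definitions keeps the total time Nagios spends blocked on each down host to a minimum.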
Use only the minimum number of pings (if that's what you use), repeated
for the minimum number of check attempts needed to satisfy you that the
host is really down.

> As soon as the link was re-established, all the "Scheduling Queue"
> tasks released and normal operation resumed (provided the server
> didn't die first).

A likely explanation is that the host check results return an OK state
and Nagios moves on to the next task immediately.

--
Marc

_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
