Marc,

Thanks for your answer. We are indeed using the PARENTS directive; apologies for using the wrong word - I had no idea until reading further last night that there are also dependency directives. From my perspective the PARENTS directive produced a dependent hierarchy, so I called it 'dependency'. We also came from WhatsUp Gold, where the parents concept is called dependency.
The other RAM user is MRTG, but RAM is not the issue. RAM use does not climb except when 600 processes start at once!

Marc said:
> Host check processing is a serial process. Nagios 2 and prior stops
> _all_ other processing while hosts are being checked, up to
> max_check_attempts.

Wow, OK, so we can tune this to reduce the impact of the problem.

Now that you know we are using parents logic, can you revisit your answer? Especially re point 6 below.

> 6. then this process seems to repeat for each and every service until
> they are _all_ ultimately marked as unreachable due to the network outage.

Based on the above experience (point 6), even with max_check_attempts set to 2, that would be: 161 leaf hosts x ~2 services + ~100 parents = 422 checks x 2 minutes = 844 minutes to fully check the entire outage and return to checking the other parent trees.

I think you are saying that Nagios 3 continues to check other hosts while dealing with a network outage - how stable is v3?

Rob :-)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Powell
Sent: Tuesday, August 21, 2007 12:38 AM
To: [email protected]
Subject: Re: [Nagios-users] Dependency processing during network outage causing eventual server hang.

Preface - I don't use dependencies, but...

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:nagios-users-
> [EMAIL PROTECTED] On Behalf Of Robert Arends
> Sent: Monday, August 20, 2007 8:58 AM
> To: [email protected]
> Subject: [Nagios-users] Dependency processing during network outage
> causing eventual server hang.
>
> Each dependency starts off with a 'root' host (our end of the link to
> the customer) and a single dependent host (the next hop).
> After that the dependencies follow the routed path to each host.
> All great so far.

I'd suggest that using the 'parents' directive in your host definitions is probably a better way to accomplish the above than dependencies.
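The back-of-the-envelope estimate above can be checked with a quick script. The host counts are the figures from the email; the 2 minutes per check is the stated assumption (max_check_attempts of 2 at roughly a minute per attempt):

```python
# Rough worst-case time to walk a full outage serially,
# using the figures quoted in the thread.
leaf_hosts = 161        # leaf hosts behind the failed link
services_per_leaf = 2   # ~2 services per leaf host
parent_hosts = 100      # ~100 parent (intermediate) hosts
minutes_per_check = 2   # max_check_attempts=2 at ~1 minute per attempt

total_checks = leaf_hosts * services_per_leaf + parent_hosts
total_minutes = total_checks * minutes_per_check
print(total_checks, total_minutes)  # 422 844 -- roughly 14 hours
```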
Your use appears to be exactly what it was meant for.

> The problem is that when the link to the customer fails, the behaviour
> we have experienced repeatedly is the ultimate death of the server due
> to high process count and low RAM. The server has 2 GB RAM and uses only
> about 1 GB in normal operation.

What's using the extra RAM?

> The chronology of events is thus:
> 1. link fails
> 2. a leaf host's service is reported as SOFT down.
> 3. The host is checked until 'max_check_attempts' is reached.
> 4. then before the host is reported in the log as HARD down, the parent
> host in the dependency hierarchy is checked.
> 5. this repeats until the path is traced up to the "network outage"
> root, 3 to 5 levels.
> 6. then this process seems to repeat for each and every service until
> they are ultimately marked as unreachable due to the network outage.

1-5 appear normal. 6 is probably because you're using dependencies, but I would expect nagios to use the last host check state instead of re-checking. I'm a bit more familiar with the parents logic, though...

> All the while this is occurring, the "Scheduling Queue" does not move.
> The server processes show a single Nagios process.
> What seems to happen is that the whole Nagios system has become single
> threaded and fixated on checking all services one elongated step at a
> time.

Yup. That's well-known behavior. Host check processing is a serial process. Nagios 2 and prior stops _all_ other processing while hosts are being checked, up to max_check_attempts. You want your host checks to complete as quickly as possible: only the minimum number of pings (if that's what you use), repeated for the minimum number of check_attempts needed to satisfy you that the host is really down.

> As soon as the link was re-established, all the "Scheduling Queue" tasks
> released and normal operation resumed (provided the server didn't die
> first).
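As a sketch of what "fast host checks" plus the parents directive might look like in the config - assuming check_ping is the host check plugin; the host and command names here are illustrative, not from the thread:

```
# Single-packet ping so a down host fails fast instead of waiting
# on several ICMP timeouts per attempt.
define command{
        command_name    check-host-alive-fast
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }

define host{
        use                     generic-host
        host_name               customer-leaf-01     ; illustrative name
        parents                 customer-router-01   ; next hop back toward the Nagios server
        check_command           check-host-alive-fast
        max_check_attempts      2                    ; the minimum you trust
        }
```

With parents set, hosts behind a failed link are flagged UNREACHABLE rather than DOWN once the parent's state is known, which is the behavior the thread is after.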
A likely explanation is that host check results return an OK state and nagios moves on to the next task immediately.

--
Marc

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
