Preface - I don't use dependencies, but...

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:nagios-users-
> [EMAIL PROTECTED] On Behalf Of Robert Arends
> Sent: Monday, August 20, 2007 8:58 AM
> To: [email protected]
> Subject: [Nagios-users] Dependency processing during network outage
> causing eventual server hang.
> 


> Each dependency starts off with a 'root' host (our end of the link to
> the customer) and a single dependent host (the next hop).
> After that the dependencies follow the routed path to each host.
> All great so far.

I'd suggest the 'parents' directive in your host definitions is
probably a better way to accomplish the above than dependencies. Your
use case appears to be exactly what it was designed for.
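As a rough sketch of what that might look like (host names and
addresses here are made up for illustration):

```
define host {
        host_name       edge-router         ; our end of the link
        address         192.0.2.1           ; hypothetical address
        }

define host {
        host_name       customer-router     ; the next hop
        address         192.0.2.2           ; hypothetical address
        parents         edge-router         ; follows the routed path
        }
```

With parents in place, when the link drops Nagios's reachability logic
marks hosts behind the failed parent as UNREACHABLE rather than DOWN,
without needing hostdependency objects at all.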
 
> The problem is that when the link to the customer fails, the behaviour
> we have experienced repeatedly is the ultimate death of the server due
> to high process and low RAM.  The server has 2 GB RAM and uses only
> about 1GB in normal operation.

What's using the extra RAM?
 
> The chronology of events is thus:
> 1. link fails
> 2. a leaf host's service is reported as SOFT down.
> 3. The host is checked until 'max_check_attempts' are reached.
> 4. then, before the host is reported in the log as HARD down, the parent
> host in the dependency hierarchy is checked.
> 5. this repeats until the path is traced up to the "network outage"
> root,  3 to 5 levels.
> 6. then this process seems to repeat for each and every service until
> they are ultimately marked as unreachable due to the network outage.

Steps 1-5 appear normal. Step 6 is probably because you're using
dependencies, though I would expect Nagios to use the last host check
state instead of re-checking. I'm a bit more familiar with the parents
logic, though...
 
> All the while this is occurring, the "Scheduling Queue" does not move.
> The server processes show a single Nagios process.
> What seems to happen is that the whole Nagios system has become single
> threaded and fixated on checking all services one elongated step at a
> time.

Yup. That's well-known behavior. Host check processing is serial:
Nagios 2 and prior stops _all_ other processing while a host is being
re-checked up to max_check_attempts. You want your host checks to
complete as quickly as possible: the minimum number of pings (if
that's what you use), repeated for the minimum number of check
attempts that satisfies you the host is really down.
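For instance, a minimal sketch of a fast host check (thresholds here
are placeholders; tune them to your network):

```
define command {
        command_name    check-host-alive-fast
        ; one ICMP packet per attempt instead of the usual five
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }

define host {
        host_name               customer-router
        check_command           check-host-alive-fast
        max_check_attempts      2       ; re-check only once before HARD down
        }
```

A single ping (-p 1) and a low max_check_attempts keep each serialized
host check short, so the chain of checks during an outage completes
much sooner.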

> As soon as the link was re-established, all the "Scheduling Queue" tasks
> released and normal operation resumed (provided the server didn't die
> first).

A likely explanation is that the host check results return an OK state
and Nagios moves on to the next task immediately.
 
--
Marc


_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null