[Nagios-users] Nagios 'Out Of Memory' Problems

Armistead, Raffy Thu, 23 Mar 2006 10:24:08 -0800

I have a problem with my Nagios server constantly crashing. It keeps printing Out of Memory errors on the console, after which I lose access to the server: I can ping the box, but I cannot SSH or web into it to view any information. This has been happening more and more often lately and now occurs about every 2-3 days. The problem has gotten worse as we have added more devices to the servers. This is how I have it set up.

I have a main Nagios server running the latest 2.0 (stable) Nagios release. It monitors about 6800 devices but does not actively check them. Its main role is to provide a web interface and to receive passive check results from three other servers, which do the polling. The main server also sends email notifications when a device goes down, about 30-40 emails a day. I am using NSCA 2.5 between the main server and the client Nagios servers. I monitor only one service per device, either TCP or ping depending on the device. Most devices (roughly 6000) are monitored with TCP; the rest are monitored with ping. The devices are spread fairly evenly across the individual servers, about 2000-2500 each.
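To make the passive flow concrete, here is a minimal sketch of the single line that a polling server would hand to send_nsca for one service result. It assumes the usual tab-delimited service format (host, service description, return code, plugin output) and the standard plugin return codes; the helper name nsca_service_line is hypothetical, and the actual submit script used on these servers is not shown in this post.

```python
# Return codes follow the usual Nagios plugin convention.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def nsca_service_line(host, service, state, output, delim="\t"):
    """Build one passive service check result as a single delimited line.

    Hypothetical helper for illustration; not part of Nagios or NSCA.
    """
    if state not in STATES:
        raise ValueError("unknown state: %r" % state)
    # Collapse runs of whitespace (tabs and newlines included) so the
    # plugin output cannot break the one-line, tab-delimited record.
    text = " ".join(output.split())
    return delim.join([host, service, str(STATES[state]), text]) + "\n"


line = nsca_service_line("DETAH-R1", "PING", "OK", "PING OK - Packet loss = 0%")
print(repr(line))
```

A line like this would typically be piped into send_nsca with -H pointing at the main server and -c pointing at the send_nsca configuration file; the exact paths depend on the installation.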

All the servers are basic computers: Dell Dimension 2400s with base hardware and Celeron 2.4 GHz processors. The main server was upgraded to 2GB RAM, while the other servers have 512MB each. The individual servers are not having out-of-memory problems, and they run the latest 2.0 (stable) release as well. All of them run RedHat 9.0 with all packages installed.

Can someone please help me in resolving this problem? Thanks.

Judging by the output of top, the server does not appear to be running out of memory. This is the normal output after the server has been running for a few hours.

57 processes: 54 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 41.1% user 58.8% system 0.0% nice 0.0% iowait 0.0% idle
Mem: 2063556k av, 285940k used, 1777616k free, 0k shrd, 41056k buff
     177644k actv, 51688k in_d, 10892k in_c
Swap: 1044184k av, 0k used, 1044184k free 114208k cached
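As a quick check of the numbers above, the following sketch parses the Mem line and works out how much memory is in use. It only reflects this one snapshot, taken a few hours after a restart, and says nothing about the state just before a crash; the function name parse_kb_fields is hypothetical.

```python
import re

# The Mem line exactly as it appears in the top output above.
MEM_LINE = "Mem: 2063556k av, 285940k used, 1777616k free, 0k shrd, 41056k buff"


def parse_kb_fields(line):
    """Map each label (av, used, free, ...) to its value in kilobytes."""
    return {label: int(value) for value, label in re.findall(r"(\d+)k (\w+)", line)}


mem = parse_kb_fields(MEM_LINE)
used_pct = 100.0 * mem["used"] / mem["av"]
print("used: %.1f%% of %d kB" % (used_pct, mem["av"]))
```

In this snapshot the used and free figures add up exactly to the available total, and only about 14% of RAM is in use, with no swap touched.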

Here is a sample of the configuration I use for the devices on the main server:

hosts.cfg

define host {
    name generic-host ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled 1 ; Host notifications are enabled
    event_handler_enabled 0 ; Host event handler is disabled
    flap_detection_enabled 1 ; Flap detection is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    max_check_attempts 10
    notification_interval 720
    notification_period 24x7
    obsess_over_host 0
    notification_options d,u,r,f
}

define host {
    use generic-host ; Name of host template to use
    host_name DETAH-R1
    alias DETAH-R1
    address x.x.x.x
    check_command check_ping!200,40%!10000,100%
    contact_groups device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
    name generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled 0 ; Active service checks are disabled (results arrive passively)
    passive_checks_enabled 1 ; Passive service checks are enabled/accepted
    parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service 0 ; Do not obsess over this service
    check_freshness 1 ; Check service 'freshness' (the default is NOT to check)
    freshness_threshold 1800
    notifications_enabled 1 ; Service notifications are enabled
    event_handler_enabled 0 ; Service event handler is disabled
    flap_detection_enabled 1 ; Flap detection is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    is_volatile 0
    check_period 24x7
    max_check_attempts 6
    normal_check_interval 20
    retry_check_interval 5
    notification_interval 720
    notification_period 24x7
    notification_options n
}

define service {
    use generic-service ; Name of service template to use
    host_name DETAH-R1
    service_description PING
    contact_groups device-admins,DETAH-admins,router-admins
    check_command check_ping!200,40%!1000,100%
}

Here is a sample of the configuration on one of the individual servers:

hosts.cfg

define host {
    name generic-host ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled 1 ; Host notifications are enabled
    event_handler_enabled 0 ; Host event handler is disabled
    flap_detection_enabled 1 ; Flap detection is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    max_check_attempts 10
    notification_interval 720
    notification_period 24x7
    obsess_over_host 0
    notification_options d,u,r,f
}

define host {
    use generic-host ; Name of host template to use
    host_name DETAH-R1
    alias DETAH-R1
    address x.x.x.x
    check_command check_ping!200,40%!10000,100%
    contact_groups device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
    name generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled 1 ; Active service checks are enabled
    passive_checks_enabled 1 ; Passive service checks are enabled/accepted
    parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service 1 ; We should obsess over this service (if necessary)
    check_freshness 1 ; Check service 'freshness' (the default is NOT to check)
    freshness_threshold 1800
    notifications_enabled 1 ; Service notifications are enabled
    event_handler_enabled 0 ; Service event handler is disabled
    flap_detection_enabled 1 ; Flap detection is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    is_volatile 0
    check_period 24x7
    max_check_attempts 6
    normal_check_interval 20
    retry_check_interval 5
    notification_interval 720
    notification_period 24x7
    notification_options n
}

define service {
    use generic-service ; Name of service template to use
    host_name DETAH-R1
    service_description PING
    contact_groups device-admins,DETAH-admins,router-admins
    check_command check_ping!200,40%!1000,100%
}
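For a rough sense of the load these settings put on the main server, the sketch below estimates how many passive results arrive per second and how much slack the freshness threshold leaves. It assumes interval_length is at its default of 60 seconds (that setting lives in nagios.cfg and is not shown here), that every active check on a polling server produces exactly one passive result, and it ignores retries and host checks.

```python
DEVICES = 6800                # about 6800 devices, one service each
CHECK_INTERVAL_UNITS = 20     # normal_check_interval from the service template
INTERVAL_LENGTH_S = 60        # assumption: default interval_length in nagios.cfg
FRESHNESS_THRESHOLD_S = 1800  # freshness_threshold from the service template

# Convert the check interval from time units to seconds.
check_interval_s = CHECK_INTERVAL_UNITS * INTERVAL_LENGTH_S

# Average number of passive results reaching the main server each second.
results_per_second = DEVICES / check_interval_s

# Time left after a missed result before a service is considered stale.
freshness_margin_s = FRESHNESS_THRESHOLD_S - check_interval_s

print("check interval: %d s" % check_interval_s)
print("passive results: %.1f per second" % results_per_second)
print("freshness margin: %d s" % freshness_margin_s)
```

Under these assumptions the main server handles between five and six passive results per second on average, and a result that arrives more than ten minutes late would trip the freshness check.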

Raffy
