My Nagios server keeps crashing. It prints Out of Memory errors on the console and then becomes inaccessible: I can still ping the box, but I cannot SSH into it or reach the web interface to view any information. This has been happening more and more often, and it now occurs about every 2-3 days. We have been adding more and more devices to the servers, and the problem has gotten worse as we do. This is how I have it set up.


I have a main Nagios server running the latest 2.0 (stable) Nagios release. It monitors about 6800 devices but does not actively check them. Its main role is to provide a web interface and to receive passive check results from three other servers, which do the polling. The main server also sends email notifications when a device goes down, about 30-40 emails a day. I am using NSCA 2.5 between the main server and the client Nagios servers. I monitor only one service per device, either TCP or ping depending on the device. Most devices (roughly 6000) are monitored with TCP; the rest are monitored with ping. The devices are spread fairly evenly across the polling servers, about 2000-2500 each.


All the servers are basic computers, Dell Dimension 2400s with base hardware. The main server was upgraded to 2GB of RAM, while the other servers have 512MB each. They all have 2.4 GHz Celeron processors. The individual servers are not having out-of-memory problems, and they are running the latest 2.0 (stable) release as well. They all run RedHat 9.0 with all packages installed.


Can someone please help me resolve this problem? Thanks.

The output of top does not look like the server is running out of memory. This is the normal output after the server has been running for a few hours:

57 processes: 54 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  41.1% user  58.8% system   0.0% nice   0.0% iowait   0.0% idle
Mem:  2063556k av,  285940k used, 1777616k free,       0k shrd,   41056k buff
                    177644k actv,   51688k in_d,   10892k in_c
Swap: 1044184k av,       0k used, 1044184k free                  114208k cached
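
As a rough check on those numbers, the Mem and Swap lines from the top output above can be parsed to show how much memory was in use at that moment. This is only a sketch: the function names are made up for illustration, and it says nothing about memory use just before a crash, only about this one snapshot.

```python
import re

# Memory and swap lines copied from the top output above.
TOP_MEM = "Mem:  2063556k av,  285940k used, 1777616k free,       0k shrd,   41056k buff"
TOP_SWAP = "Swap: 1044184k av,       0k used, 1044184k free                  114208k cached"


def parse_top_line(line):
    """Return {label: kilobytes} for every '<number>k <label>' pair in a top line."""
    return {label: int(kb) for kb, label in re.findall(r"(\d+)k\s+(\w+)", line)}


def used_percent(fields):
    """Percentage of the available total ('av') that is reported as used."""
    return 100.0 * fields["used"] / fields["av"]


mem = parse_top_line(TOP_MEM)
swap = parse_top_line(TOP_SWAP)

print("RAM used:  %.1f%%" % used_percent(mem))
print("Swap used: %.1f%%" % used_percent(swap))
```

For this snapshot that works out to roughly 14% of RAM and no swap in use, which matches the impression that memory is not under pressure a few hours after a restart.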


Here is a sample of the configuration for a device on the main server:


hosts.cfg

define host {
    name                           generic-host    ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled          1               ; Host notifications are enabled
    event_handler_enabled          0               ; Host event handler is disabled
    flap_detection_enabled         1               ; Flap detection is enabled
    process_perf_data              1               ; Process performance data
    retain_status_information      1               ; Retain status information across program restarts
    retain_nonstatus_information   1               ; Retain non-status information across program restarts
    max_check_attempts             10
    notification_interval          720
    notification_period            24x7
    obsess_over_host               0
    notification_options           d,u,r,f
    register                       0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
}

define host {
    use                            generic-host    ; Name of host template to use
    host_name                      DETAH-R1
    alias                          DETAH-R1
    address                        x.x.x.x
    check_command                  check_ping!200,40%!10000,100%
    contact_groups                 device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
    name                           generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled          0               ; Active service checks are disabled
    passive_checks_enabled         1               ; Passive service checks are enabled/accepted
    parallelize_check              1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service            0               ; Do not obsess over this service
    check_freshness                1               ; Service 'freshness' checking is enabled
    freshness_threshold            1800
    notifications_enabled          1               ; Service notifications are enabled
    event_handler_enabled          0               ; Service event handler is disabled
    flap_detection_enabled         1               ; Flap detection is enabled
    process_perf_data              1               ; Process performance data
    retain_status_information      1               ; Retain status information across program restarts
    retain_nonstatus_information   1               ; Retain non-status information across program restarts
    is_volatile                    0
    check_period                   24x7
    max_check_attempts             6
    normal_check_interval          20
    retry_check_interval           5
    notification_interval          720
    notification_period            24x7
    notification_options           n
    register                       0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
}

define service {
    use                            generic-service ; Name of service template to use
    host_name                      DETAH-R1
    service_description            PING
    contact_groups                 device-admins,DETAH-admins,router-admins
    check_command                  check_ping!200,40%!1000,100%
}
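
The main server's template combines active_checks_enabled 0 with check_freshness 1 and freshness_threshold 1800. As far as the Nagios documentation describes it, a service whose latest result is older than the threshold is treated as stale and its check_command is run on the main server even though active checks are disabled. The sketch below only illustrates how much slack the threshold leaves relative to the 20-unit check interval. It assumes interval_length is the default 60 seconds (nagios.cfg is not shown here), and is_stale is a hypothetical helper, not Nagios's own logic.

```python
FRESHNESS_THRESHOLD_S = 1800   # freshness_threshold from the main server's template
NORMAL_CHECK_INTERVAL = 20     # normal_check_interval from the service template
INTERVAL_LENGTH_S = 60         # ASSUMPTION: default interval_length, not shown in the post


def is_stale(result_age_s, threshold_s=FRESHNESS_THRESHOLD_S):
    """Hypothetical helper: True when the newest result is older than the threshold."""
    return result_age_s > threshold_s


# Expected time between two passive results for the same service.
expected_gap_s = NORMAL_CHECK_INTERVAL * INTERVAL_LENGTH_S
# How late a result may arrive before the service counts as stale.
slack_s = FRESHNESS_THRESHOLD_S - expected_gap_s

print("expected gap between results: %d s" % expected_gap_s)
print("slack before a result counts as stale: %d s" % slack_s)
print("result 25 minutes old is stale:", is_stale(25 * 60))
print("result 31 minutes old is stale:", is_stale(31 * 60))
```

Under those assumptions a passive result may arrive up to 10 minutes late before the main server would start running check_ping itself for that service.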

 

Here is a sample of the configuration on an individual server:


hosts.cfg

define host {
    name                           generic-host    ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled          1               ; Host notifications are enabled
    event_handler_enabled          0               ; Host event handler is disabled
    flap_detection_enabled         1               ; Flap detection is enabled
    process_perf_data              1               ; Process performance data
    retain_status_information      1               ; Retain status information across program restarts
    retain_nonstatus_information   1               ; Retain non-status information across program restarts
    max_check_attempts             10
    notification_interval          720
    notification_period            24x7
    obsess_over_host               0
    notification_options           d,u,r,f
    register                       0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
}

define host {
    use                            generic-host    ; Name of host template to use
    host_name                      DETAH-R1
    alias                          DETAH-R1
    address                        x.x.x.x
    check_command                  check_ping!200,40%!10000,100%
    contact_groups                 device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
    name                           generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled          1               ; Active service checks are enabled
    passive_checks_enabled         1               ; Passive service checks are enabled/accepted
    parallelize_check              1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service            1               ; We should obsess over this service (if necessary)
    check_freshness                1               ; Service 'freshness' checking is enabled
    freshness_threshold            1800
    notifications_enabled          1               ; Service notifications are enabled
    event_handler_enabled          0               ; Service event handler is disabled
    flap_detection_enabled         1               ; Flap detection is enabled
    process_perf_data              1               ; Process performance data
    retain_status_information      1               ; Retain status information across program restarts
    retain_nonstatus_information   1               ; Retain non-status information across program restarts
    is_volatile                    0
    check_period                   24x7
    max_check_attempts             6
    normal_check_interval          20
    retry_check_interval           5
    notification_interval          720
    notification_period            24x7
    notification_options           n
    register                       0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
}

define service {
    use                            generic-service ; Name of service template to use
    host_name                      DETAH-R1
    service_description            PING
    contact_groups                 device-admins,DETAH-admins,router-admins
    check_command                  check_ping!200,40%!1000,100%
}
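
To put the load on the main server into numbers, here is a back-of-the-envelope estimate of how many passive results it should receive in steady state. It is a sketch under stated assumptions: interval_length is taken to be the default 60 seconds (nagios.cfg is not shown here), every active check result on the pollers is assumed to be forwarded through NSCA, and retries at retry_check_interval are ignored. The function name is hypothetical.

```python
TOTAL_SERVICES = 6800          # one service per device, about 6800 devices
NORMAL_CHECK_INTERVAL = 20     # normal_check_interval from the service template
INTERVAL_LENGTH_S = 60         # ASSUMPTION: default interval_length, not shown in the post


def results_per_second(services, interval_units, unit_seconds):
    """Average number of check results per second when every service is
    checked once per normal check interval."""
    return services / float(interval_units * unit_seconds)


# Load arriving at the main server from all three pollers combined.
rate_per_s = results_per_second(TOTAL_SERVICES, NORMAL_CHECK_INTERVAL, INTERVAL_LENGTH_S)
rate_per_min = rate_per_s * 60

print("main server: %.2f results/s (%.0f per minute)" % (rate_per_s, rate_per_min))

# Load generated by a single poller at the low and high end of 2000-2500 devices.
for poller_services in (2000, 2500):
    poller_rate = results_per_second(poller_services, NORMAL_CHECK_INTERVAL, INTERVAL_LENGTH_S)
    print("poller with %d services: %.2f results/s" % (poller_services, poller_rate))
```

Under those assumptions the main server handles roughly 5.7 passive results per second, or about 340 per minute, and that figure grows linearly as more devices are added.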

 

Raffy