I have a problem with my Nagios server constantly crashing. It keeps printing
Out of Memory errors to the console, and I then lose access to the server: I
can ping the box, but I cannot SSH or browse to it to view any information.
This has been happening more and more often, and it now occurs about every
2-3 days. We have been adding more and more devices to the servers, and the
problem has gotten worse as we do.

This is how I have it set up. I have a main Nagios server running the latest
2.0 (stable) Nagios release. It monitors about 6800 devices but does not
actively check them. Its main role is to provide a web interface and to
receive passive check results from three other servers, which do the polling.
The main server also sends email notifications when a device goes down, about
30-40 emails a day. I am using NSCA 2.5 between the main server and the
client Nagios servers. I monitor only one service per device, either TCP or
ping depending on the device. Most devices (roughly 6000) are monitored with
TCP; the rest are monitored with ping. The devices are spread fairly evenly
across the individual servers, about 2000-2500 each.

All the servers are basic computers, Dell Dimension 2400s with base hardware.
The main server was upgraded to 2GB RAM, while the other servers have 512MB
each. They all have Celeron 2.4 GHz processors. The individual servers are
not having out of memory problems, and they are running the latest 2.0
(stable) release as well. They all run RedHat 9.0 with all packages
installed.

Can someone please help me resolve this problem?
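
To make the passive setup described above concrete, here is a minimal sketch
of the result line a polling server would hand to send_nsca for one service.
It assumes the usual send_nsca input format of one tab-separated line per
result (host name, service description, return code, plugin output). The
function name, the service description "PING", and the sample output text
are hypothetical and not taken from the real configuration.

```python
# Illustrative only: build the line that send_nsca reads on stdin for one
# passive service check result.
# Assumed field order: host, service description, return code, plugin output.

STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_passive_result(host, service, state, output):
    """Return one tab-separated, newline-terminated result line."""
    if state not in STATE_CODES:
        raise ValueError("unknown state: %r" % (state,))
    # Tabs or newlines inside the plugin output would break the framing,
    # so flatten them to spaces.
    clean = output.replace("\t", " ").replace("\n", " ")
    return "%s\t%s\t%d\t%s\n" % (host, service, STATE_CODES[state], clean)


if __name__ == "__main__":
    # "PING" is a made-up service description used only for this example.
    line = format_passive_result("DETAH-R1", "PING", "OK", "PING OK - 0% loss")
    print(line, end="")
```

Each polling server would produce one such line per check result, so the
main server has to accept and process one of these for every check run
across all three pollers.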
Thanks.

The top output does not look like the server is running out of memory. This
is the normal output when the server has been running for a few hours.

57 processes: 54 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  41.1% user  58.8% system   0.0% nice   0.0% iowait   0.0% idle
Mem:  2063556k av,  285940k used, 1777616k free,       0k shrd,   41056k buff
                    177644k actv,   51688k in_d,   10892k in_c
Swap: 1044184k av,       0k used, 1044184k free                  114208k cached

Here is a sample configuration that I have on the devices on
the main server:

hosts.cfg

define host {
        name                            generic-host    ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           0       ; Host event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        max_check_attempts              10
        notification_interval           720
        notification_period             24x7
        obsess_over_host                0
        notification_options            d,u,r,f
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

define host {
        use                             generic-host    ; Name of host template to use
        host_name                       DETAH-R1
        alias                           DETAH-R1
        address                         x.x.x.x
        check_command                   check_ping!200,40%!10000,100%
        contact_groups                  device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
        name                            generic-service ; The 'name' of this service template, referenced in other service definitions
        active_checks_enabled           0       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             0       ; We should obsess over this service (if necessary)
        check_freshness                 1       ; Default is to NOT check service 'freshness'
        freshness_threshold             1800
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           0       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              6
        normal_check_interval           20
        retry_check_interval            5
        notification_interval           720
        notification_period             24x7
        notification_options            n
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

define service {
        use                             generic-service ; Name of service template to use
        host_name                       DETAH-R1
        service_description
        contact_groups                  device-admins,DETAH-admins,router-admins
        check_command                   check_ping!200,40%!1000,100%
}

Here is a sample config on the individual server.

hosts.cfg

define host {
        name                            generic-host    ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           0       ; Host event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        max_check_attempts              10
        notification_interval           720
        notification_period             24x7
        obsess_over_host                0
        notification_options            d,u,r,f
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

define host {
        use                             generic-host    ; Name of host template to use
        host_name                       DETAH-R1
        alias                           DETAH-R1
        address                         x.x.x.x
        check_command                   check_ping!200,40%!10000,100%
        contact_groups                  device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
        name                            generic-service ; The 'name' of this service template, referenced in other service definitions
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 1       ; Default is to NOT check service 'freshness'
        freshness_threshold             1800
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           0       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              6
        normal_check_interval           20
        retry_check_interval            5
        notification_interval           720
        notification_period             24x7
        notification_options            n
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

define service {
        use                             generic-service ; Name of service template to use
        host_name                       DETAH-R1
        service_description
        contact_groups                  device-admins,DETAH-admins,router-admins
        check_command                   check_ping!200,40%!1000,100%
}

Raffy
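
As a rough sanity check on the load these settings imply, the sketch below
estimates how many passive results reach the main server per second and how
much slack the freshness threshold leaves. It assumes interval_length is 60
seconds (the Nagios default; it is not shown in the configs above) and that
every check result on the polling servers is forwarded. The function names
are made up for illustration.

```python
# Rough load estimate from the figures in this post.
# Assumption: interval_length=60 (the Nagios default), so a
# normal_check_interval of 20 means one check every 1200 seconds.


def results_per_second(services, check_interval, interval_length=60):
    """Average passive results arriving per second if every service is
    checked once per normal_check_interval and each result is forwarded."""
    return services / float(check_interval * interval_length)


def freshness_slack(freshness_threshold, check_interval, interval_length=60):
    """Seconds left between an expected result and the freshness threshold."""
    return freshness_threshold - check_interval * interval_length


if __name__ == "__main__":
    rate = results_per_second(6800, 20)
    slack = freshness_slack(1800, 20)
    print("about %.1f results/s on average" % rate)
    print("freshness slack: %d s" % slack)
```

With the numbers in this post that comes to roughly 5.7 results per second on
average and 600 seconds of slack before a service is considered stale. Retry
checks every 5 minutes for services in a non-OK state would push the rate
higher.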
- [Nagios-users] Nagios 'Out Of Memory' Problems Armistead, Raffy
