On 19 May 2010, at 14:21, Toni Van Remortel wrote:
I upgraded my Opsview yesterday to 3.7.0
After I made my changes in the contacts (merging the separate
'contacts' for each person into one contact with several profiles
attached to it), the reload of the entire system went from 3
minutes 50 seconds to almost 30 minutes.
An almost eightfold slowdown is very extreme. Can you send the output of
/usr/local/nagios/var/log/create_and_send_configs.debug?
When I watch the processes on the master server, I do see the
nagconfgen.pl scripts running at high speed, followed by a minute of 100%
CPU usage by the Nagios process. Then it goes quiet on the master
server, but the reload indicator states it is still busy.
This is opsviewd.log:
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Starting
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Only running when OK - state is currently CRITICAL
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Only running when OK - state is currently CRITICAL
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:04:02] [import_runtime] [INFO] Starting
[2010/05/19 15:04:02] [import_runtime] [INFO] Importing for 2010-05-19 12:00:00
[2010/05/19 15:04:02] [import_runtime] [INFO] Importing all results and performance data
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing downtime starts
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing downtime ends
[2010/05/19 15:04:24] [import_runtime] [INFO] Checking for incorrect downtimes
[2010/05/19 15:04:24] [import_runtime] [INFO] Caculating relevant downtimes
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing notifications
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing acknowledgements
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing state history
[2010/05/19 15:04:25] [import_runtime] [INFO] Calculating hourly availability
[2010/05/19 15:04:31] [import_runtime] [INFO] Finished import for hour
[2010/05/19 15:04:32] [import_runtime] [INFO] Finished
[2010/05/19 15:14:06] [create_and_send_configs] [INFO] Ending overall with error=0
There's an import into ODW in the middle (import_runtime), which is
irrelevant here. There are lots of event handler calls to
slave_node_event_handler - I assume these occur during the transfer. I'm
guessing this means the Slave-node checks to the slaves are
having issues. Is the line between the master and the slaves
saturated? Lots of errors causing re-transfers?
I guess this is because the config files for the contacts are now
huge:
-rw-r----- 1 nagios nagios 2.2M 2010-05-19 14:47 contactgroups.cfg
-rw-r----- 1 nagios nagios 13M 2010-05-19 14:47 contacts.cfg
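For what it's worth, a quick way to see how many contact objects the generator actually produced (the grep pattern assumes standard Nagios "define contact" block syntax; the sample file below is made up so the commands run anywhere - on a real master, point them at the .cfg files listed above):

```shell
# Count Nagios object definitions in a generated config file.
# Assumes the standard "define contact {" block syntax.
count_defs() {   # usage: count_defs <object-type> <file>
  grep -c "define $1" "$2"
}

# Inline sample so this is self-contained; substitute
# /usr/local/nagios/etc/contacts.cfg etc. on a real system.
cat > /tmp/contacts_sample.cfg <<'EOF'
define contact {
    contact_name    toni
}
define contact {
    contact_name    ops-team
}
EOF

count_defs contact /tmp/contacts_sample.cfg   # prints 2
```

A big jump in the contact count after the merge would explain the file-size blow-up directly.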
Did you make any other changes such as adding new host groups or
adding new service groups? I was expecting that 3.7.0 would decrease
the size of the contacts and contactgroup pages (because we've removed
all the distprofile and masterprofile configurations).
Yes, my slaves are reachable over slow lines; that's the whole idea of a
slave.
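For a sense of scale, here's a back-of-envelope estimate of what a 13 MB contacts.cfg costs over a slow line. The 1 Mbit/s link speed and the 10:1 gzip ratio are illustrative assumptions, not measured values:

```shell
# Rough transfer-time arithmetic for the 13 MB contacts.cfg.
SIZE_MB=13
LINK_MBIT=1    # assumed 1 Mbit/s master-to-slave WAN link
echo "uncompressed: $(( SIZE_MB * 8 / LINK_MBIT )) s"        # 104 s
# Nagios configs are highly repetitive text, so ~10:1 gzip is plausible.
echo "compressed:   $(( SIZE_MB * 8 / LINK_MBIT / 10 )) s"   # 10 s
```

Even compressed, a file this size adds noticeable time per slave; multiplied across several slaves on slow links, it could account for much of the longer reload.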
How are these configs copied to the slave? An entire copy with scp?
Wouldn't rsync be much better? After all, the changes between
reloads are usually small.
We compress the config files and send them using scp; a post job on the
other end then extracts and places them accordingly.
We could switch to rsync, but that would involve quite a bit of
development work (there's also some post-processing we do for
slave-node-specific configurations).
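These aren't Opsview's actual scripts, but the two transfer styles side by side, demonstrated between two local directories standing in for the master and slave config trees (in real use the cp/rsync targets would be scp/ssh hosts):

```shell
# Two stand-in directories; on a real system these would be
# /usr/local/nagios/etc on the master and on the slave.
MASTER=/tmp/demo-master; SLAVE=/tmp/demo-slave
mkdir -p "$MASTER" "$SLAVE"
echo 'define contact { contact_name toni }' > "$MASTER/contacts.cfg"

# 1. Current style: compress the whole tree, ship it, extract on the
#    other end (cp would be scp on a real slave).
tar czf /tmp/configs.tar.gz -C "$MASTER" .
tar xzf /tmp/configs.tar.gz -C "$SLAVE"

# 2. Delta style: rsync sends only changed blocks, compressed on the
#    wire (-z is implied by -a? no - add it explicitly), so small
#    reload-to-reload diffs cost almost nothing.
rsync -az --delete "$MASTER/" "$SLAVE/"
```

The caveat above still applies: since Opsview post-processes the configs for each slave node after the copy, rsync wouldn't be a drop-in replacement for the scp step.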
Ton
_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users