Hello all!
I've been working happily with Nagios since 2006.
Since then, the number of devices our company monitors has grown day by day. Today our
monitoring system is set up as follows:
- 1 cluster (2 Xeon nodes, 2 GB RAM each) in HA with Heartbeat; only one node is active at a time, the other
  stands by in case the first fails
- CentOS 5.4
- Nagios 3.2.0 (Monarch as web GUI for configuration)
It currently monitors 800 hosts and 2000 services, with the following stats:
Metric                  Min.       Max.        Average
Check Execution Time:   0.00 sec   15.05 sec   2.681 sec
Check Latency:          0.00 sec   12.09 sec   0.785 sec
Percent State Change:   0.00%      12.11%      0.18%
There are also some distributed installations that report status back to the
core via NSCA.
All of this is configured manually, with only the (great) help of Monarch.
In the coming months we need to merge in another big monitoring system, which will
push the numbers up a lot: in the end we may have to monitor 2000 hosts and 13,000
services. That could be a problem for my Xeon server, and I'll need new hardware.
It could also be a good moment to move to a distributed solution to reduce the load on a single server.
The native distributed solution (NSCA) could work, but it has two big
limitations:
1. You need to maintain duplicate configuration across the different Nagios
installations.
2. If a distributed Nagios server goes down, all checks performed by that
remote server would be considered CRITICAL.
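For reference, this is roughly how a slave submits passive results: one tab-separated line per check result, piped into send_nsca. The host name, service, output, and master address below are all made-up placeholders, just a sketch:

```shell
#!/bin/sh
# Build one passive service-check result in the format send_nsca reads:
#   <host>\t<service_description>\t<return_code>\t<plugin_output>
HOST="web01"                               # monitored host (placeholder)
SERVICE="HTTP"                             # service description (placeholder)
CODE=0                                     # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
OUTPUT="HTTP OK - 0.12s response time"     # plugin output (placeholder)

printf '%s\t%s\t%s\t%s\n' "$HOST" "$SERVICE" "$CODE" "$OUTPUT"

# On a real slave the line above would be piped to the master, e.g.:
#   ... | send_nsca -H master.example.com -c /etc/nagios/send_nsca.cfg
```

If the slave dies, no such lines arrive, and the master sees the results go stale, which is limitation 2 above.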
Googling around, the first thing I found was DNX, but officially it does NOT support a distributed setup with some
servers in a DMZ. That is: I have some Nagios slaves behind a firewall that can reach ONLY the devices behind that
firewall. The check results must be sent (via NSCA) to the master, which sends out SMS notifications and collects
logs (for SLA reporting), and those slaves must check ONLY the devices on their own LAN. DNX (officially, as of
today) doesn't support selectively assigning devices to individual slaves; instead, checks are load-balanced across all slaves.
And here comes Opsview Community Edition! It could help me to extend my
installation... I hope :)
So I have some questions; I hope someone can help me:
1. Is it compatible AND stable with Nagios 3.2.0? That is, can I import all of my configuration AND ALL LOGS into
the new system?
2. How many hosts/services could I manage from the central core with this solution? Some production scenario
examples, with numbers, would help.
3. Is my understanding of the documentation correct?
There is one master (or more, in Heartbeat active/passive) that collects all information and sends out
notifications (mail, SMS via GSM modem, or whatever I like).
It's possible to add two different types of slave servers:
a) Single slave installations. Each slave is handled as a separate datacenter. This is the solution that could
be used to monitor devices not directly reachable from the master; status is sent back to the master via NSCA.
b) Multiple slaves in a cluster. They are handled, as above, as a separate datacenter, but the checks in the slave
cluster are divided between all slaves for load balancing as well as high availability: if one slave dies, the
others take over its hosts/services.
4. What happens if a slave fails? I mean a slave as in point 3a, i.e. one that is
not in a cluster.
5. Can the master do active checks too, or when slaves are present does it
delegate the checks to the slaves only?
6. Can I still use our custom plugins? They are bash scripts that perform checks based on Nagios macros
(HOSTADDRESS, ...) and return, as expected, a message and an exit code. Very simple.
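To be concrete, our plugins look roughly like this minimal sketch (the ping check, timeout, and the way the address is passed are illustrative, not our actual scripts):

```shell
#!/bin/sh
# Minimal Nagios-style plugin sketch. The host address arrives as the first
# argument, typically passed as $HOSTADDRESS$ in the Nagios command definition.
# The plugin prints one status line and returns a standard Nagios exit code:
# 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

check_host() {
    addr="$1"
    if [ -z "$addr" ]; then
        echo "UNKNOWN - no host address given"
        return 3
    fi
    # Example check: a single ping with a 2-second timeout (illustrative)
    if ping -c 1 -W 2 "$addr" >/dev/null 2>&1; then
        echo "OK - $addr is alive"
        return 0
    else
        echo "CRITICAL - $addr is unreachable"
        return 2
    fi
}

# In the real plugin the last lines would be:
#   check_host "$1"
#   exit $?
```

The question is simply whether plugins following this message-plus-exit-code convention keep working unchanged under Opsview.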
I know I've asked a lot, but based on your answers I'll start some tests.
Thanks a lot for your help!
Warmest Regards,
Simon
_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users