Hello all!
I've been working happily with Nagios since 2006.
Since then, the number of devices our company monitors has grown day by day. Today our
monitoring system is set up as follows:
- 1 cluster (2 Xeon nodes, 2 GB RAM each) in HA with Heartbeat; only one node is active at a time, the other
  stands by in case the first fails
- CentOS 5.4
- Nagios 3.2.0 (Monarch as web GUI for configuration)
It currently monitors 800 hosts and 2000 services, with the following stats:
Metric                  Min.       Max.        Average
Check Execution Time:   0.00 sec   15.05 sec   2.681 sec
Check Latency:          0.00 sec   12.09 sec   0.785 sec
Percent State Change:   0.00%      12.11%      0.18%
There are also some distributed installations that report status back to the
core via NSCA.
All of this is configured manually, with only the (great) help of Monarch.
In the coming months we need to merge in another big monitoring system, which will
push the numbers up a lot: in the end we may have to monitor 2000 hosts and 13,000
services. That could be a problem for my Xeon server, and I'll need new hardware.
It could also be a good moment to move to a distributed solution to reduce the load on a single server.
The native distributed solution (NSCA) could work, but it has two big
limitations:
1. You need to maintain duplicate configuration across the different Nagios
installations.
2. If a distributed Nagios server goes down, all checks performed by that
remote server would be considered CRITICAL.
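For reference, this is roughly how a slave submits passive results: one tab-separated line per check result, piped into send_nsca. The host name, service, output, and master address below are all made-up placeholders, just a sketch:

```shell
#!/bin/sh
# Build one passive service-check result in the format send_nsca reads:
#   <host>\t<service_description>\t<return_code>\t<plugin_output>
HOST="web01"                               # monitored host (placeholder)
SERVICE="HTTP"                             # service description (placeholder)
CODE=0                                     # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
OUTPUT="HTTP OK - 0.12s response time"     # plugin output (placeholder)

printf '%s\t%s\t%s\t%s\n' "$HOST" "$SERVICE" "$CODE" "$OUTPUT"

# On a real slave the line above would be piped to the master, e.g.:
#   ... | send_nsca -H master.example.com -c /etc/nagios/send_nsca.cfg
```

If the slave dies, no such lines arrive, and the master sees the results go stale, which is limitation 2 above.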
Googling around, the first thing I found was DNX, but officially it does NOT support a distributed setup with some
servers in a DMZ. That is: I have some Nagios slaves behind a firewall that can reach ONLY the devices behind that
firewall. The check results must be sent (via NSCA) to the master, which sends out SMS notifications and collects
logs (for SLA reporting), and those slaves must check ONLY the devices on their own LAN. DNX (officially, as of
today) doesn't support selectively assigning devices to individual slaves; instead, checks are load-balanced across all slaves.
And here comes Opsview Community Edition! It could help me to extend my
installation... I hope :)
So I have some questions; I hope someone can help me:
1. Is it compatible AND stable with Nagios 3.2.0? That is, can I import all of my configuration AND ALL LOGS into
the new system?
2. How many hosts/services could I manage from the central core with this solution? Some production scenario
examples, with numbers, would help.
3. Is my understanding of the documentation correct?
There is one master (or more, in Heartbeat active/passive) that collects all information and sends out
notifications (mail, SMS via GSM modem, or whatever I like).
It's possible to add two different types of slave servers:
a) Single slave installations. Each slave is handled as a separate datacenter. This is the solution that could
be used to monitor devices not directly reachable from the master; status is sent back to the master via NSCA.
b) Multiple slaves in a cluster. They are handled, as above, as a separate datacenter, but the checks in the slave
cluster are divided between all slaves for load balancing as well as high availability: if one slave dies, the
others take over its hosts/services.
4. What happens if a slave fails? I mean a slave as in point 3a, i.e. one that is
not in a cluster.
5. Can the master do active checks too, or when slaves are present does it
delegate the checks to the slaves only?
6. Can I still use our custom plugins? They are bash scripts that perform checks based on Nagios macros
(HOSTADDRESS, ...) and return, as expected, a message and an exit code. Very simple.
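To be concrete, our plugins look roughly like this minimal sketch (the ping check, timeout, and the way the address is passed are illustrative, not our actual scripts):

```shell
#!/bin/sh
# Minimal Nagios-style plugin sketch. The host address arrives as the first
# argument, typically passed as $HOSTADDRESS$ in the Nagios command definition.
# The plugin prints one status line and returns a standard Nagios exit code:
# 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

check_host() {
    addr="$1"
    if [ -z "$addr" ]; then
        echo "UNKNOWN - no host address given"
        return 3
    fi
    # Example check: a single ping with a 2-second timeout (illustrative)
    if ping -c 1 -W 2 "$addr" >/dev/null 2>&1; then
        echo "OK - $addr is alive"
        return 0
    else
        echo "CRITICAL - $addr is unreachable"
        return 2
    fi
}

# In the real plugin the last lines would be:
#   check_host "$1"
#   exit $?
```

The question is simply whether plugins following this message-plus-exit-code convention keep working unchanged under Opsview.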
I know I've asked a lot, but based on your answers I'll start some tests.
Thanks a lot for your help!
Warmest Regards,
Simon
_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users