On Sun, Nov 4, 2012 at 1:34 PM, Mike Julian <[email protected]> wrote:
> Hey all,
>
> I've been racking my brain on how to build a highly-available Nagios
> infrastructure. Our monitoring is critical to our business, and only having
> it on one system is certainly a point of failure.

Yep.

> The officially-accepted method for doing HA with Nagios is two instances
> running on the same configuration, both doing the checks, but with
> notifications turned off on the second one. One downside to this is that it
> generates double the traffic due to doing all the checks twice.

Relative to your non-monitoring data traffic, is this really such a
big deal?  ICMP and SNMP traffic is pretty small.  SSH/HTTP probes are
relatively small as well.  So yes, there's duplication of the checks,
but does it really matter?


> The other downside, which is the more important one: if BoxA goes down and
> we turn on notifications for BoxB, then when BoxA comes back up, we have to
> work out some method to bring BoxA's performance data back to current, since
> there is a gap now.

Correct.  This is an issue with Active/Passive setups, unless you
somehow share the data.


> Moving away from Nagios isn't a viable option at this point in time either.
>
> Any of you doing a highly-available Nagios environment?

No.

But three ideas come to mind:

1) don't try to make the monitoring server HA, but make the system on
which it runs HA.  If you virtualize the Nagios server, then you can
work on making the underlying hardware fault tolerant.  Hot migration
(i.e. immediate, stateful failover) isn't universally supported,
although VMware can do it in some cases.  Warm and cold migrations
certainly are supported though.  Anyway, just a thought.

2) Sync storage from the active server to the passive one.  Nagios
keeps its state in flat files and RRD files.  You could copy those to
the backup server periodically (every minute?).
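A minimal sketch of that sync, assuming the state and RRD files live
under /var/nagios and /var/log/nagios/rrd (paths vary by distro and
build options) and the standby is reachable as "nagios-backup" -- both
names are placeholders:

```shell
# Hypothetical cron entries on the primary: every minute, push Nagios
# state (status.dat, retention.dat, etc.) and the RRDs to the standby.
# Paths and hostname are illustrative; adjust for your install.
* * * * * rsync -a --delete /var/nagios/ nagios-backup:/var/nagios/
* * * * * rsync -a /var/log/nagios/rrd/ nagios-backup:/var/log/nagios/rrd/
```

The --delete flag keeps the standby's copy consistent with the
primary; worst case, a failover loses about a minute of state.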

3) Store the data on a shared backend that is very fault tolerant (any
decent NAS storage solution will do this), and provide NFS failover.
When the primary Nagios system fails, the secondary need only start
the processes; the logs and state information are already there on the
shared storage backend.
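With shared storage, the standby's failover step can be little more
than mounting the export and starting the daemon.  A rough sketch,
with the NFS server name, export path, and init script as placeholders
(in practice a cluster manager such as heartbeat or keepalived would
invoke something like this):

```shell
#!/bin/sh
# Hypothetical failover script run on the standby when the primary dies.
# NFS server, paths, and service name are illustrative.
mount -t nfs nas01:/export/nagios /var/nagios || exit 1
rm -f /var/nagios/nagios.lock      # clear the stale lock left by the dead primary
/etc/init.d/nagios start           # state and logs are already in place
```

Removing the stale lock file matters: Nagios will refuse to start (or
a naive script will think it's already running) if the dead primary's
PID file is still sitting on the shared volume.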



-- 
Jesse Becker
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
