OK, after a second failover test this is now working. Here are some other issues we encountered and how we solved them:
1. The slaves would not reconnect to the master, even though the autossh process was attempting to do so. The slaves were connecting back to the same IP (an external address which NATs to the floating IP on the Opsview master nodes), and ssh complained that the host key had changed, because a different master node was now answering on that address. A quick fix was to ensure that the master nodes all use the same host key. It might not be the perfect solution, but it works.

2. The master would not reconnect to the slaves, even though they had now re-established their tunnels correctly (as in point 1). The slave host keys were not in the nagios user's known_hosts file on this master. It should contain an entry for the HostKeyAlias of each slave it connects to, but these were missing. A quick fix was to copy them from the primary master, which did have them.

Now to get this all automated by Heartbeat - then we're done.
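For anyone wanting to check point 2 before the next failover test, here is a rough sketch in Python of how the known_hosts file could be checked for missing aliases. It is not part of Opsview. The function names and the alias names "slave1" and "slave2" are made up for illustration, and it assumes the entries are stored unhashed (with HashKnownHosts enabled the names cannot be compared as plain text).

```python
def known_host_names(known_hosts_text):
    """Return the set of plain (unhashed) host names found in known_hosts text."""
    names = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and marker lines (@cert-authority, @revoked).
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete "hosts keytype key" entry
        # The first field is a comma-separated list of names or patterns.
        for name in fields[0].split(","):
            if not name.startswith("|"):  # hashed names cannot be matched here
                names.add(name)
    return names


def missing_aliases(known_hosts_text, expected_aliases):
    """Return the expected aliases that have no entry, sorted by name."""
    present = known_host_names(known_hosts_text)
    return sorted(alias for alias in expected_aliases if alias not in present)


if __name__ == "__main__":
    # Hypothetical sample content; the key data is a placeholder, not a real key.
    sample = (
        "# copied from the primary master\n"
        "slave1 ssh-rsa AAAAB3NzaC1yc2EAAAAexample\n"
    )
    print(missing_aliases(sample, ["slave1", "slave2"]))  # prints ['slave2']
```

On a real master you would read the nagios user's known_hosts file instead of the sample string and pass in whatever HostKeyAlias values your slave connections actually use. Anything it reports as missing is what would need copying over from the primary master.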
