Laurent Neiger wrote:
We could check at regular intervals (shorter than the ~10s OCFS2 timeout, say every 5 seconds) whether the network link between the two nodes is up. If it is not, then on maq2, if the network itself is still OK (checking ifconfig status, or pinging a third party such as a router), maq2 is fine and the link between the nodes was lost because of maq1.
So on maq2 we stop the OCFS2 heartbeat to avoid self-fencing, using
ocfs2_hb_ctl -K -d /dev/drbd0 (please tell me if I misunderstood this command), and remote-fence maq1 (if it is not a power-supply failure but, say, a network-card one,
we power off the bad node).
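The procedure described above could be sketched as a small watchdog script. This is only an illustration, not a tested implementation: the node name maq1, the gateway address, the DRBD device path, and the 5-second interval are all assumptions taken from the thread, and the fencing action itself is left as a stub.

```shell
#!/bin/sh
# Hypothetical watchdog sketch for the scheme described above.
# PEER, GATEWAY and DRBD_DEV are assumptions -- adjust for your setup.
PEER=${PEER:-maq1}
GATEWAY=${GATEWAY:-192.168.0.1}
DRBD_DEV=${DRBD_DEV:-/dev/drbd0}
INTERVAL=5        # must stay well under the O2CB network timeout

reachable() {
    # one ping, 2-second reply deadline, quiet
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

decide() {
    # prints the action to take given peer/gateway reachability (0=up, 1=down)
    peer=$1; gw=$2
    if [ "$peer" -eq 0 ]; then
        echo ok            # cluster comm is fine, nothing to do
    elif [ "$gw" -eq 0 ]; then
        echo fence-peer    # we still see the network: the peer is the problem
    else
        echo local-fault   # our own link is down: do nothing rash
    fi
}
```

The main loop would call `reachable "$PEER"` and `reachable "$GATEWAY"` every $INTERVAL seconds, and on `fence-peer` run `ocfs2_hb_ctl -K -d "$DRBD_DEV"` (as quoted above) before powering off the peer through whatever fencing device is available. The `local-fault` branch is the subtle case: acting on it risks exactly the false positives discussed below.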

So our cluster will continue to work in degraded mode until we repair and power
up maq1 and restart o2cb and ocfs2 on both nodes.

So do you think doing that would be effective for building a robust cluster, or do you
have a better idea?

Each of those pings will require a timeout - a short one. So short that you
may not even be able to distinguish errors from an overloaded run queue,
transmit queue, router, etc. You would need external hardware probes to
distinguish slowdowns from errors.

The easy solution for your problem is to use net-bonding.
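For reference, an active-backup bond over two NICs is the usual way to survive a single card or cable failure. The sketch below is a minimal example, not a recommendation for any particular distro: the interface names eth0/eth1, the address, and the miimon value are all placeholder assumptions.

```shell
# Minimal active-backup bonding sketch (interface names and
# address are assumptions -- adapt to your distribution's config).
modprobe bonding mode=active-backup miimon=100   # check link state every 100 ms
ifconfig bond0 192.168.0.2 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1                        # either NIC can carry cluster traffic
```

With this in place the cluster interconnect survives the failure of either slave NIC without OCFS2 ever noticing, which sidesteps the ping-and-fence logic entirely for that failure class.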

But then I guess you can rephrase the issue with some other specific hardware
error that allows the node to run standalone but not in the cluster. And what
if that node is the lower-numbered one?

In the end, you have to have shutdown windows - windows in which you can recycle the cluster. There is a reason people talk about 99.999% uptime and not 100%. ;)



_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users