On 4/27/2015 1:28 PM, Vasil Valchev wrote:
Hi,
I would advise you to use a quorum disk _only_ as a last resort - it's
better to first get a solid understanding of the clustering solution
before adding more complexity.
You can find an amazingly thorough and well-written tutorial here:
https://alteeve.ca/w/AN!Cluster_Tutorial_2
[Jatin] Thank you very much for sharing this tutorial. I will surely go
through it and gain more understanding.
Especially useful are the first chapters - the theory.
What I suspect is happening in your case is that your cluster
communication and fencing are over the same network, which is not
fault tolerant.
[Jatin]
My cluster communication happens over one network while fencing happens
over another network. I use two separate VLANs for this purpose. Also,
when the cluster communication fails due to a network outage, fencing
happens over the other VLAN and both nodes get fenced.
So what happens if this network fails? Your 2 nodes can't see each
other, so they send fence requests, but the fence devices are
unreachable too, so those requests fail.
I think they are retried a few times, but if every attempt fails, the
fence agent returns a failure and your cluster is stuck in a "recovering"
or stopped state.
Other times the network outage is shorter and the fence succeeds,
resulting in both nodes going down - this is solved with the delay
parameter, which gives one node a head start so that only one of them
is fenced.
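
To make the fence race concrete, here is a rough Python sketch of why
both nodes go down without a delay and why a delay on one side breaks
the tie. It is only a toy model, not how the fence daemon is actually
implemented: the Node class, the fence_race function and the
FENCE_DURATION value are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    fence_delay: float  # seconds this node waits before fencing its peer (hypothetical)


# Assumed time (seconds) a fence action needs to power off the peer.
FENCE_DURATION = 2.0


def fence_race(a: Node, b: Node) -> set:
    """Return the names of the nodes that end up powered off."""
    # Each node starts fencing after its own delay and finishes
    # FENCE_DURATION seconds later.
    a_kills_b_at = a.fence_delay + FENCE_DURATION
    b_kills_a_at = b.fence_delay + FENCE_DURATION

    fenced = set()
    # A fence action only completes if the peer has not finished
    # fencing this node first. A tie means both actions complete.
    if a_kills_b_at <= b_kills_a_at:
        fenced.add(b.name)
    if b_kills_a_at <= a_kills_b_at:
        fenced.add(a.name)
    return fenced


# No delay on either side: both fence actions finish together.
print(fence_race(Node("node1", 0.0), Node("node2", 0.0)))
# node2 waits 5 seconds: node1 wins the race and only node2 is fenced.
print(fence_race(Node("node1", 0.0), Node("node2", 5.0)))
```

In this model the delay only has to be longer than the time one fence
action takes, so that the slower node is already off before its own
request would fire.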
The first issue is an architectural one: it is the expected behavior of
the cluster to stop (or "freeze") all resources if it can't guarantee
the state of all members.
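
The "freeze" rule can be written down as a small decision function.
Again this is just a sketch of the reasoning, with a made-up function
name and arguments, not code taken from the cluster software:

```python
def resources_may_run(peer_reachable: bool, fence_succeeded: bool) -> bool:
    """Decide whether this node may keep (or take over) resources."""
    if peer_reachable:
        # The peer's state is known, so normal operation continues.
        return True
    # The peer is unreachable: its state is only known once it has been
    # fenced. Until then the only safe choice is to freeze.
    return fence_succeeded


# Network down and fence device unreachable: the cluster has to freeze.
print(resources_may_run(peer_reachable=False, fence_succeeded=False))
# Network down but the peer was fenced: recovery can proceed.
print(resources_may_run(peer_reachable=False, fence_succeeded=True))
```

The point is that an unreachable peer plus a failed fence leaves no
safe answer other than stopping, which is why the fence devices should
not depend on the same network as the cluster traffic.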
Read the article above; it's really very useful.
Cheers!
Thanks
Jatin
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster