On 12/04/2015 12:49 PM, Digimer wrote:
On 04/12/15 09:14 AM, Kelvin Edmison wrote:
On 12/03/2015 09:31 PM, Digimer wrote:
On 03/12/15 08:39 PM, Kelvin Edmison wrote:
On 12/03/2015 06:14 PM, Digimer wrote:
On 03/12/15 02:19 PM, Kelvin Edmison wrote:
I am hoping that someone can help me understand the problems I'm having
with Linux clustering for VMs.
I am clustering 2 VMs on two separate VM hosts, trying to ensure that a
service is always available. The hosts and guests are both RHEL 6.7.
The goal is to have only one of the two VMs running at a time.
The configuration works when we test/simulate VM deaths, graceful VM
host shutdowns, and administrative switchovers (i.e. clusvcadm -r).
However, when we simulate the sudden isolation of host A (e.g. ifdown
eth0), two things happen:
1) The VM on host B does not start, and repeated fence_xvm errors appear
in the logs on host B.
2) When the 'failed' node is returned to service, the cman service on
host B dies.
If the node's host is dead, then there is no way for the survivor to
determine the state of the lost VM node. The cluster is not allowed to
take "no answer" as confirmation of fence success.
If your hosts have IPMI, then you could add fence_ipmilan as a backup
method where, if fence_xvm fails, it moves on and reboots the host
itself.
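
The "try one method, then fall back to the next" ordering described above can be sketched in shell. This only illustrates the ordering. It is not how fenced is implemented, and the function name try_fence_methods is hypothetical.

```shell
#!/bin/sh
# Illustration only: run each fence command in the order given and stop at
# the first one that succeeds. "No answer" counts as failure, never success.
try_fence_methods() {
    for method in "$@"; do
        # Unquoted on purpose so the string splits into command + arguments.
        if $method; then
            echo "fenced via: $method"
            return 0
        fi
        echo "method failed: $method" >&2
    done
    echo "all fence methods failed; cluster stays blocked" >&2
    return 1
}
```

A call would pass the guest-level command first and the host-level command second, for example a fence_xvm invocation followed by a fence_ipmilan invocation, with the domain name, IPMI address, and credentials filled in for your site. Note that the second method power-cycles the whole host, so every VM on it goes down, clustered or not.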
Thank you for the suggestion. The hosts do have ipmi. I'll explore it
but I'm a little concerned about what it means for the other
non-clustered VM workloads that exist on these two servers.
Do you have any thoughts as to why host B's cman process is dying when
'host A' returns?
Thanks,
Kelvin
It's not dying, it's blocking. When a node is lost, dlm blocks until
fenced tells it that the fence was successful. If fenced can't contact
the lost node's fence method(s), then it doesn't succeed and dlm stays
blocked. To anything that uses DLM, like rgmanager, it looks as if the
host is hung, but this is by design. The logic is that, as bad as it is to
hang, it's better than risking a split-brain.
When I said the cman service is dying, I should have qualified it
further. I mean that the corosync process is no longer running (ps -ef |
grep corosync does not show it), and after recovering the failed host A,
manual intervention (service cman start) was required on host B to
recover full cluster services.
[root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s " $SERVICE; service $SERVICE status; done
ricci        ricci (pid 5469) is running...
fence_virtd  fence_virtd (pid 4862) is running...
cman         Found stale pid file
rgmanager    rgmanager (pid 5366) is running...
Thanks,
Kelvin
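
For a reproducer, the status loop above could be wrapped in a small helper that flags any service not reported as running. This is a minimal sketch: the name check_services is hypothetical, and it assumes the status strings match the ones shown in the output above ("is running", "stale pid").

```shell
#!/bin/sh
# Hypothetical helper: classify each service by the text of its status output.
# Returns non-zero if any service is stale or down.
check_services() {
    rc=0
    for svc in "$@"; do
        # Capture the status text; ignore the exit code and match on the text.
        status_text=$(service "$svc" status 2>&1) || true
        case "$status_text" in
            *"is running"*) printf '%-12s ok\n' "$svc" ;;
            *"stale pid"*)  printf '%-12s STALE\n' "$svc"; rc=1 ;;
            *)              printf '%-12s DOWN\n' "$svc"; rc=1 ;;
        esac
    done
    return $rc
}
```

Calling it as check_services ricci fence_virtd cman rgmanager on host B after host A returns would report cman as STALE in the situation shown above.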
Oh now that is interesting...
You'll want input from Fabio, Chrissie or one of the other core devs, I
suspect.
If this is RHEL proper, can you open a rhbz ticket? If it's CentOS, and
if you can reproduce it reliably, can you create a new thread with the
reproducer?
It's RHEL proper in both host and guest, and we can reproduce it reliably.
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster