I am having difficulty achieving a clean failover in a Pacemaker 1.0.7
cluster that is mainly there to run Xen virtual machines. I realize that
nobody can tell me exactly what is wrong without seeing an awful lot of
configuration detail; what I am looking for is some general methods I
can use to debug this.

In a nutshell: if I manually stop all the Xen resources first (with a
command like "crm resource stop vmname"), then failover works perfectly,
and restarting them all manually after a failover also works and
everything appears to be running fine. However, if I just stop heartbeat
on node1, then restart it, then the attempts to stop Xen resources on
node2 (preparatory to moving them back to node1) all fail, resulting in
a stonith of node2 from node1. node1 will start up all the resources,
but when node2 reboots, the process repeats: attempts to stop the Xen
resources on node1 fail, resulting in a stonith of node1 from node2.
Kind of a delayed death match. The only way to break the cycle is to
manually stop the Xen resources before bringing a recovered node back
online. Stop works fine when invoked manually, but fails when invoked
automatically as a result of an attempt to move resources back to a
recovered node.
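
For reference, the manual workaround amounts to roughly the following.
This is only a sketch: the VM names (vm1, vm2) are placeholders rather
than my real resource names, and the CRM variable is something I added
so the script defaults to a dry run that just prints the commands.

```shell
#!/bin/sh
# Sketch of the manual workaround: stop every Xen resource before the
# recovered node is brought back online.
# Assumption: vm1/vm2 are placeholder resource names.
# CRM defaults to a dry run that only prints each command; set CRM=crm
# in the environment to actually run them.
CRM="${CRM:-echo crm}"

stop_vms() {
    for vm in "$@"; do
        # Same command I run by hand for each Xen resource
        $CRM resource stop "$vm" || return 1
    done
}

stop_vms vm1 vm2
```

Run as-is it only prints the "crm resource stop" commands; with CRM=crm
it issues them for real, one resource at a time, and stops at the first
failure.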

I have already tried setting allow-migrate=false on all the Xen resource
definitions just to eliminate one more complication until I can figure
this out.

Any ideas on how I can debug this? The HA logs don't seem to be terribly
helpful; they only indicate that the stop operation failed but say
nothing as to why it failed.

--Greg


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
