I am having difficulty achieving a clean failover in a Pacemaker 1.0.7 cluster whose main job is to run Xen virtual machines. I realize that nobody can tell me exactly what is wrong without seeing an awful lot of configuration detail; what I am looking for is some general methods I can use to debug this.
In a nutshell: if I manually stop all the Xen resources first (with a command like "crm resource stop vmname"), then failover works perfectly. Restarting them all manually after the failover also works, and everything appears to be running fine.
However, if I just stop heartbeat on node1 and then restart it, the attempts to stop the Xen resources on node2 (preparatory to moving them back to node1) all fail, resulting in a stonith of node2 from node1. node1 then starts up all the resources, but when node2 reboots, the process repeats: attempts to stop the Xen resources on node1 fail, resulting in a stonith of node1 from node2. Kind of a delayed death match. The only way to break the cycle is to manually stop the Xen resources before bringing a recovered node back online.
So stop works fine when invoked manually, but fails when the cluster invokes it automatically while trying to move resources back to a recovered node. I have already tried setting allow-migrate=false on all the Xen resource definitions, just to eliminate one more complication until I can figure this out.
Any ideas on how I can debug this? The HA logs don't seem to be terribly helpful: they only indicate that the stop operation failed, and say nothing as to why it failed.
--Greg
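P.S. In case it helps to see exactly what I mean by stopping the resources manually, the workaround amounts to roughly the following sketch. The resource names vm1, vm2 and vm3 are placeholders rather than my real resource IDs, and the VMS and CRM variables and the stop_all_vms function are names made up for this illustration; the only real command involved is "crm resource stop".

```shell
# Placeholder resource names (assumption): substitute the real Xen resource IDs.
VMS="${VMS:-vm1 vm2 vm3}"
# Defaults to the real crm shell; set CRM=echo to print the commands instead.
CRM="${CRM:-crm}"

# Ask the cluster to stop each Xen resource in turn.
# Returns non-zero as soon as one crm invocation reports an error.
stop_all_vms() {
    for vm in $VMS; do
        $CRM resource stop "$vm" || {
            echo "stop request failed for $vm" >&2
            return 1
        }
    done
}
```

I run the equivalent of stop_all_vms on the node that currently holds the resources, check with "crm_mon -1" that they have actually stopped, and only then start heartbeat on the recovered node.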
