On Fri, Mar 11, 2011 at 8:13 AM, Uwe Ritzschke <uwe.ritzschk...@cms.hu-berlin.de> wrote:
> I'm currently testing fail-over with a two-node active-active cluster (with
> node dig and node dag): Both nodes are up, one is manually killed. CTDB on
> the node that's still alive should perform a recovery and everything should
> be working again.
>
> What's infrequently happening is:
>
> After killing the pacemaker-process on dag (and dag consequently being
> fenced), dig's CTDB tries to get the recovery lock and fails. As there is no
> other node online to get the recovery lock and thus finish CTDB's
> recovery, dig's CTDB keeps trying to get the recovery lock until manually
> stopped.
> The only way to get CTDB back to work is to restart OCFS2's distributed lock
> manager.
>
>
> Our setting:
>
> two nodes directly connected via LAN running openSuse 11.3 and sharing a
> SAN-drive that is connected via two interfaces using multipath.
>
> pacemaker 1.1.2
> corosync 1.2.1
> cluster-glue 1.0.5-1.4
> ctdb 1.0.114-2.20
> ocfs2 1.4.3-1.4
> multipath 0.4.8-51.3
>

You might want to try updated packages from the repository:
http://download.opensuse.org/repositories/network:/ha-clustering/openSUSE_11.3/
This would give you newer code levels on the HA packages.

--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org