On 25/11/15 15:22, Jonathan Davies wrote:
> Hi,
>
> I'm experimenting with corosync+dlm+gfs2 (approximately following
> http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying
> to establish whether it meets my requirements. I have a query about a
> node rejoining a cluster after failure, and want to make sure I'm not
> overlooking something.
>
> I have a three-node cluster and deliberately cause token loss by
> firewalling one of them (call it node A) out of the network for longer
> than the token timeout. At this point, the other two hosts (B and C)
> decide that A has disappeared and continue with quorum. That is fine.
>
> When I unfirewall node A, dlm tries to reconnect to its peers on B and
> C. But then I see the following on host B:
>
> 16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
> 16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to
> stateful merge
> 16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove
> nodeid 85 from cluster
> 16:29:25.823696 nodeb corosync[6536]: [CFG ] request to kill node
> 85(us=83): xxx
>
> and then the following on node A:
>
> 16:29:25.828547 nodea corosync[3896]: [CFG ] Killed by node 83:
> dlm_controld
> 16:29:25.828575 nodea corosync[3896]: [MAIN ] Corosync Cluster Engine
> exiting with status -1 at cfg.c:530.
> 16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg
> cfg_dispatch 2
> 16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
> 16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster
> quorum_dispatch 2
> 16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
> 16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
> 16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2
>
> resulting in both corosync and dlm_controld exiting on node A.
>
> Later, if I try to manually restart corosync and dlm on node A, I see
> the following:
>
> 16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
> 16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled
> lockspace mygfs2
> 16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove
> nodeid 85 from cluster
> 16:32:08.394965 nodea corosync[20456]: [CFG ] request to kill node
> 85(us=85): xxx
> 16:32:08.394998 nodea corosync[20456]: [CFG ] Killed by node 85:
> dlm_controld
>
> The only way of making A rejoin the cluster is to reboot.
>
Yes. You need to implement fencing, so that the node is automatically
restarted when it leaves the cluster.

Chrissie

> I would be grateful if you could confirm the following statements:
> (a) The "stateful merge" is unavoidable when node A leaves the cluster
> for longer than the token timeout then tries to rejoin.
> (b) Killing corosync on node A is unavoidable when node B sees the
> "stateful merge".
> (c) dlm exiting is unavoidable when corosync dies.
> (d) Restarting corosync then dlm on node A will necessarily result in
> "found uncontrolled lockspace".
> (e) The only way to recover from "found uncontrolled lockspace" (for a
> gfs2 lockspace) is to reboot.
>
> I'm hoping that I'm overlooking something and that at least one of
> (a)--(e) is false! I'm not comfortable with a reboot being the only
> means of recovery when the token timeout is exceeded.
>
> Thanks,
> Jonathan
>

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
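
For anyone who wants to spot this sequence in their own logs, here is a minimal Python sketch that scans log lines for the three messages quoted in this thread. The function names (classify, needs_reboot) and the regular expressions are assumptions made for illustration, not part of dlm or corosync, and they only match the message wording seen in the logs above (with wrapped lines rejoined).

```python
import re

# Message fragments copied from the logs quoted in this thread. The event
# names on the left are hypothetical labels chosen for this sketch.
PATTERNS = {
    "stateful_merge": re.compile(r"daemon node (\d+) kill due to stateful merge"),
    "killed_by_peer": re.compile(r"Killed by node (\d+): dlm_controld"),
    "uncontrolled_lockspace": re.compile(r"found uncontrolled lockspace (\S+)"),
}


def classify(lines):
    """Return (event, detail) tuples for every recognised log line, in order."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((name, match.group(1)))
    return events


def needs_reboot(events):
    """True if an uncontrolled lockspace was seen.

    In this thread, that state was only cleared by rebooting the node.
    """
    return any(name == "uncontrolled_lockspace" for name, _ in events)


if __name__ == "__main__":
    sample = [
        "nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge",
        "nodea corosync[3896]: [CFG ] Killed by node 83: dlm_controld",
        "nodea dlm_controld[20483]: 2872 found uncontrolled lockspace mygfs2",
    ]
    found = classify(sample)
    for name, detail in found:
        print(name, detail)
    print("reboot needed:", needs_reboot(found))
```

This only reports what has already happened; it is no substitute for fencing, which is what restarts the node automatically.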