I am seeing strange problems with fencing when a cluster loses quorum for a short time.
After quorum is regained, fenced reports wait state 'messages', and the whole cluster is blocked waiting for fenced. I can reproduce that bug here easily. It always happens with the following test.

Software: RHEL 6.3-based kernel, corosync 1.4.4, cluster-3.1.93

I have 4 nodes. Node hp4 is turned off for this test:

hp2:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        hp4
   2   M   1232  2012-10-03 08:59:08   hp1
   3   M   1228  2012-10-03 08:58:58   hp3
   4   M   1220  2012-10-03 08:58:58   hp2

hp2:~# fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 3
wait state    none
members       2 3 4

Everything runs fine so far (the fence_tool ls output matches on all nodes). Now I unplug the network cable on hp1:

hp2:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        hp4
   2   X   1232                        hp1
   3   M   1228  2012-10-03 08:58:58   hp3
   4   M   1220  2012-10-03 08:58:58   hp2

hp2:~# fence_tool ls
fence domain
member count  2
victim count  1
victim now    0
master nodeid 3
wait state    quorum
members       2 3 4

Same output on hp3 - so far, so good. In the fenced log I can find the following entries:

hp2:~# cat /var/log/cluster/fenced.log
Oct 03 08:59:08 fenced fenced 1349169030 started
Oct 03 08:59:09 fenced fencing deferred to hp3

on hp3:

hp3:~# cat /var/log/cluster/fenced.log
Oct 03 08:57:12 fenced fencing node hp4
Oct 03 08:57:21 fenced fence hp4 success

hp2:~# dlm_tool ls
dlm lockspaces
name          rgmanager
id            0x5231f3eb
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       2 3 4
new change    member 2 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   3 4

Same output on hp3. Now I reconnect the network on hp1:

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        hp4
   2   M   1240  2012-10-03 09:07:41   hp1
   3   M   1228  2012-10-03 08:58:58   hp3
   4   M   1220  2012-10-03 08:58:58   hp2

So we have quorum again.
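To check each step of the test quickly, I use a small helper of my own that just extracts the "wait state" value from `fence_tool ls` output (the awk parsing and the helper name are mine; the output format is the one shown above):

```shell
#!/bin/sh
# wait_state: read `fence_tool ls` output on stdin and print the
# value of the "wait state" line (none, quorum, messages, ...).
wait_state() {
    awk '$1 == "wait" && $2 == "state" { print $3 }'
}

# Sample taken from the output above; on a live node you would run
#   fence_tool ls | wait_state
# instead of feeding it a saved sample.
sample='fence domain
member count  3
victim count  1
victim now    0
master nodeid 3
wait state    messages
members       2 3 4'

printf '%s\n' "$sample" | wait_state
```

Running the same helper on every node makes it easy to see when the nodes disagree about the fence domain state.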
hp2:~# fence_tool ls
fence domain
member count  3
victim count  1
victim now    0
master nodeid 3
wait state    messages
members       2 3 4

Same output on hp3; hp1 is different:

hp1:~# fence_tool ls
fence domain
member count  3
victim count  2
victim now    0
master nodeid 3
wait state    messages
members       2 3 4

Here are the fenced dumps - maybe someone can see what is wrong here?

hp2:~# fence_tool dump
...
1349247553 receive_complete 3:3 len 232
1349247751 cluster node 2 removed seq 1236
1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2
1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2
1349247751 add_change cg 3 remove nodeid 2 reason 3
1349247751 add_change cg 3 m 2 j 0 r 1 f 1
1349247751 add_victims node 2
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default ring 4:1236 2 memb 4 3
1349247751 check_ringid done cluster 1236 cpg 4:1236
1349247751 check_quorum not quorate
1349247751 fenced:daemon ring 4:1236 2 memb 4 3
1349248061 cluster node 2 added seq 1240
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left
1349248061 cpg_mcast_joined retried 5 protocol
1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3
1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left
1349248061 add_change cg 4 joined nodeid 2
1349248061 add_change cg 4 m 3 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:default ring 2:1240 3 memb 2 4 3
1349248061 check_ringid done cluster 1240 cpg 2:1240
1349248061 check_quorum done
1349248061 send_start 4:4 flags 2 started 2 m 3 j 1 r 0 f 0
1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061
1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 join 1349247548 left 0 local quorum 1349248061
1349248061 receive_start 4:4 len 232
1349248061 match_change 4:4 skip cg 3 expect counts 2 0 1 1
1349248061 match_change 4:4 matches cg 4
1349248061 wait_messages cg 4 need 2 of 3
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_start 3:5 len 232
1349248061 match_change 3:5 skip cg 3 expect counts 2 0 1 1
1349248061 match_change 3:5 matches cg 4
1349248061 wait_messages cg 4 need 1 of 3
1349248061 receive_start 2:5 len 232
1349248061 match_change 2:5 skip cg 3 sender not member
1349248061 match_change 2:5 matches cg 4
1349248061 receive_start 2:5 add node with started_count 1
1349248061 wait_messages cg 4 need 1 of 3

hp3:~# fence_tool dump
...
1349247553 receive_complete 3:3 len 232
1349247751 cluster node 2 removed seq 1236
1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2
1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2
1349247751 add_change cg 4 remove nodeid 2 reason 3
1349247751 add_change cg 4 m 2 j 0 r 1 f 1
1349247751 add_victims node 2
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default ring 4:1236 2 memb 4 3
1349247751 check_ringid done cluster 1236 cpg 4:1236
1349247751 check_quorum not quorate
1349247751 fenced:daemon ring 4:1236 2 memb 4 3
1349248061 cluster node 2 added seq 1240
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left
1349248061 cpg_mcast_joined retried 5 protocol
1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3
1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061
1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left
1349248061 add_change cg 5 joined nodeid 2
1349248061 add_change cg 5 m 3 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:default ring 2:1240 3 memb 2 4 3
1349248061 check_ringid done cluster 1240 cpg 2:1240
1349248061 check_quorum done
1349248061 send_start 3:5 flags 2 started 3 m 3 j 1 r 0 f 0
1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 join 1349247425 left 0 local quorum 1349248061
1349248061 receive_start 4:4 len 232
1349248061 match_change 4:4 skip cg 4 expect counts 2 0 1 1
1349248061 match_change 4:4 matches cg 5
1349248061 wait_messages cg 5 need 2 of 3
1349248061 receive_start 3:5 len 232
1349248061 match_change 3:5 skip cg 4 expect counts 2 0 1 1
1349248061 match_change 3:5 matches cg 5
1349248061 wait_messages cg 5 need 1 of 3
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_start 2:5 len 232
1349248061 match_change 2:5 skip cg 4 sender not member
1349248061 match_change 2:5 matches cg 5
1349248061 receive_start 2:5 add node with started_count 1
1349248061 wait_messages cg 5 need 1 of 3

hp1:~# fence_tool dump
...
1349247551 our_nodeid 2 our_name hp1
1349247552 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
1349247552 logfile cur mode 100644
1349247552 cpg_join fenced:daemon ...
1349247552 setup_cpg_daemon 10
1349247552 group_mode 3 compat 0
1349247552 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left
1349247552 fenced:daemon ring 2:1232 3 memb 2 4 3
1349247552 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349247552 daemon node 4 max 0.0.0.0 run 0.0.0.0
1349247552 daemon node 4 join 1349247552 left 0 local quorum 1349247551
1349247552 run protocol from nodeid 4
1349247552 daemon run 1.1.1 max 1.1.1
1349247552 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349247552 daemon node 3 max 0.0.0.0 run 0.0.0.0
1349247552 daemon node 3 join 1349247552 left 0 local quorum 1349247551
1349247552 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1349247552 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551
1349247552 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1349247552 daemon node 2 max 1.1.1.0 run 0.0.0.0
1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551
1349247553 client connection 3 fd 13
1349247553 added 4 nodes from ccs
1349247553 cpg_join fenced:default ...
1349247553 fenced:default conf 3 1 0 memb 2 3 4 join 2 left
1349247553 add_change cg 1 joined nodeid 2
1349247553 add_change cg 1 m 3 j 1 r 0 f 0
1349247553 add_victims_init nodeid 1
1349247553 check_ringid cluster 1232 cpg 0:0
1349247553 fenced:default ring 2:1232 3 memb 2 4 3
1349247553 check_ringid done cluster 1232 cpg 2:1232
1349247553 check_quorum done
1349247553 send_start 2:1 flags 1 started 0 m 3 j 1 r 0 f 0
1349247553 receive_start 3:3 len 232
1349247553 match_change 3:3 matches cg 1
1349247553 save_history 1 master 3 time 1349247441 how 1
1349247553 wait_messages cg 1 need 2 of 3
1349247553 receive_start 2:1 len 232
1349247553 match_change 2:1 matches cg 1
1349247553 wait_messages cg 1 need 1 of 3
1349247553 receive_start 4:2 len 232
1349247553 match_change 4:2 matches cg 1
1349247553 wait_messages cg 1 got all 3
1349247553 set_master from 0 to complete node 3
1349247553 fencing deferred to hp3
1349247553 receive_complete 3:3 len 232
1349247553 receive_complete clear victim nodeid 1 init 1
1349247750 cluster node 3 removed seq 1236
1349247750 cluster node 4 removed seq 1236
1349247751 fenced:daemon conf 2 0 1 memb 2 4 join left 3
1349247751 fenced:daemon conf 1 0 1 memb 2 join left 4
1349247751 fenced:daemon ring 2:1236 1 memb 2
1349247751 fenced:default conf 2 0 1 memb 2 4 join left 3
1349247751 add_change cg 2 remove nodeid 3 reason 3
1349247751 add_change cg 2 m 2 j 0 r 1 f 1
1349247751 add_victims node 3
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default conf 1 0 1 memb 2 join left 4
1349247751 add_change cg 3 remove nodeid 4 reason 3
1349247751 add_change cg 3 m 1 j 0 r 1 f 1
1349247751 add_victims node 4
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default ring 2:1236 1 memb 2
1349247751 check_ringid done cluster 1236 cpg 2:1236
1349247751 check_quorum not quorate
1349248061 cluster node 3 added seq 1240
1349248061 cluster node 4 added seq 1240
1349248061 check_ringid cluster 1240 cpg 2:1236
1349248061 fenced:daemon conf 2 1 0 memb 2 3 join 3 left
1349248061 cpg_mcast_joined retried 6 protocol
1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 4 left
1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3
1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 4 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 4 stateful merge
1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 3 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 3 stateful merge
1349248061 fenced:default conf 2 1 0 memb 2 3 join 3 left
1349248061 add_change cg 4 joined nodeid 3
1349248061 add_change cg 4 m 2 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 2:1236
1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 4 left
1349248061 add_change cg 5 joined nodeid 4
1349248061 add_change cg 5 m 3 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 2:1236
1349248061 fenced:default ring 2:1240 3 memb 2 4 3
1349248061 check_ringid done cluster 1240 cpg 2:1240
1349248061 check_quorum done
1349248061 send_start 2:5 flags 2 started 1 m 3 j 1 r 0 f 0
1349248061 receive_start 4:4 len 232
1349248061 match_change 4:4 skip cg 2 created 1349247751 cluster add 1349248061
1349248061 match_change 4:4 skip cg 3 sender not member
1349248061 match_change 4:4 skip cg 4 sender not member
1349248061 match_change 4:4 matches cg 5
1349248061 receive_start 4:4 add node with started_count 2
1349248061 wait_messages cg 5 need 3 of 3
1349248061 receive_start 3:5 len 232
1349248061 match_change 3:5 skip cg 2 sender not member
1349248061 match_change 3:5 skip cg 3 sender not member
1349248061 match_change 3:5 skip cg 4 expect counts 2 1 0 0
1349248061 match_change 3:5 matches cg 5
1349248061 receive_start 3:5 add node with started_count 3
1349248061 wait_messages cg 5 need 3 of 3
1349248061 receive_start 2:5 len 232
1349248061 match_change 2:5 skip cg 2 expect counts 2 0 1 1
1349248061 match_change 2:5 skip cg 3 expect counts 1 0 1 1
1349248061 match_change 2:5 skip cg 4 expect counts 2 1 0 0
1349248061 match_change 2:5 matches cg 5
1349248061 wait_messages cg 5 need 2 of 3
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.0
1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061
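Since the full dumps are long, I compare them between nodes after filtering them down to the lines that seem relevant to the hang. This is just grep over a saved dump; the `stateful merge` and `wait_messages` patterns are taken straight from the output above, and the helper name is my own:

```shell
#!/bin/sh
# summarize_dump: reduce a saved fence_tool dump to the membership and
# message-wait lines, which is what differs between the stuck nodes.
summarize_dump() {
    grep -E 'stateful merge|wait_messages|receive_start|match_change' "$1"
}

# Tiny excerpt of the hp2 dump above, saved to a file for the demo;
# on a live node: fence_tool dump > /tmp/fenced.dump
cat > /tmp/fenced.dump <<'EOF'
1349248061 daemon node 2 stateful merge
1349248061 receive_start 3:5 len 232
1349248061 wait_messages cg 4 need 1 of 3
1349248061 check_quorum done
EOF

summarize_dump /tmp/fenced.dump
```

Diffing the filtered dumps of hp2 and hp3 against hp1 makes the repeated "stateful merge" entries for the rejoining node and the stuck "wait_messages ... need N of 3" counters easy to spot.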
