I have occasionally run into this problem, too. Sometimes I can work around it by using chkconfig to turn off clvmd, cman, and rgmanager, rebooting, and then manually starting cman, rgmanager, and clvmd (in that order). Usually, after that, I am able to fence the node(s) and they rejoin automatically (after re-enabling automatic startup with chkconfig, of course). I know this workaround doesn't explain *why* it happens, but it has more than once helped me get my cluster nodes back online without having to reboot all of them.
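
In case it helps, here is roughly that sequence as a small shell sketch. The `run` wrapper, the `RUN` variable, and the function names are made up for illustration and are not part of any cluster tool. It assumes the stock CentOS 5 `chkconfig` and `service` commands, run as root on the affected node. By default it only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Sketch of the workaround described above (illustrative, not an official procedure).
# Prints each command by default; set RUN=1 in the environment to execute for real.
RUN="${RUN:-0}"

run() {
    if [ "$RUN" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Step 1: disable automatic startup of the cluster daemons, then reboot by hand.
disable_autostart() {
    for svc in clvmd cman rgmanager; do
        run chkconfig "$svc" off
    done
}

# Step 2 (after the reboot): start the daemons manually, in this order.
manual_start() {
    for svc in cman rgmanager clvmd; do
        run service "$svc" start
    done
}

# Step 3 (after fencing, once the node has rejoined): restore automatic startup.
enable_autostart() {
    for svc in cman rgmanager clvmd; do
        run chkconfig "$svc" on
    done
}

case "${1:-}" in
    disable) disable_autostart ;;
    start)   manual_start ;;
    enable)  enable_autostart ;;
    *)       echo "usage: $0 {disable|start|enable}" >&2 ;;
esac
```

Saved under a hypothetical name such as rejoin.sh, it would be used as `sh rejoin.sh disable`, then a reboot, then `sh rejoin.sh start`, and finally `sh rejoin.sh enable` once the node has been fenced and has rejoined.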

On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I currently have a 9 node centos 5.1 cman/gfs cluster which I've managed to
> break.
>
> It is broken in almost exactly the same way as stated in these two previous
> threads:
>
> http://www.spinics.net/lists/cluster/msg10304.html
> http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html
>
> However, I can find no resolution in the archives. My only guaranteed
> resolution at this point is a cold restart of all nodes which to me seems
> ridiculous (ie: I'm missing something).
>
> To add a little details, I have nodes cluster1...9. Nodes 7 & 8 are broken.
> When I fence/reboot them, cman starts but times out on starting fencing.
> cman_tools nodes shows them as joined but the fence domain looks broke.
>
> Any ideas?
>
> I have included some information for a good node, bad node, and
> /var/log/messages from a good node that did the fencing.
>
> Good Node:
>
> [EMAIL PROTECTED] ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    1   M    768   2008-07-31 12:47:19  cluster1-rhc
>    2   M    776   2008-07-31 12:47:37  cluster2-rhc
>    3   M    772   2008-07-31 12:47:19  cluster3-rhc
>    4   M    788   2008-07-31 12:56:20  cluster4-rhc
>    5   M    772   2008-07-31 12:47:19  cluster5-rhc
>    6   M    784   2008-07-31 12:52:50  cluster6-rhc
>    7   M    808   2008-07-31 13:24:24  cluster7-rhc
>    8   X    800                        cluster8-rhc
>    9   M    772   2008-07-31 12:47:19  cluster9-rhc
> [EMAIL PROTECTED] ~]# cman_tool services
> type   level  name      id        state
> fence  0      default   00010003  FAIL_START_WAIT
> [1 2 3 4 5 6 9]
> dlm    1      testgfs1  00020005  none
> [1 2 3 4 5 6]
> gfs    2      testgfs1  00010005  none
> [1 2 3 4 5 6]
> [EMAIL PROTECTED] ~]# cman_tool status
> Version: 6.1.0
> Config Version: 13
> Cluster Name: test
> Cluster Id: 1678
> Cluster Member: Yes
> Cluster Generation: 808
> Membership state: Cluster-Member
> Nodes: 8
> Expected votes: 9
> Total votes: 8
> Quorum: 5
> Active subsystems: 7
> Flags: Dirty
> Ports Bound: 0
> Node name: cluster1-rhc
> Node ID: 1
> Multicast addresses: 239.192.6.148
> Node addresses: 10.128.161.81
> [EMAIL PROTECTED] ~]# group_tool
> type   level  name      id        state
> fence  0      default   00010003  FAIL_START_WAIT
> [1 2 3 4 5 6 9]
> dlm    1      testgfs1  00020005  none
> [1 2 3 4 5 6]
> gfs    2      testgfs1  00010005  none
> [1 2 3 4 5 6]
> [EMAIL PROTECTED] ~]#
>
>
> Bad/broken Node:
>
> [EMAIL PROTECTED] ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    1   M    808   2008-07-31 13:24:24  cluster1-rhc
>    2   M    808   2008-07-31 13:24:24  cluster2-rhc
>    3   M    808   2008-07-31 13:24:24  cluster3-rhc
>    4   M    808   2008-07-31 13:24:24  cluster4-rhc
>    5   M    808   2008-07-31 13:24:24  cluster5-rhc
>    6   M    808   2008-07-31 13:24:24  cluster6-rhc
>    7   M    804   2008-07-31 13:24:24  cluster7-rhc
>    8   X      0                        cluster8-rhc
>    9   M    808   2008-07-31 13:24:24  cluster9-rhc
> [EMAIL PROTECTED] ~]# cman_tool services
> type   level  name      id        state
> fence  0      default   00000000  JOIN_STOP_WAIT
> [1 2 3 4 5 6 7 9]
> [EMAIL PROTECTED] ~]# cman_tool status
> Version: 6.1.0
> Config Version: 13
> Cluster Name: test
> Cluster Id: 1678
> Cluster Member: Yes
> Cluster Generation: 808
> Membership state: Cluster-Member
> Nodes: 8
> Expected votes: 9
> Total votes: 8
> Quorum: 5
> Active subsystems: 7
> Flags: Dirty
> Ports Bound: 0
> Node name: cluster7-rhc
> Node ID: 7
> Multicast addresses: 239.192.6.148
> Node addresses: 10.128.161.87
> [EMAIL PROTECTED] ~]# group_tool
> type   level  name      id        state
> fence  0      default   00000000  JOIN_STOP_WAIT
> [1 2 3 4 5 6 7 9]
> [EMAIL PROTECTED] ~]#
>
>
> /var/log/messages:
>
> Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was successful
> Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was successful
> Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from 12.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high seq received 89
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 324
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.89:
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] New Configuration:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Left:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.87)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.88)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Joined:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] New Configuration:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Left:
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Joined:
> Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
> Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.81
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.82
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.83
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.84
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.85
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.86
> Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.89
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 2
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 3
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 4
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 5
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 6
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 9
> Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high seq received 68
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 328
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.87:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.87
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member 10.128.161.89:
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] New Configuration:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Left:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Joined:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] New Configuration:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.87)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Left:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Joined:
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.87)
> Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
> Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.81
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.82
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.83
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.84
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.85
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.86
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.87
> Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.89
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 6
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 9
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 1
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 2
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 3
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 4
> Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 5
>
> Thanks!
>
> Adam
>
> --
> Linux-cluster mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/linux-cluster
