Thanks Chrissie, that was an old artifact from testing with two nodes. I have now set expected votes to 4 (the 3 existing nodes plus the new one), but I still see the same issue: the new node never gains quorum membership over corosync. I can see multicast packets flowing over the wire, but the quorum membership stays static. The status output from node01 and the failed startup attempt on node04 are below; a couple of follow-up notes are at the bottom of this mail, after the quoted thread.
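For reference, the quorum-relevant part of the config now looks roughly like this (a hand-typed sketch rather than a verbatim paste; the fencing and rm sections are left out):

<!-- sketch only: attribute layout from memory, fence sections omitted -->
<cluster name="hv-1618-106-1" config_version="4">
  <cman expected_votes="4">
    <multicast addr="239.192.45.137"/>
  </cman>
  <clusternodes>
    <clusternode name="node01" nodeid="1" votes="1"/>
    <clusternode name="node02" nodeid="2" votes="1"/>
    <clusternode name="node03" nodeid="3" votes="1"/>
    <clusternode name="node04" nodeid="4" votes="1"/>
  </clusternodes>
</cluster>

With expected_votes="4" and one vote per node, quorum should work out to 4/2 + 1 = 3, which matches the "Quorum: 3" line in the status output below, so the three existing members are quorate and node04 joining should not change anything on the quorum side.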
Feb 24 11:29:09 corosync [QUORUM] Members[3]: 1 2 3

Version: 6.2.0
Config Version: 4
Cluster Name: hv-1618-106-1
Cluster Id: 11612
Cluster Member: Yes
Cluster Generation: 244
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: node01
Node ID: 1
Multicast addresses: 239.192.45.137
Node addresses: 10.14.10.6

On Node04:

Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster     [FAILED]
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]

Node status:

Node  Sts   Inc   Joined               Name
   1   M    236   2014-02-24 00:22:32  node01
   2   M    240   2014-02-24 00:22:34  node02
   3   M    244   2014-02-24 00:22:38  node03
   4   X      0                        node04


On Mon, Feb 24, 2014 at 2:25 AM, Christine Caulfield <ccaul...@redhat.com> wrote:

> On 24/02/14 08:39, Bjoern Teipel wrote:
>
>> Hi Fabio,
>>
>> removing UDPU does not change the behavior: the new node still doesn't
>> join the cluster and still wants to fence node01.
>> It still feels like a split brain or something similar.
>> How do you join a new node: with /etc/init.d/cman start, or with
>> cman_tool / dlm_tool join?
>>
>> Bjoern
>>
>> On Sat, Feb 22, 2014 at 10:16 PM, Fabio M. Di Nitto <fdini...@redhat.com
>> <mailto:fdini...@redhat.com>> wrote:
>>
>> On 02/22/2014 08:05 PM, Bjoern Teipel wrote:
>> > Thanks Fabio for replying to my request.
>> >
>> > I'm using stock CentOS 6.4 versions and no rm, just clvmd and dlm.
>> >
>> > Name    : cman            Relocations: (not relocatable)
>> > Version : 3.0.12.1        Vendor: CentOS
>> > Release : 49.el6_4.2      Build Date: Tue 03 Sep 2013 02:18:10 AM PDT
>> >
>> > Name    : lvm2-cluster    Relocations: (not relocatable)
>> > Version : 2.02.98         Vendor: CentOS
>> > Release : 9.el6_4.3       Build Date: Tue 05 Nov 2013 07:36:18 AM PST
>> >
>> > Name    : corosync        Relocations: (not relocatable)
>> > Version : 1.4.1           Vendor: CentOS
>> > Release : 15.el6_4.1      Build Date: Tue 14 May 2013 02:09:27 PM PDT
>> >
>> > My question is based on a problem I have had since January:
>> >
>> > Whenever I add a new node (I put it into cluster.conf and reloaded
>> > with cman_tool version -r -S) I end up with situations where the new
>> > node wants to gain quorum and starts to fence the existing pool
>> > master, and appears to generate some sort of split cluster. Does it
>> > work at all, or do corosync and dlm not know about the recently
>> > added node?
>>
>> I can see you are using UDPU and that could be the culprit. Can you drop
>> UDPU and work with multicast?
>>
>> Jan/Chrissie: do you remember if we support adding nodes at runtime with
>> UDPU?
>>
>> The standalone node should not have quorum at all and should not be able
>> to fence anybody to start with.
>>
>> >
>> > New Node
>> > ==========
>> >
>> > Node  Sts   Inc   Joined               Name
>> >    1   X      0                        hv-1
>> >    2   X      0                        hv-2
>> >    3   X      0                        hv-3
>> >    4   X      0                        hv-4
>> >    5   X      0                        hv-5
>> >    6   M     80   2014-01-07 21:37:42  hv-6   <--- host added
>> >
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] The network interface [10.14.18.77] is now up.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] CMAN 3.0.12.1 (built Sep 3 2013 09:17:34) started
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync configuration service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync profile loading service
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.65}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.67}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.68}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.70}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.66}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.77}
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] quorum regained, resuming activity
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] This node is within the primary component and will provide service.
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.77) ; members(old:0 left:0)
>> > Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Completed service synchronization, ready to provide service.
>> > Jan 7 21:37:46 hv-1 fenced[12620]: fenced 3.0.12.1 started
>> > Jan 7 21:37:46 hv-1 dlm_controld[12643]: dlm_controld 3.0.12.1 started
>> > Jan 7 21:37:47 hv-1 gfs_controld[12695]: gfs_controld 3.0.12.1 started
>> > Jan 7 21:37:54 hv-1 fenced[12620]: fencing node hv-b1clcy1
>> >
>> > sudo -i corosync-objctl | grep member
>> >
>> > totem.interface.member.memberaddr=hv-1
>> > totem.interface.member.memberaddr=hv-2
>> > totem.interface.member.memberaddr=hv-3
>> > totem.interface.member.memberaddr=hv-4
>> > totem.interface.member.memberaddr=hv-5
>> > totem.interface.member.memberaddr=hv-6
>> > runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77)
>> > runtime.totem.pg.mrp.srp.members.6.join_count=1
>> > runtime.totem.pg.mrp.srp.members.6.status=joined
>> >
>> >
>> > Existing Node
>> > =============
>> >
>> > member 6 has not been added to the quorum list:
>> >
>> > Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
>> >
>> > Node  Sts    Inc   Joined               Name
>> >    1   M    4468   2013-12-10 14:33:27  hv-1
>> >    2   M    4468   2013-12-10 14:33:27  hv-2
>> >    3   M    5036   2014-01-07 17:51:26  hv-3
>> >    4   X    4468                        hv-4   (dead at the moment)
>> >    5   M    4468   2013-12-10 14:33:27  hv-5
>> >    6   X       0                        hv-6   <--- added
>> >
>> > Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
>> > Jan 7 21:37:54 hv-1 corosync[7769]: [MAIN ] Completed service synchronization, ready to provide service.
>> >
>> > totem.interface.member.memberaddr=hv-1
>> > totem.interface.member.memberaddr=hv-2
>> > totem.interface.member.memberaddr=hv-3
>> > totem.interface.member.memberaddr=hv-4
>> > totem.interface.member.memberaddr=hv-5
>> > runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65)
>> > runtime.totem.pg.mrp.srp.members.1.join_count=1
>> > runtime.totem.pg.mrp.srp.members.1.status=joined
>> > runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66)
>> > runtime.totem.pg.mrp.srp.members.2.join_count=1
>> > runtime.totem.pg.mrp.srp.members.2.status=joined
>> > runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68)
>> > runtime.totem.pg.mrp.srp.members.4.join_count=1
>> > runtime.totem.pg.mrp.srp.members.4.status=left
>> > runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70)
>> > runtime.totem.pg.mrp.srp.members.5.join_count=1
>> > runtime.totem.pg.mrp.srp.members.5.status=joined
>> > runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67)
>> > runtime.totem.pg.mrp.srp.members.3.join_count=3
>> > runtime.totem.pg.mrp.srp.members.3.status=joined
>> >
>> > cluster.conf:
>> >
>> > <?xml version="1.0"?>
>> > <cluster config_version="32" name="hv-1618-110-1">
>> >     <fence_daemon clean_start="0"/>
>> >     <cman transport="udpu" expected_votes="1"/>
>> >
>
> Setting expected_votes to 1 in a six node cluster is a serious
> configuration error and needs to be changed. That is what is causing the
> new node to fence the rest of the cluster.
>
> Check that all of the nodes have the same cluster.conf file; any
> difference between the file on the existing nodes and the new one will
> prevent the new node from joining too.
>
> Chrissie
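Two quick follow-up notes, as mentioned at the top of this mail.

1) For completeness, the node-add sequence I have been using boils down to roughly this (a sketch from memory, not a transcript of my shell history):

# on one of the existing members (node01):
#   - add the new <clusternode> entry for node04 to /etc/cluster/cluster.conf
#   - bump config_version
#   - copy the file to every node by hand
# then ask cman/corosync to re-read it; -S skips the automatic ccs_sync
# distribution since the file was copied around manually
cman_tool version -r -S

# every existing member should now report the bumped config version
cman_tool version

# on the new node (node04), with the same cluster.conf already in place:
service cman start
# the init script starts fenced/dlm_controld/gfs_controld itself (that is
# what the Node04 output above shows), so as far as I can tell no separate
# cman_tool join or dlm_tool step should be needed

If something in that order is wrong or unsupported, I'd be glad to hear it.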
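2) To rule out the two things Chrissie points at (a cluster.conf mismatch and a multicast problem), something along these lines on each of the four nodes should show it quickly:

# config version and checksum must be identical on node01..node04
cman_tool version
md5sum /etc/cluster/cluster.conf

# validate the XML against the cluster schema
ccs_config_validate

# multicast reachability between all four nodes; run this on all of them
# at the same time (assumes the omping package is installed)
omping node01 node02 node03 node04

cman_tool version alone is enough to spot a version mismatch; the md5sum catches the case where the version matches but the content does not. Note that omping uses its own test multicast group unless one is given with -m, so it checks multicast in general rather than 239.192.45.137 specifically.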