> This is just weird. What exact version of corosync are you running? Do you
> have the latest Z stream?

I am running corosync 1.4.1, and the pacemaker version is 1.1.8-7.el6.
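The exact builds (and whether the latest Z stream is installed) can be double-checked on both nodes with something like the following; the package names assume a stock RHEL 6 install:

    rpm -q corosync corosynclib cman pacemaker
    corosync -v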
Thanks
Lax

-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Jan Friesse
Sent: Friday, October 31, 2014 9:43 AM
To: linux clustering
Subject: Re: [Linux-cluster] daemon cpg_join error retrying

Lax,

> Thanks Honza. Here is what I was doing:
>
>> usual reasons for this problem:
>> 1. MTU is too high and fragmented packets are not enabled (take a look at the
>> netmtu configuration option)
> I am running with the default MTU setting, which is 1500. And I do see my
> interface (eth1) on the box has an MTU of 1500 too.

Keep in mind that if the nodes are not directly connected, a switch can drop packets because of MTU.

>> 2. config files on the nodes are not in sync and one node may contain more node
>> entries than the other nodes (this may also be the case if you have two clusters
>> and one cluster contains an entry for a node of the other cluster)
>> 3. firewall is asymmetrically blocked (so a node can send but not receive). Also
>> keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu
>> uses one socket per remote node for sending.
> Verified my config files cluster.conf and cib.xml, and both have the same
> number of node entries (2).
>
>> I would recommend disabling the firewall completely (for testing); if everything
>> then works, you just need to adjust the firewall.
> I also ran tests with the firewall off on both participating nodes, and still
> see the same issue.
>
> In the corosync log I see a repeated set of these messages; hoping these will
> give some more pointers.
>
> Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.

This is just weird. What exact version of corosync are you running? Do you have the latest Z stream?

Regards,
Honza

> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b
> Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708
> Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state.
> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state.
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state
> Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0
> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 1, seq=333576
> Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1
> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 0, seq=333576
> Oct 29 22:11:05 corosync [CMAN  ] ais: last memb_count = 2, current = 2
> Oct 29 22:11:05 corosync [CMAN  ] memb: sending TRANSITION message. cluster_name = vsomcluster
> Oct 29 22:11:05 corosync [CMAN  ] ais: comms send message 0x7fff8185ca00 len = 65
> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 24
> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 34
> Oct 29 22:11:05 corosync [SYNC  ] This node is within the primary component and will provide service.
> Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state.
> Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 2
> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=2, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 1
> Oct 29 22:11:05 corosync [CMAN  ] Completed first transition with nodes on the same config versions
> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=1, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 1
> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 2
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>
> Thanks
> Lax
>
> -----Original Message-----
> From: linux-cluster-boun...@redhat.com
> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Jan Friesse
> Sent: Thursday, October 30, 2014 1:23 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>
>>
>>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) <lk...@cisco.com> wrote:
>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf
>>>>> and the cluster name the GFS filesystem was created with.
>>>>> How to check cluster name of GFS file system? I had similar
>>>>> configuration running fine in multiple other setups with no such issue.
>>>
>>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>>> Ok.
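On the question above of how to check which cluster name a GFS/GFS2 filesystem was created with: the lock table stored in the filesystem superblock records it as clustername:fsname (the value given to mkfs.gfs2 -t). Depending on the gfs2-utils version, something like one of the following should print it; the device path here is only a placeholder:

    tunegfs2 -l /dev/your_vg/your_gfs2_lv | grep -i 'lock table'
    gfs2_tool sb /dev/your_vg/your_gfs2_lv table

The clustername part should match the cluster name in cluster.conf as reported by 'cman_tool status' (vsomcluster in the log above).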
>>>
>>>>> Also one more issue I am seeing in one other setup: a repeated flood
>>>>> of 'A processor joined or left the membership and a new membership
>>>>> was formed' messages every 4 secs. I am running with default
>>>>> TOTEM settings, with the token timeout as 10 secs. Even after I
>>>>> increase the token and consensus values to be higher, it goes on
>>>>> flooding the same message at the newly defined consensus interval
>>>>> (e.g. if I increase it to 10 secs, then I see the 'new membership
>>>>> formed' messages every 10 secs).
>>>>>
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN  ] Completed service synchronization, ready to provide service.
>>>>>
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN  ] Completed service synchronization, ready to provide service.
>>>
>>>> It does not sound like your network is particularly healthy.
>>>> Are you using multicast or udpu? If multicast, it might be worth
>>>> trying udpu.
>>>
>>> I am using udpu, and I also have the firewall opened for ports 5404 & 5405.
>>> Tcpdump looks fine too; it does not show any issues. This is a VM
>>> environment, and even if I switch to the other node within the same VM
>>> I keep getting the same failure.
>>
>> Depending on what the host and VMs are doing, that might be your problem.
>> In any case, I will defer to the corosync guys at this point.
>
> Lax,
> usual reasons for this problem:
> 1. MTU is too high and fragmented packets are not enabled (take a look at the
> netmtu configuration option)
> 2. config files on the nodes are not in sync and one node may contain more node
> entries than the other nodes (this may also be the case if you have two clusters
> and one cluster contains an entry for a node of the other cluster)
> 3. firewall is asymmetrically blocked (so a node can send but not receive). Also
> keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu
> uses one socket per remote node for sending.
>
> I would recommend disabling the firewall completely (for testing); if everything
> then works, you just need to adjust the firewall.
>
> Regards,
> Honza
>
>>> Thanks
>>> Lax
>>>
>>> -----Original Message-----
>>> From: linux-cluster-boun...@redhat.com
>>> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Andrew Beekhof
>>> Sent: Wednesday, October 29, 2014 3:17 PM
>>> To: linux clustering
>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>>
>>>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) <lk...@cisco.com> wrote:
>>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf
>>>>> and the cluster name the GFS filesystem was created with.
>>>> How to check the cluster name of the GFS file system? I had a similar
>>>> configuration running fine in multiple other setups with no such issue.
>>>
>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
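On points 1 and 3 in Honza's list above, a quick way to exercise both between these two nodes is something like the following (run from 172.28.0.64 and mirrored from .65 with the addresses swapped; eth1 is the cluster interface mentioned earlier, and the rules are only a test sketch, not a final firewall policy):

    # check that a full-size 1500-byte frame survives the path (1472 bytes payload + 28 bytes headers), DF bit set
    ping -M do -s 1472 -c 3 172.28.0.65

    # for testing only, as recommended above: drop the firewall entirely on both nodes
    service iptables stop

    # if that makes the problem go away, re-enable iptables and allow all UDP between the
    # cluster interfaces, since as noted above ports 5404 & 5405 alone may not be enough for udpu
    iptables -I INPUT -i eth1 -p udp -s 172.28.0.65 -j ACCEPT
    service iptables save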
>>>
>>>> Also one more issue I am seeing in one other setup: a repeated flood
>>>> of 'A processor joined or left the membership and a new membership
>>>> was formed' messages every 4 secs. I am running with default
>>>> TOTEM settings, with the token timeout as 10 secs. Even after I
>>>> increase the token and consensus values to be higher, it goes on
>>>> flooding the same message at the newly defined consensus interval
>>>> (e.g. if I increase it to 10 secs, then I see the 'new membership
>>>> formed' messages every 10 secs).
>>>>
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN  ] Completed service synchronization, ready to provide service.
>>>>
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN  ] Completed service synchronization, ready to provide service.
>>>
>>> It does not sound like your network is particularly healthy.
>>> Are you using multicast or udpu? If multicast, it might be worth
>>> trying udpu.
>>>
>>>> Thanks
>>>> Lax
>>>>
>>>> -----Original Message-----
>>>> From: linux-cluster-boun...@redhat.com
>>>> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Andrew Beekhof
>>>> Sent: Wednesday, October 29, 2014 2:42 PM
>>>> To: linux clustering
>>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>>>
>>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) <lk...@cisco.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon
>>>>> cpg_join error retrying'. I have a 2-node setup with pacemaker and
>>>>> corosync.
>>>>
>>>> I wonder if there is a mismatch between the cluster name in cluster.conf
>>>> and the cluster name the GFS filesystem was created with.
>>>>
>>>>> Even after I force-kill the pacemaker processes, reboot the server and
>>>>> bring pacemaker back up, it keeps giving the cpg_join error. Is there
>>>>> any way to fix this issue?
>>>>>
>>>>> Thanks
>>>>> Lax

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
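For the repeated 'A processor joined or left the membership' messages discussed in this thread, the token and consensus values being adjusted live in the totem configuration. On a cman-based setup they are normally set as attributes of the <totem> element in cluster.conf, for example <totem token="20000" consensus="24000" netmtu="1500"/> (values in milliseconds except netmtu, purely illustrative, and consensus should stay larger than token; netmtu is the option Honza points to for fragmented packets). After restarting cman/corosync on both nodes, the values corosync actually picked up can be checked with:

    corosync-objctl | grep -i totem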