Hi Honza,

I am running the packaged version from RHEL 6.4. The exact version is
'corosync-1.4.1-15'.

Thanks
Lax

-----Original Message-----
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Jan Friesse
Sent: Monday, November 03, 2014 1:03 AM
To: linux clustering
Subject: Re: [Linux-cluster] daemon cpg_join error retrying

Lax,

> 
> 
> 
>> This is just weird. What exact version of corosync are you running? Do you
>> have the latest Z stream?
> I am running Corosync 1.4.1, and the Pacemaker version is 1.1.8-7.el6.

Are you running a packaged version (like RHEL/CentOS) or did you compile the
package yourself? If a packaged version, can you please send the exact version
(like 1.4.1-17.1)?

> How do I get access to the Z stream? Is there a specific directory I should
> pick the Z stream up from?

For RHEL you are subscribed to RHN, so you should get it automatically; for
CentOS, you should get it automatically as well.
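
(Assuming the standard update channels are enabled, something as simple as the
following should pull in the latest errata build -- shown only as an illustration:

    yum update corosync corosynclib

and 'rpm -q corosync' afterwards will confirm the exact build that is installed.)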


Regards,
  Honza

> 
> Thanks
> Lax
> 
> 
> -----Original Message-----
> From: linux-cluster-boun...@redhat.com 
> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Jan Friesse
> Sent: Friday, October 31, 2014 9:43 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
> 
> Lax,
> 
> 
>> Thanks Honza. Here is what I was doing,
>>
>>> usual reasons for this problem:
>>> 1. The MTU is too high and fragmented packets are not enabled (take a
>>> look at the netmtu configuration option).
>> I am running with the default MTU setting, which is 1500, and my
>> interface (eth1) on the box has an MTU of 1500 too.
>>
> 
> Keep in mind that if the nodes are not directly connected, the switch can
> drop packets because of MTU.
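>
> (As a quick illustrative check of the path MTU, a full-size, non-fragmenting
> ping between the nodes can be used, e.g.:
>
>     ping -M do -s 1472 172.28.0.65
>
> 1472 bytes of payload plus 28 bytes of ICMP/IP headers gives a 1500-byte
> packet; if that fails while smaller sizes work, something on the path is
> dropping large frames.)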
> 
>>
>>> 2. The config files on the nodes are not in sync and one node may contain
>>> more node entries than the other nodes (this may also be the case if you
>>> have two clusters and one cluster contains an entry for a node of the
>>> other cluster).
>>> 3. The firewall is asymmetrically blocked (so a node can send but not
>>> receive). Also keep in mind that ports 5404 & 5405 may not be enough for
>>> udpu, because udpu uses one socket per remote node for sending.
>> Verified my config files cluster.conf and cib.xml; both have the same
>> number of node entries (2).
>>
>>> I would recommend disabling the firewall completely (for testing); if
>>> everything then works, you just need to adjust the firewall.
>> I also ran tests with the firewall off on both participating nodes and
>> still see the same issue.
>>
>> In the corosync log I see a repeated set of these messages; hopefully they
>> will give some more pointers.
>>
>> Oct 29 22:11:02 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
>> Oct 29 22:11:02 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
>> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
>> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.
> 
> This is just weird. What exact version of corosync are you running? Do you
> have the latest Z stream?
> 
> Regards,
>    Honza
> 
>> Oct 29 22:11:05 corosync [TOTEM ] got commit token
>> Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b
>> Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708
>> Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state.
>> Oct 29 22:11:05 corosync [TOTEM ] got commit token
>> Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state.
>> Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64:
>> Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65:
>> Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64:
>> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
>> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
>> Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65:
>> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
>> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
>> Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
>> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
>> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
>> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
>> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
>> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
>> Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state
>> Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0
>> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 1, seq=333576
>> Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1
>> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 0, seq=333576
>> Oct 29 22:11:05 corosync [CMAN  ] ais: last memb_count = 2, current = 2
>> Oct 29 22:11:05 corosync [CMAN  ] memb: sending TRANSITION message. cluster_name = vsomcluster
>> Oct 29 22:11:05 corosync [CMAN  ] ais: comms send message 0x7fff8185ca00 len = 65
>> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 24
>> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 34
>> Oct 29 22:11:05 corosync [SYNC  ] This node is within the primary component and will provide service.
>> Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state.
>> Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0
>> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
>> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 2
>> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
>> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=2, incarnation = 333576
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0
>> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
>> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 1
>> Oct 29 22:11:05 corosync [CMAN  ] Completed first transition with nodes on the same config versions
>> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
>> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=1, incarnation = 333576
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy CLM service)
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
>> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy CLM service)
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy AMF service)
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
>> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy AMF service)
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (openais checkpoint service B.01.01)
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
>> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (openais checkpoint service B.01.01)
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy EVT service)
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
>> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy EVT service)
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
>> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 1
>> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0)
>> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>> Oct 29 22:11:05 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 2
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
>> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
>> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040
>> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530
>> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
>> Oct 29 22:11:05 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> Thanks
>> Lax
>>
>>
>> -----Original Message-----
>> From: linux-cluster-boun...@redhat.com 
>> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Jan Friesse
>> Sent: Thursday, October 30, 2014 1:23 AM
>> To: linux clustering
>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>
>>>
>>>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) <lk...@cisco.com> wrote:
>>>>
>>>>
>>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf 
>>>>>> and the cluster name the GFS filesystem was created with.
>>>>>> How do I check the cluster name of a GFS file system? I had a similar
>>>>>> configuration running fine in multiple other setups with no such issue.
>>>>
>>>>> I don't really recall. Hopefully someone more familiar with GFS2 can 
>>>>> chime in.
>>>> Ok.
>>>>
>>>>>>
>>>>>> Also, one more issue I am seeing in another setup is a repeated
>>>>>> flood of 'A processor joined or left the membership and a new
>>>>>> membership was formed' messages every 4 seconds. I am running with
>>>>>> the default TOTEM settings, with the token timeout at 10 seconds.
>>>>>> The flood continues even after I increase the token and consensus
>>>>>> values; the same message simply repeats at the newly configured
>>>>>> consensus interval (e.g. if I increase it to 10 seconds, then I see
>>>>>> the new-membership messages every 10 seconds).
>>>>>>
>>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor 
>>>>>> joined or left the membership and a new membership was formed.
>>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen 
>>>>>> downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed 
>>>>>> service synchronization, ready to provide service.
>>>>>>
>>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor 
>>>>>> joined or left the membership and a new membership was formed.
>>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen 
>>>>>> downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed 
>>>>>> service synchronization, ready to provide service.
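>>>>>>
>>>>>> (For reference, on a cman-based RHEL 6 cluster these totem timers are
>>>>>> usually set in cluster.conf; the values below are purely an
>>>>>> illustration, in milliseconds:
>>>>>>
>>>>>>     <totem token="10000" consensus="12000"/>
>>>>>>
>>>>>> followed by a restart of the cluster stack so the new timers actually
>>>>>> take effect.)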
>>>>
>>>>> It does not sound like your network is particularly healthy.
>>>>> Are you using multicast or udpu? If multicast, it might be worth 
>>>>> trying udpu
>>>>
>>>> I am using udpu and I also have the firewall opened for ports 5404 & 5405.
>>>> Tcpdump looks fine too; it does not show any issues. This is a VM
>>>> environment, and even if I switch to the other node within the same VM
>>>> environment I keep getting the same failure.
>>>
>>> Depending on what the host and VMs are doing, that might be your problem.
>>> In any case, I will defer to the corosync guys at this point.
>>>
>>
>> Lax,
>> usual reasons for this problem:
>> 1. The MTU is too high and fragmented packets are not enabled (take a look
>> at the netmtu configuration option).
>> 2. The config files on the nodes are not in sync and one node may contain
>> more node entries than the other nodes (this may also be the case if you
>> have two clusters and one cluster contains an entry for a node of the
>> other cluster).
>> 3. The firewall is asymmetrically blocked (so a node can send but not
>> receive). Also keep in mind that ports 5404 & 5405 may not be enough for
>> udpu, because udpu uses one socket per remote node for sending.
>>
>> I would recommend disabling the firewall completely (for testing); if
>> everything then works, you just need to adjust the firewall.
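>>
>> (On RHEL 6 that temporary test could look something like this -- shown only
>> as an illustration:
>>
>>     service iptables stop
>>     service ip6tables stop
>>
>> and once the cluster behaves, re-enable the firewall and add rules that
>> accept all corosync traffic from the peer node, since, as noted above,
>> udpu uses one socket per remote node for sending.)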
>>
>> Regards,
>>    Honza
>>
>>
>>
>>>>
>>>> Thanks
>>>> Lax
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: linux-cluster-boun...@redhat.com 
>>>> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Andrew 
>>>> Beekhof
>>>> Sent: Wednesday, October 29, 2014 3:17 PM
>>>> To: linux clustering
>>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>>>
>>>>
>>>>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) <lk...@cisco.com> wrote:
>>>>>
>>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf 
>>>>>> and the cluster name the GFS filesystem was created with.
>>>>> How do I check the cluster name of a GFS file system? I had a similar
>>>>> configuration running fine in multiple other setups with no such issue.
>>>>
>>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime 
>>>> in.
>>>>
>>>>>
>>>>> Also, one more issue I am seeing in another setup is a repeated
>>>>> flood of 'A processor joined or left the membership and a new
>>>>> membership was formed' messages every 4 seconds. I am running with
>>>>> the default TOTEM settings, with the token timeout at 10 seconds.
>>>>> The flood continues even after I increase the token and consensus
>>>>> values; the same message simply repeats at the newly configured
>>>>> consensus interval (e.g. if I increase it to 10 seconds, then I see
>>>>> the new-membership messages every 10 seconds).
>>>>>
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor 
>>>>> joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: 
>>>>> sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed 
>>>>> service synchronization, ready to provide service.
>>>>>
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor 
>>>>> joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: 
>>>>> sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed 
>>>>> service synchronization, ready to provide service.
>>>>
>>>> It does not sound like your network is particularly healthy.
>>>> Are you using multicast or udpu? If multicast, it might be worth 
>>>> trying udpu
>>>>
>>>>>
>>>>> Thanks
>>>>> Lax
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: linux-cluster-boun...@redhat.com 
>>>>> [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Andrew 
>>>>> Beekhof
>>>>> Sent: Wednesday, October 29, 2014 2:42 PM
>>>>> To: linux clustering
>>>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>>>>
>>>>>
>>>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) <lk...@cisco.com> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon
>>>>>> cpg_join error retrying'. I have a 2-node setup with Pacemaker and
>>>>>> Corosync.
>>>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf 
>>>>> and the cluster name the GFS filesystem was created with.
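>>>>>
>>>>> (If it helps, one way to compare them -- assuming the gfs2-utils tools
>>>>> are available; this is just an illustrative pointer -- is to read the
>>>>> lock table stored in the GFS2 superblock:
>>>>>
>>>>>     gfs2_tool sb /dev/your_vg/your_lv table
>>>>>
>>>>> The value has the form clustername:fsname, and the part before the colon
>>>>> should match the cluster name in cluster.conf.)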
>>>>>
>>>>>>
>>>>>> Even after I force-kill the Pacemaker processes, reboot the server,
>>>>>> and bring Pacemaker back up, it keeps giving the cpg_join error. Is
>>>>>> there any way to fix this issue?
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Lax
>>>>>>

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
