Moullé Alain wrote:
> Hi again,
>
> and thanks for all the information.
>
> Last one: "Another possibility may be to not use RRP and consider
> bonding."
> So perhaps I did not completely understand: do you mean that even rrp
> mode set to "passive" will not work correctly?
Yes. With 1.2.3 it will not work.

> I understood that it was the "active" mode which was not fully
> operational, so I was about to set it to passive to avoid the problems ...

Active mode is unsupported even with 1.4.1.

Let me summarize it for you:

1.2.3 - Just don't use it. If you are using it, you may experience
various problems. If you really need to use it, please consider bonding.

1.4.1 - Only passive mode is fully supported (see the passive-mode
config sketch appended at the end of this mail).

Honza

> Alain
>
> On 19/11/2013 09:54, Jan Friesse wrote:
>> Alain,
>>
>> Moullé Alain wrote:
>>> Hi Jan,
>>>
>>> If you don't mind, I need more details about 1.2.3-36:
>>> I just checked the man page of corosync.conf in this release again, and
>>> rrp_mode is described with its 3 possible values: none, passive and
>>> active.
>>> So "without official support for rrp":
>>> OK, "not officially supported", but does it really mean that it was not
>>> fully working correctly?
>> Yes. It was not working correctly. Actually, we've put huge effort into
>> making it work (like 1/2 year of work).
>>
>>> And that there could randomly be a bad behavior such as the one I
>>> described below?
>> Yes
>>
>>> And also, "was last release without official support for RRP": does
>>> that mean that 1.4.0-1 gives a fully operational rrp mode?
>> RHEL 6.1 was 1.2.3-36. The next RHEL, RHEL 6.2, was 1.4.1-4. Keep in mind
>> that RRP in RHEL 6.2 is TechPreview, so again no support.
>>
>>> I'm a little bit afraid to upgrade to the latest 1.4 corosync release on
>>> a RHEL 6.1 ... is there no risk of incompatibility with the other HA
>>> rpms (pacemaker-1.1.5-5, cluster-glue-1.0.5-2, etc.)?
>>>
>> There should be no problems BUT you are on your own (which you are
>> anyway, because EUS for RHEL 6.1 was retired on May 31, 2013).
>>
>> I would recommend (if possible) that you really consider updating to the
>> latest RHEL (probably wait for 6.5, where again A HUGE amount of fixes
>> are available).
>>
>> Another possibility may be to not use RRP and consider bonding.
>>
>> Regards,
>> Honza
>>
>>> Thanks a lot for your help.
>>> Alain Moullé
>>>
>>> On 18/11/2013 15:50, Jan Friesse wrote:
>>>> Moullé Alain wrote:
>>>>> Hi,
>>>>>
>>>>> with corosync 1.2.3-36 (with Pacemaker) on a 4-node HA cluster, we got
>>>> 1.2.3-36 is the problem. This was the last release WITHOUT official
>>>> support for RRP.
>>>>
>>>>> a strange and random problem:
>>>>>
>>>>> For some reason that we can't identify in the syslog, one node (let's
>>>>> say node1) loses the 3 other members node2, node3, node4, without any
>>>>> visible network problem on either heartbeat network (configured in rrp
>>>>> active mode, with a distinct mcast address and a distinct mcast port).
>>>>> This node elects itself as DC (isolated, and while node2 is already
>>>>> DC) until node2 (the DC) asks node3 to fence node1 (probably because
>>>>> it detects another DC).
>>>>> Main traces are given below.
>>>>> When node1 is rebooted and Pacemaker is started again, it is included
>>>>> in the HA cluster again and all works fine.
>>>>>
>>>>> I've checked the changelog of corosync between 1.2.3-36 and 1.4.1-7,
>>>>> but there are around 188 bugzillas fixed between the two releases, so I
>>>> Exactly. Actually, RRP in 1.4 is totally different from RRP in 1.2.3.
>>>> No chance for a simple "fix". Just upgrade to the latest 1.4.
>>>>
>>>>> would like to know if someone in the development team remembers a fix
>>>>> for such a random problem, where a node isolated from the cluster for
>>>>> a few seconds elects itself DC and is consequently fenced by the
>>>>> former DC, which is in the quorate part of the HA cluster?
>>>>>
>>>>> And also, as a workaround or as normal but missing tuning, whether
>>>>> some tuning exists in the corosync parameters to prevent a node
>>>>> isolated for a few seconds from electing itself as the new DC?
>>>>>
>>>>> Thanks a lot for your help.
>>>>> Alain Moullé
>>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>>> I can see such traces in syslog:
>>>>>
>>>>> node1 syslog:
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
>>>>> status: node2 is now lost (was member)
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
>>>>> status: node3 is now lost (was member)
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: ais_status_callback:
>>>>> status: node4 is now lost (was member)
>>>>> ...
>>>>> 03:28:55 node1 daemon warning crmd [26314]: WARN: check_dead_member:
>>>>> Our DC node (node2) left the cluster
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: update_dc: Unset DC node2
>>>>> ...
>>>>> 03:28:55 node1 daemon info crmd [26314]: info: do_dc_takeover: Taking
>>>>> over DC status for this partition
>>>>> ...
>>>>> 03:28:56 node1 daemon info crmd [26314]: info: update_dc: Set DC to
>>>>> node1 (3.0.5)
>>>>>
>>>>>
>>>>> node2 syslog:
>>>>> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node
>>>>> 704645642/node1 is now: lost
>>>>> ...
>>>>> 03:29:05 node2 daemon info corosync [pcmk ] info: update_member: Node
>>>>> 704645642/node1 is now: member
>>>>>
>>>>>
>>>>> node3:
>>>>> 03:30:17 node3 daemon info crmd [26549]: info: tengine_stonith_notify:
>>>>> Peer node1 was terminated (reboot) by node3 for node2
>>>>> (ref=c62a4c78-21b9-4288-8969-35b361cabacf): OK
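
For reference, here is a minimal sketch of what a passive-mode RRP setup in
corosync.conf could look like on 1.4.x. The subnets, multicast addresses and
ports below are placeholders, not values from the cluster discussed above;
the point is simply two interface sections on separate networks, each with
its own mcast address and port, and rrp_mode set to passive:

totem {
        version: 2
        rrp_mode: passive

        # First ring (placeholder subnet, mcast address and port)
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }

        # Second ring, on a separate subnet with its own mcast address/port
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.2.0
                mcastaddr: 239.255.2.1
                mcastport: 5407
        }
}

If you drop RRP in favour of bonding instead, the usual approach on RHEL 6 is
an active-backup bond carrying a single corosync ring, so that link failover
is handled by the bonding driver rather than by corosync.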
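
Regarding the tuning question quoted above (keeping a node that is cut off
for only a few seconds from forming its own membership and electing itself
DC): the totem timeouts in corosync.conf control how quickly corosync
declares the other members lost. The values below are only illustrative, not
a recommendation for this particular cluster; larger values let corosync
ride out short network blips at the cost of slower detection of real
failures:

totem {
        # Timeout (ms) before token loss is declared when no token
        # arrives. The default is 1000; a larger value tolerates short
        # outages.
        token: 5000

        # Number of token retransmits attempted before the token is
        # declared lost (default 4).
        token_retransmits_before_loss_const: 10

        # Time (ms) to wait for consensus before starting a new round of
        # membership configuration; must be larger than token
        # (default 1.2 * token).
        consensus: 6000
}

Note that this only delays the membership change; DC election itself is
Pacemaker's behaviour, so a node that really stays isolated will still
promote itself and be fenced by the quorate partition.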