Re: [Openais] Question about corosync mcastaddr setting

Jan Friesse Thu, 16 May 2013 22:55:53 -0700

Moullé Alain wrote:
> Hi Jan,
> thanks for all your accurate answers.
> 
> Some more questions, partly related to this problem and partly to another:
> 
> We are also facing the problem described in *Bug 854216*
> <https://bugzilla.redhat.com/show_bug.cgi?id=854216> -[TOTEM] FAILED TO
> RECEIVE + corosync crash. That bug report says it is also related to
> bug 917914, which states: "This generally indicates a
> network (multicast) issue between the nodes"


Yes. The bug is triggered when one of the nodes is unable to receive
multicast packets for quite a long time (multicast is blocked, or the
switch decides to stop sending multicast packets, or the switch is
overloaded and doesn't send packets, ...). Also, for VMs on newer kernels
you may need to run
"echo 1 > /sys/class/net/virbrN/bridge/multicast_querier",
because the mcast querier there is somewhat buggy (see
https://bugzilla.redhat.com/show_bug.cgi?id=880035).
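
To find which bridges still have the querier switched off before running
the echo above, something like the following may help. It is only a
sketch: the function name bridges_without_querier is made up for this
example, and it assumes the sysfs layout from the echo command
(/sys/class/net/<bridge>/bridge/multicast_querier). It only reads the
flag and changes nothing.

```python
from pathlib import Path


def bridges_without_querier(sysfs_net="/sys/class/net"):
    """Return the names of bridges whose multicast_querier flag is not 1.

    Hypothetical helper for illustration; it only reads sysfs.
    """
    root = Path(sysfs_net)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted here
    off = []
    for iface in sorted(root.iterdir()):
        flag = iface / "bridge" / "multicast_querier"
        if not flag.is_file():
            continue  # not a bridge, or this kernel does not expose the flag
        if flag.read_text().strip() != "1":
            off.append(iface.name)
    return off
```

Calling bridges_without_querier() with no argument inspects the running
host; passing another directory makes it easy to try against a copy of
the tree.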


> 
> So now my questions are:
> 
> 1/ If we change mcastport so that the ports for the 1st ring and the
> 2nd ring are distinct (5405 & 5407), will that also prevent the
> corosync crash described in 854216, or not?
> 

No, it will not prevent the described crash. In fact, 854216 is not
related to rrp at all. The crash can happen with or without rrp.
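
Distinct ports are still worth having for rrp itself. The rule from the
table quoted further down (the rings must differ in mcastaddr, or their
ports must differ by more than 1) can be written as a small check. This
is only a sketch of that rule: rings_conflict is a made-up name, not
something corosync provides, and corosync itself will not warn you.

```python
import ipaddress


def rings_conflict(ring_a, ring_b):
    """Return True when two rings fall into the non-working row of the table.

    Each ring is a (mcastaddr, mcastport) tuple. Hypothetical helper,
    not part of corosync.
    """
    addr_a, port_a = ring_a
    addr_b, port_b = ring_b
    same_addr = ipaddress.ip_address(addr_a) == ipaddress.ip_address(addr_b)
    # "don't use port +- 1": equal or adjacent ports count as the same port
    ports_too_close = abs(port_a - port_b) <= 1
    return same_addr and ports_too_close
```

With the values from this thread, ("239.0.0.1", 5405) on both rings is
a conflict, while 5405 and 5407 on the same address is not.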

> 2/ Can the corosync crash described in 854216 happen only in HA
> clusters with more than two nodes, or can it also happen in two-node
> HA clusters?
> 

Also with two nodes.

> 3/ Which release of corosync is the first to include the patch given
> in 854216?
> 

Upstream 1.4.5 (or 2.3.0). If you are asking about RHEL, it's not in
6.4, but hopefully it will be in 6.5.
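
If you need to compare an upstream version string against those two
releases, a sketch like this would do. The name has_854216_fix is made
up, and it assumes plain upstream numbering. Distribution packages can
carry the patch under an older version number, so for a RHEL package
the result says nothing either way.

```python
def has_854216_fix(version):
    """Return True if an upstream corosync version is >= 1.4.5 (1.x) or >= 2.3.0.

    Hypothetical helper; ignores any package release suffix such as "-7".
    """
    upstream = version.split("-")[0]
    parts = tuple(int(p) for p in upstream.split(".")[:3])
    parts += (0,) * (3 - len(parts))  # pad "2.3" to (2, 3, 0)
    if parts[0] == 1:
        return parts >= (1, 4, 5)
    return parts >= (2, 3, 0)
```

For example, has_854216_fix("1.4.1-7") is False as far as upstream
numbering goes, and has_854216_fix("2.3.0") is True.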

> Thanks again.
> Alain
> 

Regards,
  Honza

> 
> 
> 
> 
> On 29/04/2013 08:54, Jan Friesse wrote:
>> Moullé Alain wrote:
>>> Jan,
>>> OK thanks.
>>>
>>> By the way, is there a way to modify the mcast values in corosync.conf
>>> and have a running Pacemaker/corosync configuration take them into
>>> account without a restart?
>>>
>> Sadly not. It's one of the "nice to have" features, but it is quite
>> complex to implement.
>>
>> Regards,
>>    Honza
>>
>>> Alain
>>>> Alain,
>>>> don't test with ifdown. Ifdown changes the routing table and makes
>>>> corosync bind to 127.0.0.1, so the other nodes see that node as their
>>>> own and corosync will behave extremely badly (no reaction to another
>>>> node's failure, or a gather state loop).
>>>>
>>>> I really can't remember what the actual issue with identical mcast
>>>> addresses was, but when I was testing that, it had big problems. You
>>>> maybe didn't hit them because of ifdown, but they are there. Don't
>>>> expect a crash of corosync. Expect an inconsistent membership view, no
>>>> reaction to failure, ... In short, just don't do that.
>>>>
>>>> Moullé Alain wrote:
>>>>> Hi again Jan,
>>>>>
>>>>> I'd like to add another question to my last one:
>>>>> "what is the real risk with my configuration (and corosync-1.4.1-7) ?"
>>>>>
>>>>> and is there any dependency between the risk and the interfaces (IF)
>>>>> used for both heartbeats?
>>>>> Let me explain:
>>>>>     I've tested this configuration with two standard eth IFs, doing
>>>>>     ifdown on one IF and then the same test on the second IF. All
>>>>>     works fine, provided there is always at least one IF reachable;
>>>>>     there is no impact on Pacemaker/corosync.
>>>>>     But when mixing two IF types, let's say a standard eth IF and a
>>>>>     bridge eth IF, or a standard eth IF and an IP/IB IF, etc.,
>>>>>     I've always had a problem when "ifdown-ing" one of the IFs
>>>>>     (I don't remember which one), as if there were only one
>>>>>     heartbeat network.
>>>>>     So my question: does the risk of rrp mode not working correctly
>>>>>     (if mcastaddr and mcastport are the same for both rings) depend
>>>>>     on the IFs used?
>>>> Not from the corosync side, but packet routing may do a really bad
>>>> job, e.g. a packet is delivered even though it shouldn't be ... To
>>>> build a really highly available setup, use two NICs per machine
>>>> connected to two independent switches.
>>>>
>>>> Honza
>>>>
>>>>>     And is the risk zero when using two standard eth IFs?
>>>>>
>>>>> Thanks for all information.
>>>>> Alain
>>>>>
>>>>> On 25/04/2013 17:33, Jan Friesse wrote:
>>>>>> Moullé Alain wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> "you can choose" ... meaning that it is not mandatory? And is my
>>>>>>> configuration correct anyway?
>>>>>> No, your configuration is not correct. "You can choose..." means a
>>>>>> logical OR: at least one of the two must differ. So (table)
>>>>>>
>>>>>> same_mcast_addr | same_port (or +- 1) | works
>>>>>> ---------------------------------------------
>>>>>> 0               | 0                   | 1
>>>>>> 0               | 1                   | 1
>>>>>> 1               | 0                   | 1
>>>>>> 1               | 1                   | 0
>>>>>>
>>>>>>
>>>>>>> Because somebody told me that if we use the same mcastaddr, the
>>>>>>> corosync documentation says (but I can't find where)
>>>>>>> that corosync may crash if one network becomes unreachable ...
>>>>>>> could you confirm this, or reassure me by telling me that it is
>>>>>>> not true ;-) ?
>>>>>> Yes, rrp will not work as expected.
>>>>>>
>>>>>> Honza
>>>>>>
>>>>>>> Thanks
>>>>>>> Alain
>>>>>>>> Moullé Alain wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> corosync-1.4.1-7
>>>>>>>>>
>>>>>>>>> with two rings in corosync.conf and rrp mode active, is it
>>>>>>>>> recommended to have two distinct mcastaddr values?
>>>>>>>> You can choose to have either two distinct mcastaddrs or
>>>>>>>> distinct ports (don't use port +- 1).
>>>>>>>>
>>>>>>>>> (and if so, where can I find this information ?)
>>>>>>>>>
>>>>>>>> It looks like it's not documented (thanks for pointing that out, I
>>>>>>>> will add it to my TODO) and sadly, corosync will not complain
>>>>>>>> (this is already on the TODO)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>       Honza
>>>>>>>>
>>>>>>>>> or is it not important, and can we have the same mcast addr on
>>>>>>>>> both rings?
>>>>>>>>>
>>>>>>>>> To put it another way, is this sample correct:
>>>>>>>>>
>>>>>>>>>          interface {
>>>>>>>>>              ringnumber: 0
>>>>>>>>>
>>>>>>>>>              # The following three values need to be set based on your environment
>>>>>>>>>              bindnetaddr: 182.128.3.0
>>>>>>>>>              mcastaddr: 239.0.0.1
>>>>>>>>>              mcastport: 5405
>>>>>>>>>          }
>>>>>>>>>          interface {
>>>>>>>>>              ringnumber: 1
>>>>>>>>>
>>>>>>>>>              # The following values need to be set based on your environment
>>>>>>>>>              bindnetaddr: 182.128.2.0
>>>>>>>>>              mcastaddr: 239.0.0.1
>>>>>>>>>              mcastport: 5405
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Alain
>>>>>>>>> _______________________________________________
>>>>>>>>> Openais mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.linuxfoundation.org/mailman/listinfo/openais
> 
> 

