Hideo,

can you please try to test the following things:

- Block communication on the local nodes via iptables (i.e. drop all UDP
  traffic, something like "iptables -A INPUT ! -i lo -p udp -j DROP &&
  iptables -A OUTPUT ! -o lo -p udp -j DROP"), and then remove these rules
  again. Does corosync create the membership correctly? (A rough test
  sketch follows below.)

- Unplug the cables (please make sure the network is NOT managed by
  NetworkManager. NetworkManager does an ifdown, and corosync doesn't work
  correctly with ifdown). Then plug the cables back in. Is the membership
  reconstructed correctly?
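Something like this, for example (just a rough sketch of the first test;
the corosync-objctl key below assumes corosync 1.4.x as shipped with
RHEL 6.4, so adjust it for your version):

  # on each node: drop all non-loopback UDP, which cuts the totem traffic
  iptables -A INPUT  ! -i lo -p udp -j DROP
  iptables -A OUTPUT ! -o lo -p udp -j DROP

  # wait until the nodes have declared the token lost, then remove the rules
  iptables -D INPUT  ! -i lo -p udp -j DROP
  iptables -D OUTPUT ! -o lo -p udp -j DROP

  # afterwards, check that all three nodes show up as members again
  corosync-cfgtool -s
  corosync-objctl runtime.totem.pg.mrp.srp.members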
If the result of both test cases is a correct membership, then the problem
is in the switch. If so, you can either try corosync's UDPU mode (it's
slightly slower, but as long as GFS is not used it's acceptable, especially
for a 3-node environment) or you can try changing the switch configuration.
(A rough corosync.conf sketch for UDPU is at the very end of this mail,
below the quoted thread.)

Regards,
  Honza

[email protected] wrote:
> Hi Honza,
>
> Thank you for the comments.
>
>> can you please tell me exact reproducer for physical hw? (because brctl
>> delif is I believe not valid in hw at all).
>
> The following is the environment in which I reported the problem the
> second time, on physical hardware.
>
> -------------------------
> Enclosure : BladeSystem c7000 Enclosure
> node1, node2, node3 : HP ProLiant BL460c G6 (CPU: Xeon E5540, Mem: 16G) --- Blade
>                       NIC: Flex-10 Embedded Ethernet x 1 (2 ports)
>                       NIC: NC325m Quad Port 1Gb NIC for c-Class BladeSystem (4 ports)
> SW : GbE2c Ethernet Blade Switch x 6
> -------------------------
>
> In addition, I cut the interfaces off via the switch.
> * In the second report, I did not execute the brctl command.
>
> Is more detailed HW information necessary?
> If any information is needed, I will send it.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Wed, 2013/6/12, Jan Friesse <[email protected]> wrote:
>
>> Hideo,
>> can you please tell me exact reproducer for physical hw? (because brctl
>> delif is I believe not valid in hw at all).
>>
>> Thanks,
>>   Honza
>>
>> [email protected] wrote:
>>> Hi Fabio,
>>>
>>> Thank you for the comment.
>>>
>>>> I'll let Honza look at it, I don't have enough physical hardware to
>>>> reproduce.
>>>
>>> All right.
>>>
>>> Many Thanks!
>>> Hideo Yamauchi.
>>>
>>>
>>> --- On Tue, 2013/6/11, Fabio M. Di Nitto <[email protected]> wrote:
>>>
>>>> Hi Yamauchi-san,
>>>>
>>>> I'll let Honza look at it, I don't have enough physical hardware to
>>>> reproduce.
>>>>
>>>> Fabio
>>>>
>>>> On 06/11/2013 01:15 AM, [email protected] wrote:
>>>>> Hi Fabio,
>>>>>
>>>>> Thank you for the comments.
>>>>>
>>>>> We confirmed this problem in the physical environment as well.
>>>>> The corosync communication goes through eth1 and eth2.
>>>>>
>>>>> -------------------------------------------------------
>>>>> [root@bl460g6a ~]# ip addr show
>>>>> (snip)
>>>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>>>>     link/ether f4:ce:46:b3:fe:3c brd ff:ff:ff:ff:ff:ff
>>>>>     inet 192.168.101.9/24 brd 192.168.101.255 scope global eth1
>>>>>     inet6 fe80::f6ce:46ff:feb3:fe3c/64 scope link
>>>>>        valid_lft forever preferred_lft forever
>>>>> 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>>>>     link/ether 18:a9:05:78:6c:f0 brd ff:ff:ff:ff:ff:ff
>>>>>     inet 192.168.102.9/24 brd 192.168.102.255 scope global eth2
>>>>>     inet6 fe80::1aa9:5ff:fe78:6cf0/64 scope link
>>>>>        valid_lft forever preferred_lft forever
>>>>> (snip)
>>>>> 8: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
>>>>>     link/ether 52:54:00:7f:f3:0a brd ff:ff:ff:ff:ff:ff
>>>>>     inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
>>>>> 9: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 500
>>>>>     link/ether 52:54:00:7f:f3:0a brd ff:ff:ff:ff:ff:ff
>>>>> -----------------------------------------------
>>>>>
>>>>> I think it is not a problem specific to the virtual environment.
>>>>>
>>>>> Just to make sure, I attach the logs I collected on the three
>>>>> blades (RHEL 6.4).
>>>>> * I cut off the communication at the network switch.
>>>>>
>>>>> The phenomenon is similar: one node loops back into the OPERATIONAL
>>>>> state, while the other two nodes never settle in the OPERATIONAL
>>>>> state.
>>>>>
>>>>> Is this, after all, the same problem as the bug you pointed out?
>>>>>> Check this thread as reference:
>>>>>> http://lists.linuxfoundation.org/pipermail/openais/2013-April/016792.html
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>>
>>>>> --- On Fri, 2013/5/31, Fabio M. Di Nitto <[email protected]> wrote:
>>>>>
>>>>>> On 5/31/2013 7:12 AM, [email protected] wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We discovered a problem with the network communication of corosync.
>>>>>>>
>>>>>>> We built a corosync cluster of three nodes on KVM.
>>>>>>>
>>>>>>> Step 1) Start the corosync service on all nodes.
>>>>>>>
>>>>>>> Step 2) Confirm that the cluster is formed by all nodes and has
>>>>>>> reached the OPERATIONAL state.
>>>>>>>
>>>>>>> Step 3) Cut off the network of node1 (rh64-coro1) and node2
>>>>>>> (rh64-coro2) from the KVM host.
>>>>>>>
>>>>>>> [root@kvm-host ~]# brctl delif virbr3 vnet5; brctl delif virbr2 vnet1
>>>>>>>
>>>>>>> Step 4) Because a problem occurred, we stopped all nodes.
>>>>>>>
>>>>>>>
>>>>>>> The problem occurs at step 3.
>>>>>>>
>>>>>>> One node (rh64-coro1) keeps cycling through states after having
>>>>>>> reached the OPERATIONAL state.
>>>>>>>
>>>>>>> The other two nodes (rh64-coro2 and rh64-coro3) keep changing state
>>>>>>> and never seem to reach the OPERATIONAL state while the first node
>>>>>>> is running.
>>>>>>>
>>>>>>> This means that the two nodes (rh64-coro2 and rh64-coro3) cannot
>>>>>>> complete forming the cluster.
>>>>>>> When this network trouble happens, in a setup where corosync is
>>>>>>> combined with Pacemaker, corosync cannot notify Pacemaker of the
>>>>>>> change in the cluster membership.
>>>>>>>
>>>>>>>
>>>>>>> Question 1) Are there any parameters in corosync.conf to solve this
>>>>>>> problem?
>>>>>>> * We think it could be worked around by bonding the interfaces and
>>>>>>> setting "rrp_mode:none", but we do not want to set "rrp_mode:none".
>>>>>>>
>>>>>>> Question 2) Is this a bug? Or is it the specified behaviour of
>>>>>>> corosync's communication?
>>>>>>
>>>>>> We already checked this specific test, and it appears to be a bug in
>>>>>> the kernel bridge code when handling multicast traffic (groups are
>>>>>> not joined correctly and traffic is not forwarded).
>>>>>>
>>>>>> Check this thread as reference:
>>>>>> http://lists.linuxfoundation.org/pipermail/openais/2013-April/016792.html
>>>>>>
>>>>>> Thanks
>>>>>> Fabio
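P.S. a rough corosync.conf fragment for the UDPU transport mentioned above,
only as an illustration (corosync 1.4.x / RHEL 6.4 syntax; node1's address
is taken from your quoted "ip addr" output, the other member addresses are
placeholders you must replace, and your second ring on eth2 would need its
own interface block with ringnumber 1):

  totem {
          version: 2
          # use unicast UDP instead of multicast
          transport: udpu

          interface {
                  ringnumber: 0
                  bindnetaddr: 192.168.101.0
                  mcastport: 5405

                  # with udpu every cluster node must be listed explicitly
                  member {
                          memberaddr: 192.168.101.9
                  }
                  member {
                          memberaddr: <node2 eth1 address>
                  }
                  member {
                          memberaddr: <node3 eth1 address>
                  }
          }
  }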
