Hi Steve,

> It is hard to tell what is happening without logs from all 3 nodes. Does
> this only happen at system start, or can you duplicate 5 minutes after
> systems have started?


>>> The cluster never stabilizes. It keeps switching between the membership
and operational states.
Below is the test network I am using:


[image: Untitled.png]

>>> N1 and N3 do not receive any packets from each other. What I expected
was that either (N1, N2) or (N2, N3) would form a two-node cluster and
stabilize. But the cluster never stabilizes: even though two-node clusters
do form, they keep going back to the membership state. [I checked the logs,
and this appears to be caused by the steps I described in my previous mail.]
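
To make the expectation above concrete, here is a minimal Python sketch. It is an illustration only and not how corosync computes membership: it just lists the largest groups of nodes in which every pair can exchange packets, given the links in the test network. The names LINKS, NODES, mutually_reachable and candidate_rings are hypothetical and chosen for this example.

```python
from itertools import combinations

# Assumed links in the test network: N1 <-> N2 and N2 <-> N3.
# N1 and N3 do not receive packets from each other.
LINKS = {frozenset({1, 2}), frozenset({2, 3})}
NODES = [1, 2, 3]


def mutually_reachable(group):
    """True if every pair of nodes in the group shares a direct link."""
    return all(frozenset(pair) in LINKS for pair in combinations(group, 2))


def candidate_rings(nodes):
    """Return the largest groups in which all nodes can hear each other."""
    for size in range(len(nodes), 0, -1):
        groups = [set(g) for g in combinations(nodes, size)
                  if mutually_reachable(g)]
        if groups:
            return groups
    return []


print(candidate_rings(NODES))  # [{1, 2}, {2, 3}]
```

With this topology the only fully connected groups of size two are (N1, N2) and (N2, N3), which is why I expected one of them to form a stable ring.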



Regards,
Ranjith
On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <[email protected]> wrote:

> It is hard to tell what is happening without logs from all 3 nodes. Does
> this only happen at system start, or can you duplicate 5 minutes after
> systems have started?
>
> If it is at system start, you may need to enable "fast STP" on your switch.
>  It looks to me like node 3 gets some messages through but then is blocked.
>  STP will do this in its default state on most switches.
>
> Another option if you can't enable STP is to use broadcast mode (man
> openais.conf for details).
>
> Also verify firewalls are properly configured on all nodes.  You can join
> us on the irc server freenode on #linux-cluster for real-time assistance.
>
> Regards
> -steve
>
>
> On 09/22/2010 11:33 PM, Ranjith wrote:
>
>>  Hi Steve,
>>  I am running corosync 1.2.8
>>  I didn't understand what you meant by blackbox. I suppose it means the
>>  logs/debug output.
>>  I just checked the logs/debug output, and this is what I understand:
>>                                   1--------------2--------------3
>> 1) Node1 and Node2 are already in a 2-node cluster
>> 2) Now Node3 sends a join with ({1}, {}) (proc_list/fail_list)
>> 3) Node2 sends a join ({1,2,3}, {}) and Nodes 1 and 3 update to
>> ({1,2,3}, {})
>> 4) Now Node2 gets consensus after some messages [but Node1 is the rep]
>> 5) The consensus timeout fires at Node1 for Node3; Node1 sends a join
>> with ({1,2}, {3})
>> 6) Node2 updates to ({1,2}, {3}) because of the above message and sends
>> out a join. This join, received by Node3, causes it to update to
>> ({1,3}, {2})
>> 7) Node1 and Node2 enter operational (fail list cleared by Node2), but
>> the join timeout fires at Node3 and it goes back to the membership state.
>> 8) This continues until the consensus timeout fires at Node3 for Node1
>> and it moves to ({3}, {1,2})
>> 9) Now Node1 and Node2 form a 2-node cluster and Node3 forms a
>> single-node cluster
>> 10) Now Node2 broadcasts a normal message
>> 11) This message is received by Node3 as a foreign message, which forces
>> it to go to the gather state
>> 12) The above steps then repeat.
>> The cluster never stabilizes.
>> I have attached the debug output for Node2:
>> (1 - 10.102.33.115, 2 - 10.102.33.150, 3 - 10.102.33.180)
>> Regards,
>> Ranjith
>>
>> On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <[email protected]> wrote:
>>
>>    On 09/21/2010 11:15 PM, Ranjith wrote:
>>
>>        Hi all,
>>        Kindly comment on the above behaviour
>>        Regards,
>>        Ranjith
>>
>>        On Tue, Sep 21, 2010 at 9:52 PM, Ranjith <[email protected]> wrote:
>>
>>            Hi all,
>>            I was testing the corosync cluster engine using the testcpg
>>            executable provided with the release. I am seeing the behaviour
>>            below while testing some specific scenarios. Kindly comment on
>>            the expected behaviour.
>>            1)   3 Node cluster
>>                               1---------2---------3
>>                 a) Suppose I bring nodes 1 and 2 up; they form a ring (1,2)
>>                 b) Now bring up 3
>>                 c) 3 sends a join, which restarts the membership process
>>                 d) (1,2) again forms the ring; 3 forms a cluster by itself
>>                 e) Now 3 sends a join (due to the join timeout or another
>>                    timeout)
>>                 f) The membership protocol starts again, as 2 responds to
>>                    this by going to the gather state (I believe 2 should
>>                    not accept this, as 2 would have earlier decided that 3
>>                    has failed)
>>                 I am seeing a continuous loop of the above behaviour
>>                 (operational -> membership -> operational -> ...), due to
>>                 which the cluster does not stabilize
>>            2)   3 Node Cluster
>>                               1---------2-----------3
>>                  a) Bring up all three nodes at the same time (none of the
>>                     nodes have seen each other before this)
>>                  b) Now each node forms a cluster by itself (here I think
>>                     it should form either a (1,2) or a (2,3) ring)
>>            Regards,
>>            Ranjith
>>
>>
>>
>>
>>    Ranjith,
>>
>>    Which version of corosync are you running?
>>
>>    Can you run corosync-blackbox and attach the output?
>>
>>    Thanks
>>    -steve
>>
>>
>>        _______________________________________________
>>        Openais mailing list
>>        [email protected]
>>
>>        https://lists.linux-foundation.org/mailman/listinfo/openais
>>
>>
>>
>>
>


