It is hard to tell what is happening without logs from all 3 nodes.
Does this happen only at system start, or can you reproduce it 5 minutes
after the systems have started?

If it is at system start, you may need to enable "fast STP" on your
switch.  It looks to me like node 3 gets some messages through but then
is blocked.  STP (spanning tree protocol) will do this in its default
state on most switches.

Another option, if you can't enable fast STP, is to use broadcast mode
(man openais.conf for details).
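
As a rough sketch of what broadcast mode looks like in the totem
section of the config file: the bindnetaddr value below is an
assumption (a /24 network guessed from the node addresses in this
thread), and mcastport is the usual default. When broadcast is set to
yes, mcastaddr should not be set. Check the man page for your version
before relying on this.

```
totem {
        version: 2
        interface {
                ringnumber: 0
                # assumption: nodes are on 10.102.33.0/24
                bindnetaddr: 10.102.33.0
                # use the broadcast address instead of multicast;
                # do not set mcastaddr together with this option
                broadcast: yes
                mcastport: 5405
        }
}
```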

Also verify firewalls are properly configured on all nodes.  You can
join us in #linux-cluster on the Freenode IRC network for real-time
assistance.
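
One way to check whether the switch or a firewall is dropping
multicast between the nodes, independent of corosync, is a small
send/listen test. The following is only a sketch: the group address
and port are assumptions (substitute the mcastaddr and mcastport from
your own config), the function names are made up for illustration, and
corosync should be stopped first so the port is free.

```python
import socket
import sys

GROUP = "226.94.1.1"  # assumption: replace with your configured mcastaddr
PORT = 5405           # assumption: replace with your configured mcastport


def build_mreq(group: str) -> bytes:
    """Build the ip_mreq value: group address followed by INADDR_ANY."""
    return socket.inet_aton(group) + socket.inet_aton("0.0.0.0")


def listen(group: str = GROUP, port: int = PORT) -> None:
    """Join the multicast group and print every datagram received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group))
    while True:
        data, addr = sock.recvfrom(1024)
        print(f"{addr[0]}: {data.decode(errors='replace')}")


def send(message: str, group: str = GROUP, port: int = PORT) -> None:
    """Send one datagram to the multicast group (TTL 1, local segment only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(message.encode(), (group, port))
    sock.close()


if __name__ == "__main__":
    # Usage: "script.py listen" on two nodes, "script.py send" on the third.
    if len(sys.argv) > 1 and sys.argv[1] == "listen":
        listen()
    elif len(sys.argv) > 1 and sys.argv[1] == "send":
        send("hello from " + socket.gethostname())
```

Run it in listen mode on two nodes and in send mode on the third, then
rotate the roles. If node 3's messages stop arriving at the others
shortly after its link comes up, that points at the switch or a
firewall rather than at corosync itself.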

Regards
-steve

On 09/22/2010 11:33 PM, Ranjith wrote:
> Hi Steve,
>   I am running corosync 1.2.8
>   I didn't get what you meant by blackbox. I suppose it is logs/debugs.
>   I just checked logs/debugs and I am able to understand the below:
>                                    1--------------2--------------3
> 1) Node1 and Node2 are already in a 2node cluster
> 2) Now Node3 sends join with ({1} , {} ) (proc_list/fail_list)
> 3) Node2 sends join ({1,2,3} , {}) and Node 1/3 updates to ({1,2,3}, {})
> 4) Now Node 2 gets consensus after some messages [But 1 is the rep]
> 5) Consensus timeout fires at node 1 for node 3, node1 sends join as
> ({1,2}, {3})
> 6) Node2 updates because of the above message to ({1,2}, {3}) and sends
> out join. This join received by node 3 causes it to update ({1,3}, {2})
> 7) Node1 and Node2 enter operational (fail list cleared by node2) but
> node 3's join timeout fires and it enters the membership state again.
> 8) This will continue to happen until consensus fires at node3 for node1
> and it moves to ({3}, {1,2})
> 9) Now Node1 and Node2 form a 2-node cluster and 3 forms a single-node cluster
> 10) Now node 2 broadcast a Normal message
> 11) This message is received by Node3 as a foreign message which forces
> it to go to gather state
> 12) Again above steps ....
> The cluster is never stabilizing.
> I have attached the debugs for Node2:
> (1 - 10.102.33.115, 2 - 10.102.33.150, 3 -10.102.33.180)
> Regards,
> Ranjith
>
> On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <[email protected]
> <mailto:[email protected]>> wrote:
>
>     On 09/21/2010 11:15 PM, Ranjith wrote:
>
>         Hi all,
>         Kindly comment on the above behaviour
>         Regards,
>         Ranjith
>
>         On Tue, Sep 21, 2010 at 9:52 PM, Ranjith
>         <[email protected] <mailto:[email protected]>
>         <mailto:[email protected]
>         <mailto:[email protected]>>> wrote:
>
>             Hi all,
>             I was testing the corosync cluster engine by using the testcpg
>             exec provided along with the release. I am getting the below
>             behaviour while testing some specific scenarios. Kindly comment
>             on the expected behaviour.
>             1)   3 Node cluster
>                                1---------2---------3
>                  a) suppose I bring the nodes 1&2 up, it will form a ring (1,2)
>                  b) now bring up 3
>                  c) 3 sends join which restarts the membership process
>                  d) (1,2) again forms the ring, 3 forms self cluster
>                  e) now 3 sends a join (due to join or other timeout)
>                  f) again membership protocol is started as 2 responds to this
>                     by going to gather state (I believe 2 should not accept
>                     this as 2 would have earlier decided that 3 is failed)
>                  I am seeing a continuous loop of the above behaviour
>                  (operational -> membership -> operational -> ) due to which
>                  the cluster is not becoming stabilized
>             2)   3 Node Cluster
>                                1---------2-----------3
>                   a) bring up all the three nodes at the same time (None of the
>                      nodes have seen each other before this)
>                   b) Now each node forms a cluster by itself .. (Here I think it
>                      should form either a (1,2) or (2,3) ring)
>             Regards,
>             Ranjith
>
>
>
>
>     Ranjith,
>
>     Which version of corosync are you running?
>
>     can you run corosync-blackbox and attach the output?
>
>     Thanks
>     -steve
>
>
>         _______________________________________________
>         Openais mailing list
>         [email protected]
>         <mailto:[email protected]>
>         https://lists.linux-foundation.org/mailman/listinfo/openais
>
>
>
