Re: [Openais] Corosync 1.2.8 totem membership behaviour

Steven Dake Thu, 30 Sep 2010 11:18:21 -0700

On 09/30/2010 10:40 AM, Ranjith wrote:
> Hi Steve,
>
> I believe you mean that the same ACL rules should also be applied on
> the outgoing side.
> But since the nodes here are not receiving any packets (multicast or
> unicast) from each other, I believe they will also not send to each
> other. Is that right?
>
>


That assumption is incorrect.  Example:

Nodes: A, B, C
A sends join (multicast)
B,C receive join
B sends join (multicast)
A,C receive join
C sends join (with A,B,C)
now A rejects that message.

As a result, the nodes can never come to consensus.
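
The exchange above can be sketched as a toy model in Python. This is not corosync's totem code. It assumes "A rejects that message" means A drops C's packets on input while A's own packets still reach C, and it treats "consensus" simply as having seen a matching join from every member of the node's own list. The names run_joins and blocked_inbound are made up for the illustration.

```python
# Toy model only: this is NOT corosync's totem implementation.
NODES = ("A", "B", "C")


def run_joins(blocked_inbound, rounds=3):
    """Simulate a few rounds of multicast join messages.

    blocked_inbound maps a receiver to the set of senders whose packets
    it drops on input (a one-way filter: the receiver can still send).
    Returns {node: bool}: whether the node saw a join carrying its own
    member list from every other member of that list.
    """
    members = {n: {n} for n in NODES}   # each node's current member list
    last_join = {n: {} for n in NODES}  # receiver -> {sender: list in sender's last join}

    for _ in range(rounds):
        for sender in NODES:
            join = frozenset(members[sender])
            for receiver in NODES:
                if receiver == sender:
                    continue
                if sender in blocked_inbound.get(receiver, set()):
                    continue  # dropped on the receiver's input side
                members[receiver] |= join
                last_join[receiver][sender] = join

    return {
        n: all(last_join[n].get(m) == frozenset(members[n])
               for m in members[n] if m != n)
        for n in NODES
    }


one_way = run_joins({"A": {"C"}})  # A drops packets from C, but C still hears A
healthy = run_joins({})            # no filtering at all
```

In the one-way case A learns about C through B's join but never sees a join from C itself, so A cannot confirm the list that B and C already agree on. With no filtering, all three nodes end up with matching views.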

Regards
-steve

> Regards,
> Ranjith
>
> On Thu, Sep 30, 2010 at 10:41 PM, Steven Dake <[email protected]> wrote:
>
>     On 09/30/2010 03:47 AM, Ranjith wrote:
>
>         Hi all,
>
>         Kindly let me know whether corosync considers the network below a
>         byzantine failure, i.e. the case where N1 and N3 do not have
>         connectivity.
>         I am testing such scenarios because I believe this behaviour can
>         happen due to some misbehaviour in a switch (stale ARP entries).
>
>
>
>     What makes the fault byzantine is that only incoming packets are
>     blocked.  If you block both incoming and outgoing packets on the
>     nodes, the fault is not byzantine and totem will behave properly.
>
>     Regards
>     -steve
>
>         Regards,
>         Ranjith
>
>
>         Untitled.png
>         On Sat, Sep 25, 2010 at 9:47 AM, Ranjith <[email protected]> wrote:
>
>             Hi Steve,
>             Just to make it clear: do you mean that in the above case, if N3
>             is part of the network, it should have connectivity to both N2
>             and N1, and that if N3 has connectivity to N2 only, corosync
>             does not handle that case?
>             Regards,
>             Ranjith
>             On Sat, Sep 25, 2010 at 9:39 AM, Steven Dake <[email protected]> wrote:
>
>                 On 09/24/2010 08:20 PM, Ranjith wrote:
>
>                     Hi,
>                     It is hard to tell what is happening without logs from all 3
>                     nodes. Does this only happen at system start, or can you
>                     duplicate 5 minutes after systems have started?
>
>                     >> The cluster is never stabilizing. It keeps on switching
>                     >> between the membership and operational state.
>                     >> Below is the test network which I am using:
>
>                     Untitled.png
>
>                     >> N1 and N3 do not receive any packets from each other. What I
>                     >> expected here was that either (N1,N2) or (N2,N3) forms a
>                     >> two-node cluster and stabilizes. But the cluster never
>                     >> stabilizes: even though two-node clusters form, it goes back
>                     >> to the membership state. [I checked the logs, and this seems
>                     >> to happen because of the steps I mentioned in the previous
>                     >> mail.]
>
>
>
>                 ......  Where did you say you were testing a byzantine fault in
>                 your original bug report?  Please be more forthcoming in the
>                 future. Corosync does not protect against byzantine faults.
>                 Allowing one-way connectivity on a network connection is
>                 exactly this fault scenario.  You can try coro-netctl (the
>                 attached script), which will atomically block a network IP in
>                 the network to test split-brain scenarios without actually
>                 pulling network cables.
>
>                 Regards
>                 -steve
>
>
>                     Regards,
>                     Ranjith
>                     On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <[email protected]> wrote:
>
>                         It is hard to tell what is happening without logs from all 3
>                         nodes. Does this only happen at system start, or can you
>                         duplicate 5 minutes after systems have started?
>
>                         If it is at system start, you may need to enable "fast STP"
>                         on your switch.  It looks to me like node 3 gets some
>                         messages through but then is blocked.  STP will do this in
>                         its default state on most switches.
>
>                         Another option if you can't enable fast STP is to use
>                         broadcast mode (man openais.conf for details).
>
>                         Also verify firewalls are properly configured on all nodes.
>                         You can join us on the irc server freenode on #linux-cluster
>                         for real-time assistance.
>
>                         Regards
>                         -steve
>
>
>                         On 09/22/2010 11:33 PM, Ranjith wrote:
>
>                             Hi Steve,
>                             I am running corosync 1.2.8.
>                             I didn't get what you meant by blackbox. I suppose it is
>                             logs/debugs.
>                             I just checked the logs/debugs and I am able to understand
>                             the following:
>
>                             1--------------2--------------3
>
>                             1) Node1 and Node2 are already in a 2-node cluster.
>                             2) Now Node3 sends join with ({1}, {})
>                                (proc_list/fail_list).
>                             3) Node2 sends join ({1,2,3}, {}) and Node1 and Node3
>                                update to ({1,2,3}, {}).
>                             4) Now Node2 gets consensus after some messages
>                                [but 1 is the rep].
>                             5) Consensus timeout fires at node 1 for node 3; node 1
>                                sends join as ({1,2}, {3}).
>                             6) Node2 updates because of the above message to
>                                ({1,2}, {3}) and sends out join. This join, received
>                                by node 3, causes it to update to ({1,3}, {2}).
>                             7) Node1 and Node2 enter operational (fail list cleared
>                                by node 2), but the node 3 join timeout fires and it
>                                enters membership state again.
>                             8) This will continue to happen until consensus fires at
>                                node 3 for node 1 and it moves to ({3}, {1,2}).
>                             9) Now Node1 and Node2 form a 2-node cluster and 3 forms
>                                a single-node cluster.
>                             10) Now node 2 broadcasts a Normal message.
>                             11) This message is received by Node3 as a foreign
>                                 message, which forces it to go to gather state.
>                             12) Again above steps ....
>
>                             The cluster is never stabilizing.
>                             I have attached the debugs for Node2:
>                             (1 - 10.102.33.115, 2 - 10.102.33.150, 3 - 10.102.33.180)
>                             Regards,
>                             Ranjith
>
>                             On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <[email protected]> wrote:
>
>                                 On 09/21/2010 11:15 PM, Ranjith wrote:
>
>                                     Hi all,
>                                     Kindly comment on the above behaviour
>                                     Regards,
>                                     Ranjith
>
>                                     On Tue, Sep 21, 2010 at 9:52 PM, Ranjith <[email protected]> wrote:
>
>                                         Hi all,
>                                         I was testing the corosync cluster engine using the
>                                         testcpg executable provided with the release. I am
>                                         getting the behaviour below while testing some
>                                         specific scenarios. Kindly comment on the expected
>                                         behaviour.
>
>                                         1)   3-node cluster
>
>                                              1---------2---------3
>
>                                              a) Suppose I bring nodes 1 and 2 up; they
>                                                 form a ring (1,2).
>                                              b) Now bring up 3.
>                                              c) 3 sends a join, which restarts the
>                                                 membership process.
>                                              d) (1,2) again form the ring; 3 forms a
>                                                 cluster by itself.
>                                              e) Now 3 sends a join (due to join or other
>                                                 timeout).
>                                              f) The membership protocol is started again,
>                                                 as 2 responds to this by going to gather
>                                                 state (I believe 2 should not accept this,
>                                                 as 2 would have earlier decided that 3 has
>                                                 failed).
>
>                                              I am seeing a continuous loop of the above
>                                              behaviour (operational -> membership ->
>                                              operational -> ...), due to which the cluster
>                                              never stabilizes.
>
>                                         2)   3-node cluster
>
>                                              1---------2-----------3
>
>                                              a) Bring up all three nodes at the same time
>                                                 (none of the nodes have seen each other
>                                                 before this).
>                                              b) Now each node forms a cluster by itself.
>                                                 (Here I think it should form either a
>                                                 (1,2) or a (2,3) ring.)
>
>                                         Regards,
>                                         Ranjith
>
>
>
>
>                                 Ranjith,
>
>                                 Which version of corosync are you running?
>
>                                 can you run corosync-blackbox and attach
>         the output?
>
>                                 Thanks
>                                 -steve
>
>
>
>           _______________________________________________
>                                     Openais mailing list
>                                     [email protected]
>
>         https://lists.linux-foundation.org/mailman/listinfo/openais

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
