Hi Steve,

The network is like this:
A (block all packets from src C)
B
C (block all packets from src A)
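
To reproduce the block I use something along these lines (a minimal sketch, assuming the drop is done with iptables and using placeholder addresses; in reality it could equally be a switch ACL):

    import subprocess

    # Placeholder addresses for the test, not my real ones.
    ADDR_A = "10.0.0.1"
    ADDR_C = "10.0.0.3"

    def block_incoming_from(src_ip):
        # Drop every packet whose source address is src_ip.  Only the
        # INPUT chain is touched, so this node's own outgoing traffic to
        # the other node still leaves the box -- the block is only on
        # the receive side.
        subprocess.run(["iptables", "-A", "INPUT", "-s", src_ip, "-j", "DROP"],
                       check=True)

    # On node A:  block_incoming_from(ADDR_C)
    # On node C:  block_incoming_from(ADDR_A)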


Nodes
A, B, C
A sends join (multicast).
Only B receives it (C drops it because of its ACL).
B sends join (multicast) (with A, B).
A and C receive the join.
C sends join (with A, B, C).
Only B receives the above (A drops it because of its ACL).
B sends join (with A, B, C).
A and C send join (with A, B, C).
B gets consensus, but suppose A has the smallest ID (so A would be the rep).

But A never gets consensus, as A can never receive the join from C.

Am I correct up to this point?
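
To make my reasoning concrete, here is a toy model of the exchange (just a sketch of the message flow, not corosync code; the real protocol also compares fail_lists, runs timeouts, etc.):

    # Toy model of the join exchange above -- not corosync code, just a sketch.
    # proc_list[n]  : members node n knows about (from joins it has received)
    # heard_from[n] : nodes n has actually received a join from
    # "consensus" here = n has received a join from everyone in its proc_list.

    DROPS = {("C", "A"), ("A", "C")}     # (src, dst) pairs dropped by the ACLs
    NODES = ["A", "B", "C"]

    proc_list = {n: {n} for n in NODES}
    heard_from = {n: {n} for n in NODES}

    def send_join(src):
        """src multicasts a join carrying its current proc_list."""
        changed = False
        for dst in NODES:
            if dst == src or (src, dst) in DROPS:
                continue                 # filtered out on the receive side
            if src not in heard_from[dst] or not proc_list[src] <= proc_list[dst]:
                changed = True
            heard_from[dst].add(src)
            proc_list[dst] |= proc_list[src]
        return changed

    changed = True
    while changed:                       # keep exchanging joins until stable
        changed = False
        for n in NODES:
            changed |= send_join(n)

    for n in NODES:
        ok = proc_list[n] <= heard_from[n]
        print(n, sorted(proc_list[n]), "consensus" if ok else "no consensus")

    # Prints (as I expect from the steps above):
    #   A ['A', 'B', 'C'] no consensus   <- A hears about C via B, but never
    #                                       gets a join from C itself
    #   B ['A', 'B', 'C'] consensus
    #   C ['A', 'B', 'C'] no consensus   <- same problem in the other direction

So only B ends up having received joins from everyone, which matches the steps above.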

Regards,
Ranjith



On Thu, Sep 30, 2010 at 11:49 PM, Steven Dake <[email protected]> wrote:

> On 09/30/2010 10:40 AM, Ranjith wrote:
>
>> Hi Steve,
>>
>> I believe you mean that the same ACL rules should also be applied on
>> the outgoing side.
>> But since the nodes here are not receiving any packets (both multicast
>> and unicast) from each other, I believe they will also not send to each
>> other... Is that right?
>>
>>
>>
> That assumption is incorrect.  Example:
>
> Nodes
> A,B,C
> A sends join (multicast)
> B,C receive join
> B sends join (multicast)
> A,C receive join
> C sends join (with A,B,C)
> Now A rejects that message.
>
> As a result, the nodes can never come to consensus.
>
> Regards
> -steve
>
>  Regards,
>> Ranjith
>>
>> On Thu, Sep 30, 2010 at 10:41 PM, Steven Dake <[email protected]> wrote:
>>
>>    On 09/30/2010 03:47 AM, Ranjith wrote:
>>
>>        Hi all,
>>
>>        Kindly let me know whether corosync considers the network below
>>        as a byzantine failure, i.e. the case where N1 and N3 do not have
>>        connectivity?
>>        I am testing such scenarios as I believe such behaviour can happen
>>        due to some misbehaviour in the switch (stale ARP entries).
>>
>>
>>
>>    What makes the fault byzantine is that only incoming packets are
>>    blocked.  If you block both incoming and outgoing packets on the
>>    nodes, the fault is not byzantine and totem will behave properly.
>>
>>    Regards
>>    -steve
>>
>>        Regards,
>>        Ranjith
>>
>>
>>        Untitled.png
>>        On Sat, Sep 25, 2010 at 9:47 AM, Ranjith <[email protected]> wrote:
>>
>>            Hi Steve,
>>            Just to make it clear: do you mean that in the above case, if
>>            N3 is part of the network, it should have connectivity to both
>>            N2 and N1, and if it so happens that N3 has connectivity to N2
>>            only, corosync does not take care of that?
>>            Regards,
>>            Ranjith
>>            On Sat, Sep 25, 2010 at 9:39 AM, Steven Dake <[email protected]> wrote:
>>
>>                On 09/24/2010 08:20 PM, Ranjith wrote:
>>
>>                    Hi ,
>>                    It is hard to tell what is happening without logs
>>        from all 3
>>                    nodes. Does
>>                    this only happen at system start, or can you duplicate
>> 5
>>                    minutes after
>>                    systems have started?
>>
>>         >> The cluster is never stabilizing. It keeps switching between
>>                    the membership and operational states.
>>                    Below is the test network which I am using:
>>
>>                    Untitled.png
>>
>>         >> N1 and N3 do not receive any packets from each other. Here
>>                    what I expected was that either (N1, N2) or (N2, N3)
>>                    forms a two-node cluster and stabilizes. But the cluster
>>                    never stabilizes: even though two-node clusters are
>>                    forming, it keeps going back to membership. [I checked
>>                    the logs, and it looks like this is happening because of
>>                    the steps I mentioned in the previous mail.]
>>
>>
>>
>>                ......  Where did you say you were testing a byzantine
>>                fault in your original bug report?  Please be more
>>                forthcoming in the future.  Corosync does not protect
>>                against byzantine faults.  Allowing one-way connectivity in
>>                a network connection = this fault scenario.  You can try
>>                coro-netctl (the attached script), which will atomically
>>                block a network IP in the network to test split-brain
>>                scenarios without actually pulling network cables.
>>
>>                Regards
>>                -steve
>>
>>
>>                    Regards,
>>                    Ranjith
>>                    On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <[email protected]> wrote:
>>
>>                        It is hard to tell what is happening without
>>        logs from
>>                    all 3 nodes.
>>                        Does this only happen at system start, or can you
>>                    duplicate 5
>>                        minutes after systems have started?
>>
>>                        If it is at system start, you may need to enable
>>        "fast
>>                    STP" on your
>>                        switch.  It looks to me like node 3 gets some
>>        messages
>>                    through but
>>                        then is blocked.  STP will do this in its
>>        default state
>>                    on most
>>                        switches.
>>
>>                        Another option if you can't enable STP is to use
>>                    broadcast mode (man
>>                        openais.conf for details).
>>
>>                        Also verify firewalls are properly configured on
>> all
>>                    nodes.  You can
>>                        join us on the irc server freenode on
>>        #linux-cluster for
>>                    real-time
>>                        assistance.
>>
>>                        Regards
>>                        -steve
>>
>>
>>                        On 09/22/2010 11:33 PM, Ranjith wrote:
>>
>>                            Hi Steve,
>>                              I am running corosync 1.2.8
>>                              I didn't get what you meant by blackbox. I
>>        suppose it is
>>                            logs/debugs.
>>                              I just checked logs/debugs and I am able to
>>                    understand the below:
>>
>>                    1--------------2--------------3
>>                            1) Node1 and Node2 are already in a 2-node
>>        cluster
>>                            2) Now Node3 sends join with ({1} , {} )
>>                    (proc_list/fail_list)
>>                            3) Node2 sends join ({1,2,3} , {}) and Node 1/3
>>                    updates to
>>                            ({1,2,3}, {})
>>                            4) Now Node 2 gets consensus after some
>> messages
>>                    [But 1 is the rep]
>>                            5) Consensus timeout fires at node 1 for node
>> 3,
>>                    node1 sends join as
>>                            ({1,2}, {3})
>>                            6) Node2 updates because of the above message
>> to
>>                    ({1,2}, {3})
>>                            and sends
>>                            out join. This join received by node 3
>>        causes it to
>>                    update
>>                            ({1,3}, {2})
>>                            7) Node1 and Node2 enter operational (fail list
>>                    cleared by node2) but
>>                            node 3 join timeout fires and again
>>        membership state.
>>                            8) This will continue to happen until consensus
>>                    fires at node3
>>                            for node1
>>                            and it moves to ({3}, {1,2})
>>                            9) Now Node1 and Node2 form a 2-node cluster and 3
>>                    forms a single
>>                            node cluster
>>                            10) Now node 2 broadcasts a Normal message
>>                            11) This message is received by Node3 as a
>>        foreign
>>                    message which
>>                            forces
>>                            it to go to gather state
>>                            12) Again above steps ....
>>                            The cluster is never stabilizing.
>>                            I have attached the debugs for Node2:
>>                            (1 - 10.102.33.115, 2 - 10.102.33.150,
>>                            3 - 10.102.33.180)
>>                            Regards,
>>                            Ranjith
>>
>>                            On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <[email protected]> wrote:
>>
>>                                On 09/21/2010 11:15 PM, Ranjith wrote:
>>
>>                                    Hi all,
>>                                    Kindly comment on the above behaviour
>>                                    Regards,
>>                                    Ranjith
>>
>>                                    On Tue, Sep 21, 2010 at 9:52 PM, Ranjith <[email protected]> wrote:
>>
>>                                        Hi all,
>>                                        I was testing the corosync cluster
>>                    engine by using the
>>                                    testcpg exec
>>                                        provided along with the release.
>>        I am
>>                    getting the below
>>                                    behaviour
>>                                        while testing some specific
>>        scenarios.
>>                    Kindly
>>                            comment on the
>>                                        expected behaviour.
>>                                        1)   3 Node cluster
>>
>>        1---------2---------3
>>                                             a) suppose I bring the
>>        nodes 1&2
>>                    up, it will form a
>>                                    ring (1,2)
>>                                             b) now bring up 3
>>                                             c) 3 sends join which
>>        restarts the
>>                    membership
>>                            process
>>                                             d) (1,2) again forms the
>>        ring , 3
>>                    forms self
>>                            cluster
>>                                             e) now 3 sends a join (due
>>        to join
>>                    or other
>>                            timeout)
>>                                             f) again membership protocol
>> is
>>                    started as 2
>>                            responds
>>                                    to this
>>                                    by going to gather state (I believe 2
>>                                    should not accept this, as 2 would have
>>                                    earlier decided that 3 had failed)
>>                                             I am seeing a continuous loop
>>                                             of the above behaviour
>>                                             (operational -> membership ->
>>                                             operational -> ...) due to
>>                                             which the cluster never
>>                                             stabilizes.
>>                                        2)   3 Node Cluster
>>
>>        1---------2-----------3
>>                                              a) bring up all the three
>>        nodes at
>>                    the same
>>                            time (None
>>                                    of the
>>                                        nodes have seen each other
>>        before this)
>>                                              b) Now each node forms a
>>                                              cluster by itself. (Here I
>>                                              think it should form either a
>>                                              (1,2) or (2,3) ring.)
>>                                        Regards,
>>                                        Ranjith
>>
>>
>>
>>
>>                                Ranjith,
>>
>>                                Which version of corosync are you running?
>>
>>                                can you run corosync-blackbox and attach
>>        the output?
>>
>>                                Thanks
>>                                -steve
>>
>>
>>
>>          _______________________________________________
>>                                    Openais mailing list
>>        [email protected]
>>
>>        https://lists.linux-foundation.org/mailman/listinfo/openais
>>
>
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
