Hi Steve,

Please comment on the below.


Regards,
Ranjith


On Fri, Oct 1, 2010 at 12:04 AM, Ranjith <ranjith.nath...@gmail.com> wrote:

> Hi Steve,
>
> The network is like this:
> A (blocks all packets from src C)
> B
> C (blocks all packets from src A)
>
>
>
> Nodes: A, B, C
> A sends join (multicast)
> Only B receives it (C drops it because of the ACL)
> B sends join (multicast) (with A,B)
>
> A and C receive the join
> C sends join (with A,B,C)
> Only B receives the above
>
> B sends join (with A,B,C)
> A and C send join (with A,B,C)
> B gets consensus, but suppose A has the smallest node ID
>
> But A never gets consensus, as A cannot receive a join from C.
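>
> To check my understanding, here is a rough simulation sketch of the
> above (plain Python, not corosync code; it only models who receives
> whose join under the receive-side ACLs, and "consensus" here just
> means a node has seen a join carrying its full member set from every
> member, which is a simplification of the real protocol):
>
>     # Receive-side ACLs: receiver -> set of senders it drops.
>     DROPS = {"A": {"C"}, "B": set(), "C": {"A"}}
>     NODES = ["A", "B", "C"]
>
>     # procs[n]: the member set node n currently believes in.
>     procs = {n: {n} for n in NODES}
>     # agreed[n]: members whose join matched n's current member set.
>     agreed = {n: set() for n in NODES}
>
>     def multicast_join(sender):
>         """sender multicasts a join carrying its current member set."""
>         for rcv in NODES:
>             if sender in DROPS[rcv]:
>                 continue  # dropped by the receiver's ACL
>             if not procs[sender] <= procs[rcv]:
>                 # Learning new members restarts consensus tracking.
>                 procs[rcv] |= procs[sender]
>                 agreed[rcv] = set()
>             if procs[sender] == procs[rcv]:
>                 agreed[rcv].add(sender)
>
>     # A few rounds of everyone multicasting joins.
>     for _ in range(4):
>         for n in NODES:
>             multicast_join(n)
>
>     for n in NODES:
>         print(n, sorted(procs[n]), sorted(agreed[n]),
>               "consensus" if agreed[n] == procs[n] else "no consensus")
>
> Running it, B ends with consensus while A and C never do, since each
> is missing the other's join. (Setting DROPS to {"A": {"C"}, "B":
> set(), "C": set()} models your one-way example quoted below, and A
> still never gets there.)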
>
> Am I correct till this point?
>
> Regards,
> Ranjith
>
>
>
>
> On Thu, Sep 30, 2010 at 11:49 PM, Steven Dake <sd...@redhat.com> wrote:
>
>> On 09/30/2010 10:40 AM, Ranjith wrote:
>>
>>> Hi Steve,
>>>
>>> I believe you mean that the same ACL rules should be applied on the
>>> outgoing side also.
>>> But since the nodes here are not receiving any packets (multicast or
>>> unicast) from each other, I believe they will also not send to each
>>> other. Is that right?
>>>
>>>
>>>
>> That assumption is incorrect.  Example:
>>
>> Nodes
>> A,B,C
>> A sends join (multicast)
>> B,C receive join
>> B sends join (multicast)
>> A,C receive join
>> C sends join (with A,B,C)
>> Now A rejects that message.
>>
>> As a result, the nodes can never come to consensus.
>>
>> Regards
>> -steve
>>
>>> Regards,
>>> Ranjith
>>>
>>> On Thu, Sep 30, 2010 at 10:41 PM, Steven Dake <sd...@redhat.com> wrote:
>>>
>>>    On 09/30/2010 03:47 AM, Ranjith wrote:
>>>
>>>        Hi all,
>>>
>>>        Kindly let me know whether corosync considers the network below
>>>        a byzantine failure, i.e. the case where N1 and N3 do not have
>>>        connectivity.
>>>        I am testing such scenarios because I believe such behaviour can
>>>        happen due to switch misbehaviour (stale ARP entries).
>>>
>>>
>>>
>>>    What makes the fault byzantine is that only incoming packets are
>>>    blocked.  If you block both incoming and outgoing packets on the
>>>    nodes, the fault is not byzantine and totem will behave properly.
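>>>
>>>    For example, something along these lines on each side (a sketch;
>>>    the address is a placeholder for the peer's IP):
>>>
>>>        # on node A, drop traffic to AND from node C
>>>        iptables -A INPUT  -s 10.0.0.3 -j DROP
>>>        iptables -A OUTPUT -d 10.0.0.3 -j DROP
>>>
>>>    With the mirror-image rules on node C, neither node hears the
>>>    other in either direction, and the fault is a clean partition
>>>    rather than a byzantine one.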
>>>
>>>    Regards
>>>    -steve
>>>
>>>        Regards,
>>>        Ranjith
>>>
>>>
>>>        [attachment: Untitled.png]
>>>        On Sat, Sep 25, 2010 at 9:47 AM, Ranjith
>>>        <ranjith.nath...@gmail.com> wrote:
>>>
>>>            Hi Steve,
>>>            Just to make it clear: do you mean that if N3 is part of
>>>            the network, it should have connectivity to both N2 and N1,
>>>            and that if it happens that N3 has connectivity to N2 only,
>>>            corosync does not handle that case?
>>>            Regards,
>>>            Ranjith
>>>            On Sat, Sep 25, 2010 at 9:39 AM, Steven Dake
>>>            <sd...@redhat.com> wrote:
>>>
>>>                On 09/24/2010 08:20 PM, Ranjith wrote:
>>>
>>>                    Hi,
>>>                    It is hard to tell what is happening without logs
>>>                    from all 3 nodes. Does this only happen at system
>>>                    start, or can you duplicate it 5 minutes after the
>>>                    systems have started?
>>>
>>>                    >> The cluster is never stabilizing. It keeps
>>>                    switching between the membership and operational
>>>                    states. Below is the test network I am using:
>>>
>>>                    [attachment: Untitled.png]
>>>
>>>                    >> N1 and N3 do not receive any packets from each
>>>                    other. What I expected here was that either (N1,N2)
>>>                    or (N2,N3) would form a two-node cluster and
>>>                    stabilize. But the cluster never stabilizes: even
>>>                    though two-node clusters are forming, it goes back
>>>                    to membership [I checked the logs, and this seems
>>>                    to be happening because of the steps I mentioned in
>>>                    the previous mail].
>>>
>>>
>>>
>>>                ......  Where did you say you were testing a byzantine
>>>                fault in your original bug report?  Please be more
>>>                forthcoming in the future.  Corosync does not protect
>>>                against byzantine faults.  Allowing one-way
>>>                connectivity in a network connection = this fault
>>>                scenario.  You can try coro-netctl (the attached
>>>                script), which will atomically block a network IP to
>>>                test split-brain scenarios without actually pulling
>>>                network cables.
>>>
>>>                Regards
>>>                -steve
>>>
>>>
>>>                    Regards,
>>>                    Ranjith
>>>                    On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake
>>>                    <sd...@redhat.com> wrote:
>>>
>>>                        It is hard to tell what is happening without
>>>                        logs from all 3 nodes.  Does this only happen
>>>                        at system start, or can you duplicate it 5
>>>                        minutes after the systems have started?
>>>
>>>                        If it is at system start, you may need to
>>>                        enable "fast STP" on your switch.  It looks to
>>>                        me like node 3 gets some messages through but
>>>                        then is blocked.  STP will do this in its
>>>                        default state on most switches.
>>>
>>>                        Another option if you can't enable STP is to
>>>                        use broadcast mode (man openais.conf for
>>>                        details).
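>>>
>>>                        Roughly like this in the totem interface
>>>                        section (a sketch; the bind address is just an
>>>                        example for your 10.102.33.x subnet):
>>>
>>>                            interface {
>>>                                ringnumber: 0
>>>                                bindnetaddr: 10.102.33.0
>>>                                broadcast: yes  # instead of mcastaddr
>>>                                mcastport: 5405
>>>                            }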
>>>
>>>                        Also verify firewalls are properly configured
>>>                        on all nodes.  You can join us on the irc
>>>                        server freenode on #linux-cluster for real-time
>>>                        assistance.
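>>>
>>>                        For the firewall check above, the totem
>>>                        traffic must be allowed through; assuming the
>>>                        default mcastport of 5405, something like this
>>>                        on each node (adjust to your mcastport
>>>                        setting):
>>>
>>>                            iptables -A INPUT -p udp --dport 5404:5405 -j ACCEPT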
>>>
>>>                        Regards
>>>                        -steve
>>>
>>>
>>>                        On 09/22/2010 11:33 PM, Ranjith wrote:
>>>
>>>                            Hi Steve,
>>>                            I am running corosync 1.2.8.
>>>                            I didn't get what you meant by blackbox; I
>>>                            suppose it means the logs/debugs. I just
>>>                            checked the logs/debugs and I am able to
>>>                            understand the following:
>>>
>>>                            1--------------2--------------3
>>>                            1) Node1 and Node2 are already in a 2-node
>>>                            cluster.
>>>                            2) Now Node3 sends join with ({3}, {})
>>>                            (proc_list/fail_list).
>>>                            3) Node2 sends join ({1,2,3}, {}) and Nodes
>>>                            1/3 update to ({1,2,3}, {}).
>>>                            4) Now Node2 gets consensus after some
>>>                            messages [but 1 is the rep].
>>>                            5) The consensus timeout fires at Node1 for
>>>                            Node3, and Node1 sends a join as ({1,2},
>>>                            {3}).
>>>                            6) Node2 updates to ({1,2}, {3}) because of
>>>                            the above message and sends out a join.
>>>                            This join, received by Node3, causes it to
>>>                            update to ({1,3}, {2}).
>>>                            7) Node1 and Node2 enter operational (fail
>>>                            list cleared by Node2), but Node3's join
>>>                            timeout fires and it is again in the
>>>                            membership state.
>>>                            8) This continues to happen until the
>>>                            consensus timeout fires at Node3 for Node1
>>>                            and it moves to ({3}, {1,2}).
>>>                            9) Now Node1 and Node2 form a 2-node
>>>                            cluster and Node3 forms a single-node
>>>                            cluster.
>>>                            10) Now Node2 broadcasts a normal message.
>>>                            11) This message is received by Node3 as a
>>>                            foreign message, which forces it to go to
>>>                            the gather state.
>>>                            12) The above steps repeat ...
>>>                            The cluster is never stabilizing.
>>>                            I have attached the debugs for Node2:
>>>                            (1 - 10.102.33.115, 2 - 10.102.33.150,
>>>                            3 - 10.102.33.180)
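>>>
>>>                            In case anyone wants to reproduce, debug
>>>                            output like this can be turned on in the
>>>                            logging section of the config (a sketch;
>>>                            the logfile path is just an example):
>>>
>>>                                logging {
>>>                                    debug: on
>>>                                    timestamp: on
>>>                                    to_logfile: yes
>>>                                    logfile: /var/log/corosync.log
>>>                                }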
>>>                            Regards,
>>>                            Ranjith
>>>
>>>                            On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake
>>>                            <sd...@redhat.com> wrote:
>>>
>>>                                On 09/21/2010 11:15 PM, Ranjith wrote:
>>>
>>>                                    Hi all,
>>>                                    Kindly comment on the above behaviour.
>>>                                    Regards,
>>>                                    Ranjith
>>>
>>>                                    On Tue, Sep 21, 2010 at 9:52 PM,
>>>                                    Ranjith <ranjith.nath...@gmail.com>
>>>                                    wrote:
>>>
>>>                                        Hi all,
>>>                                        I was testing the corosync
>>>                                        cluster engine by using the
>>>                                        testcpg exec provided along
>>>                                        with the release. I am getting
>>>                                        the behaviour below while
>>>                                        testing some specific
>>>                                        scenarios. Kindly comment on
>>>                                        the expected behaviour.
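>>>
>>>                                        (For reference, testcpg was
>>>                                        run on each node along the
>>>                                        lines of "./testcpg TESTGROUP",
>>>                                        watching the ConfchgCallback
>>>                                        output for membership changes;
>>>                                        TESTGROUP is just an
>>>                                        illustrative group name.)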
>>>
>>>                                        1) 3-node cluster:
>>>                                           1---------2---------3
>>>                                           a) Suppose I bring nodes 1 & 2
>>>                                              up; they form a ring (1,2).
>>>                                           b) Now bring up 3.
>>>                                           c) 3 sends a join, which
>>>                                              restarts the membership
>>>                                              process.
>>>                                           d) (1,2) again forms the
>>>                                              ring; 3 forms a cluster by
>>>                                              itself.
>>>                                           e) Now 3 sends a join (due to
>>>                                              the join or another
>>>                                              timeout).
>>>                                           f) The membership protocol is
>>>                                              started again, as 2
>>>                                              responds to this by going
>>>                                              to the gather state (I
>>>                                              believe 2 should not
>>>                                              accept this, as 2 would
>>>                                              have earlier decided that
>>>                                              3 had failed).
>>>                                           I am seeing a continuous loop
>>>                                           of the above behaviour
>>>                                           (operational -> membership ->
>>>                                           operational -> ...), due to
>>>                                           which the cluster never
>>>                                           stabilizes.
>>>
>>>                                        2) 3-node cluster:
>>>                                           1---------2-----------3
>>>                                           a) Bring up all three nodes
>>>                                              at the same time (none of
>>>                                              the nodes has seen the
>>>                                              others before this).
>>>                                           b) Now each node forms a
>>>                                              cluster by itself. (Here I
>>>                                              think it should form
>>>                                              either a (1,2) or (2,3)
>>>                                              ring.)
>>>
>>>                                        Regards,
>>>                                        Ranjith
>>>
>>>
>>>
>>>
>>>                                Ranjith,
>>>
>>>                                Which version of corosync are you running?
>>>
>>>                                Can you run corosync-blackbox and
>>>                                attach the output?
>>>
>>>                                Thanks
>>>                                -steve
>>>
>>>
>>>
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
