[ https://issues.apache.org/jira/browse/QPID-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway resolved QPID-5942.
-------------------------------

    Resolution: Fixed

> qpid HA cluster may end up in joining state after HA primary is killed
> ----------------------------------------------------------------------
>
>                 Key: QPID-5942
>                 URL: https://issues.apache.org/jira/browse/QPID-5942
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.28
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1117823
> Description of problem:
> A qpid HA cluster may end up in the joining state after the HA primary is 
> killed.
> Test scenario:
> Start with a 3-node qpid HA cluster, all three nodes operational.
> Then a sender is run, sending to a queue (purely transactional, with 
> durable messages and a durable queue address). During that process the 
> primary broker is killed multiple times.
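> For illustration, a minimal sketch of such a sender using the python-qpid 
> qpid.messaging API; the queue name, message count, and per-message commit 
> policy are assumptions of the sketch, not taken from the test:
>
>     # Hypothetical sender: transactional, durable messages to a durable
>     # queue, with reconnect enabled so it rides out broker failovers.
>     from qpid.messaging import Connection, Message
>
>     conn = Connection("192.168.6.60:5672",
>                       reconnect=True,
>                       reconnect_urls=["192.168.6.60:5672",
>                                       "192.168.6.61:5672",
>                                       "192.168.6.62:5672"])
>     conn.open()
>     ssn = conn.session(transactional=True)
>     snd = ssn.sender("testq; {create: always, node: {durable: True}}")
>     for i in range(10000):
>         snd.send(Message(content="msg-%i" % i, durable=True))
>         ssn.commit()  # one message per transaction
>     conn.close()
>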
> After the N'th primary broker kill the cluster is no longer functional, 
> as all qpid brokers end up in the joining state:
> [root@dhcp-lab-216 ~]# qpid-ha status --all
> 192.168.6.60:5672 joining
> 192.168.6.61:5672 joining
> 192.168.6.62:5672 joining
> [root@dhcp-x-216 ~]# clustat
> Cluster Status for dtests_ha @ Wed Jul  9 14:38:44 2014
> Member Status: Quorate
>  Member Name                                   ID   Status
>  ------ ----                                   ---- ------
>  192.168.6.60                                      1 Online, Local, rgmanager
>  192.168.6.61                                      2 Online, rgmanager
>  192.168.6.62                                      3 Online, rgmanager
>  Service Name                         Owner (Last)                         State
>  ------- ----                         ----- ------                         -----
>  service:qpidd_1                      192.168.6.60                         started
>  service:qpidd_2                      192.168.6.61                         started
>  service:qpidd_3                      192.168.6.62                         started
>  service:qpidd_primary                (192.168.6.62)                       stopped
> [root@dhcp-x-165 ~]# qpid-ha status --all
> 192.168.6.60:5672 joining
> 192.168.6.61:5672 joining
> 192.168.6.62:5672 joining
> [root@dhcp-x-218 ~]# qpid-ha status --all
> 192.168.6.60:5672 joining
> 192.168.6.61:5672 joining
> 192.168.6.62:5672 joining
> I believe the key to hitting the issue is to kill the newly promoted 
> primary soon after it starts appearing in the starting/started state in 
> clustat.
> My current understanding is that in a 3-node cluster, failures applied to 
> a single node at a time should be handled by HA. This is what the testing 
> scenario does (a sketch of the kill loop follows the diagram):
> A    B    C (nodes)
> pri  bck  bck
> kill 
> bck  pri  bck
>      kill
> bck  bck  pri
>           kill
> ...
> pri  bck  bck
> kill
> bck  bck  bck
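> A rough sketch of this failure-insertion loop; passwordless ssh to the 
> nodes and the use of 'qpid-ha status --all' to detect the active broker 
> are assumptions of the sketch (the node addresses are those from the 
> status output above):
>
>     # Hypothetical kill loop: wait out each promotion (duration T1), then
>     # kill the new primary, so the insertion period T2 stays above T1.
>     import subprocess
>     import time
>
>     NODES = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]
>
>     def find_primary():
>         # Ask each live node for cluster status; the primary reports
>         # itself as 'active', backups as 'joining'/'catchup'/'ready'.
>         for node in NODES:
>             p = subprocess.Popen(
>                 ["qpid-ha", "status", "--all", "-b", node + ":5672"],
>                 stdout=subprocess.PIPE)
>             out, _ = p.communicate()
>             if p.returncode != 0:
>                 continue  # this node's broker is down
>             for line in out.decode().splitlines():
>                 addr, _, status = line.partition(" ")
>                 if status.strip() == "active":
>                     return addr.split(":")[0]
>         return None
>
>     while True:
>         primary = None
>         while primary is None:  # wait for the new primary (duration T1),
>             time.sleep(1)       # so failures are never inserted back to back
>             primary = find_primary()
>         # kill the newly promoted primary soon after it becomes ready
>         subprocess.call(["ssh", primary, "kill -9 `pidof qpidd`"])
>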
> It looks to me that there is a short window during the promotion of a new 
> primary in which killing that newly promoted broker causes the promotion 
> procedure to get stuck with all brokers joining.
> I haven't seen such behavior in the past; either we are now more 
> sensitive to this case (after the -STOP case fixes), or turning on 
> durability sharply raises the probability.
> Version-Release number of selected component (if applicable):
> # rpm -qa | grep qpid | sort
> perl-qpid-0.22-13.el6.i686
> perl-qpid-debuginfo-0.22-13.el6.i686
> python-qpid-0.22-15.el6.noarch
> python-qpid-proton-doc-0.5-9.el6.noarch
> python-qpid-qmf-0.22-33.el6.i686
> qpid-cpp-client-0.22-42.el6.i686
> qpid-cpp-client-devel-0.22-42.el6.i686
> qpid-cpp-client-devel-docs-0.22-42.el6.noarch
> qpid-cpp-client-rdma-0.22-42.el6.i686
> qpid-cpp-debuginfo-0.22-42.el6.i686
> qpid-cpp-server-0.22-42.el6.i686
> qpid-cpp-server-devel-0.22-42.el6.i686
> qpid-cpp-server-ha-0.22-42.el6.i686
> qpid-cpp-server-linearstore-0.22-42.el6.i686
> qpid-cpp-server-rdma-0.22-42.el6.i686
> qpid-cpp-server-xml-0.22-42.el6.i686
> qpid-java-client-0.22-6.el6.noarch
> qpid-java-common-0.22-6.el6.noarch
> qpid-java-example-0.22-6.el6.noarch
> qpid-jca-0.22-2.el6.noarch
> qpid-jca-xarecovery-0.22-2.el6.noarch
> qpid-jca-zip-0.22-2.el6.noarch
> qpid-proton-c-0.7-2.el6.i686
> qpid-proton-c-devel-0.7-2.el6.i686
> qpid-proton-c-devel-doc-0.5-9.el6.noarch
> qpid-proton-debuginfo-0.7-2.el6.i686
> qpid-qmf-0.22-33.el6.i686
> qpid-qmf-debuginfo-0.22-33.el6.i686
> qpid-qmf-devel-0.22-33.el6.i686
> qpid-snmpd-1.0.0-16.el6.i686
> qpid-snmpd-debuginfo-1.0.0-16.el6.i686
> qpid-tests-0.22-15.el6.noarch
> qpid-tools-0.22-13.el6.noarch
> ruby-qpid-qmf-0.22-33.el6.i686
> How reproducible:
> Rarely; timing is the key.
> Steps to Reproduce:
> 1. Configure a 3-node cluster.
> 2. Start the whole cluster up.
> 3. Run a transactional sender to a durable queue address with durable 
> messages and reconnect enabled (see the sender sketch above).
> 4. Repeatedly kill the primary broker once it is promoted (see the kill 
> loop sketch above).
> Actual results:
>   After a few kills the cluster ends up non-functional, with all brokers 
> in the joining state. It is possible to bring qpid HA down by inserting 
> single, isolated failures into newly promoted brokers.
> Expected results:
>   Qpid HA should tolerate a single failure at a time.
> Additional info:
>   Details on failure insertion:
>     * kill -9 `pidof qpidd` is the failure action
>     * Let T1 be the duration between a failure insertion and the new 
> primary being ready to serve.
>     * The failure insertion period T2 satisfies T2 > T1, i.e. no 
> cumulative failures are inserted while HA is working through a new 
> primary promotion.
>       -> this fact (in my view) shows that there is a real issue


