[
https://issues.apache.org/jira/browse/QPID-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Conway resolved QPID-5942.
-------------------------------
Resolution: Fixed
> qpid HA cluster may end-up in joining state after HA primary is killed
> ----------------------------------------------------------------------
>
> Key: QPID-5942
> URL: https://issues.apache.org/jira/browse/QPID-5942
> Project: Qpid
> Issue Type: Bug
> Components: C++ Clustering
> Affects Versions: 0.28
> Reporter: Alan Conway
> Assignee: Alan Conway
>
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1117823
> Description of problem:
> qpid HA cluster may end up in the joining state after the HA primary is killed.
> Test scenario:
> Take a 3-node qpid HA cluster with all three nodes operational.
> A sender is then run, sending to a queue (purely transactional, with
> durable messages and a durable queue address).
> During that process the primary broker is killed multiple times.
> After the N-th kill of the primary broker the cluster is no longer functional,
> as all qpid brokers end up in the joining state:
> [root@dhcp-lab-216 ~]# qpid-ha status --all
> 192.168.6.60:5672 joining
> 192.168.6.61:5672 joining
> 192.168.6.62:5672 joining
> [root@dhcp-x-216 ~]# clustat
> Cluster Status for dtests_ha @ Wed Jul 9 14:38:44 2014
> Member Status: Quorate
> Member Name ID Status
> ------ ---- ---- ------
> 192.168.6.60 1 Online, Local, rgmanager
> 192.168.6.61 2 Online, rgmanager
> 192.168.6.62 3 Online, rgmanager
> Service Name            Owner (Last)       State
> ------- ----            ----- ------       -----
> service:qpidd_1         192.168.6.60       started
> service:qpidd_2         192.168.6.61       started
> service:qpidd_3         192.168.6.62       started
> service:qpidd_primary   (192.168.6.62)     stopped
> [root@dhcp-x-165 ~]# qpid-ha status --all
> 192.168.6.60:5672 joining
> 192.168.6.61:5672 joining
> 192.168.6.62:5672 joining
> [root@dhcp-x-218 ~]# qpid-ha status --all
> 192.168.6.60:5672 joining
> 192.168.6.61:5672 joining
> 192.168.6.62:5672 joining
> I believe the key to hitting the issue is to kill the newly promoted primary
> soon after it starts appearing in the starting/started state in clustat.
> My current understanding is that with a 3-node cluster, any failure applied to
> a single node at a time should be handled by HA. This is what the test
> scenario does:
> A B C (nodes)
> pri bck bck
> kill
> bck pri bck
> kill
> bck bck pri
> kill
> ...
> pri bck bck
> kill
> bck bck bck
> It looks to me that there is a short window during promotion of a new primary
> in which killing that newly promoted primary causes the promotion procedure to
> get stuck with all brokers joining.
> I haven't seen such behavior in the past; either we are now more sensitive to
> this case (after the -STOP case fixes), or enabling durability sharply raises
> the probability.
> Version-Release number of selected component (if applicable):
> # rpm -qa | grep qpid | sort
> perl-qpid-0.22-13.el6.i686
> perl-qpid-debuginfo-0.22-13.el6.i686
> python-qpid-0.22-15.el6.noarch
> python-qpid-proton-doc-0.5-9.el6.noarch
> python-qpid-qmf-0.22-33.el6.i686
> qpid-cpp-client-0.22-42.el6.i686
> qpid-cpp-client-devel-0.22-42.el6.i686
> qpid-cpp-client-devel-docs-0.22-42.el6.noarch
> qpid-cpp-client-rdma-0.22-42.el6.i686
> qpid-cpp-debuginfo-0.22-42.el6.i686
> qpid-cpp-server-0.22-42.el6.i686
> qpid-cpp-server-devel-0.22-42.el6.i686
> qpid-cpp-server-ha-0.22-42.el6.i686
> qpid-cpp-server-linearstore-0.22-42.el6.i686
> qpid-cpp-server-rdma-0.22-42.el6.i686
> qpid-cpp-server-xml-0.22-42.el6.i686
> qpid-java-client-0.22-6.el6.noarch
> qpid-java-common-0.22-6.el6.noarch
> qpid-java-example-0.22-6.el6.noarch
> qpid-jca-0.22-2.el6.noarch
> qpid-jca-xarecovery-0.22-2.el6.noarch
> qpid-jca-zip-0.22-2.el6.noarch
> qpid-proton-c-0.7-2.el6.i686
> qpid-proton-c-devel-0.7-2.el6.i686
> qpid-proton-c-devel-doc-0.5-9.el6.noarch
> qpid-proton-debuginfo-0.7-2.el6.i686
> qpid-qmf-0.22-33.el6.i686
> qpid-qmf-debuginfo-0.22-33.el6.i686
> qpid-qmf-devel-0.22-33.el6.i686
> qpid-snmpd-1.0.0-16.el6.i686
> qpid-snmpd-debuginfo-1.0.0-16.el6.i686
> qpid-tests-0.22-15.el6.noarch
> qpid-tools-0.22-13.el6.noarch
> ruby-qpid-qmf-0.22-33.el6.i686
> How reproducible:
> rarely; timing is the key
> Steps to Reproduce:
> 1. have a 3-node cluster configured
> 2. start the whole cluster up
> 3. execute a transactional sender to a durable queue address, with durable
> messages and reconnect enabled (see the sketch after this list)
> 4. repeatedly kill the primary broker once it is promoted
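> A minimal sketch of such a sender, using the python-qpid (qpid.messaging) API.
> The queue name "testq", the message count, and the broker list (reusing the
> addresses above) are illustrative assumptions, not the exact test client:
>
>     #!/usr/bin/env python
>     # Transactional sender: durable messages to a durable queue, reconnect enabled.
>     from qpid.messaging import Connection, Message
>
>     brokers = ["192.168.6.60:5672", "192.168.6.61:5672", "192.168.6.62:5672"]
>     conn = Connection(brokers[0], reconnect=True, reconnect_urls=brokers)
>     conn.open()
>     try:
>         ssn = conn.session(transactional=True)
>         # Durable queue address: create it if missing, make the node durable.
>         snd = ssn.sender("testq; {create: always, node: {durable: True}}")
>         for i in range(100000):
>             snd.send(Message(content="msg-%d" % i, durable=True))
>             ssn.commit()  # one message per transaction
>     finally:
>         conn.close()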
> Actual results:
> After a few kills the cluster ends up non-functional, with all brokers in the
> joining state. That is, qpid HA can be brought down by injecting single,
> isolated failures into newly promoted brokers.
> Expected results:
> Qpid HA should tolerate a single failure at a time.
> Additional info:
> Details on failure injection:
> * kill -9 `pidof qpidd` is the failure action
> * Let T1 be the duration between failure injection and the moment the new
> primary is ready to serve
> * The failure injection period T2 > T1, i.e. no cumulative failures are
> injected while HA is still working through promotion of a new primary
> -> this fact (in my view) shows that there is a real issue
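> For completeness, a sketch of the failure-injection loop that respects T2 > T1.
> It assumes `qpid-ha status -b <node>` prints the broker's HA state and that the
> ready primary reports "active"; the node list, SSH access and the short settle
> delay are illustrative assumptions:
>
>     #!/usr/bin/env python
>     # Kill the current primary only after a new one has been fully promoted,
>     # so failures are single and isolated (no cumulative failures, T2 > T1).
>     import subprocess, time
>
>     nodes = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]
>
>     def ha_status(node):
>         try:
>             out = subprocess.check_output(["qpid-ha", "status", "-b", node + ":5672"])
>             return out.decode().strip()
>         except subprocess.CalledProcessError:
>             return "down"  # broker unreachable, e.g. just killed
>
>     def wait_for_primary():
>         # Wait until exactly one broker reports itself as primary ("active" is assumed).
>         while True:
>             primaries = [n for n in nodes if ha_status(n) == "active"]
>             if len(primaries) == 1:
>                 return primaries[0]
>             time.sleep(1)
>
>     while True:
>         primary = wait_for_primary()   # the wait until readiness is T1
>         time.sleep(2)                  # brief settle delay: kill soon after
>                                        # promotion, never during it (T2 > T1)
>         # kill -9 `pidof qpidd` on the primary node is the failure action
>         subprocess.call(["ssh", primary, "kill -9 `pidof qpidd`"])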