Alan Conway created QPID-5904:
---------------------------------
Summary: qpid HA cluster may end up in joining state after HA primary is killed
Key: QPID-5904
URL: https://issues.apache.org/jira/browse/QPID-5904
Project: Qpid
Issue Type: Bug
Components: C++ Clustering
Affects Versions: 0.28
Reporter: Alan Conway
Assignee: Alan Conway
Originally reported by Frantisek Reznicek, 2014-07-09 08:59:30 EDT:
Description of problem:
qpid HA cluster may end up in the joining state after the HA primary is killed.
Test scenario:
Start with a 3 node qpid HA cluster, all three nodes operational.
A sender is then run against a durable queue address, sending durable messages in transactions and reconnecting on failover.
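For reference, a minimal sketch of such a sender using the python-qpid (qpid.messaging) API; the broker addresses, queue name and message count below are illustrative assumptions, not the exact test harness:

# Minimal transactional durable sender sketch (python-qpid / qpid.messaging).
# Broker addresses, queue name and message count are illustrative assumptions.
from qpid.messaging import Connection, Message

conn = Connection("192.168.6.60:5672",
                  reconnect=True,
                  reconnect_urls=["192.168.6.60:5672",
                                  "192.168.6.61:5672",
                                  "192.168.6.62:5672"])
conn.open()
try:
    ssn = conn.session(transactional=True)
    # Durable queue address: the queue is created durable if it does not exist.
    snd = ssn.sender("test_queue; {create: always, node: {durable: True}}")
    for i in range(10000):
        snd.send(Message(content="msg-%d" % i, durable=True))
        ssn.commit()  # one message per transaction
finally:
    conn.close()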
While the sender is running, the primary broker is killed multiple times.
After the N'th primary kill the cluster is no longer functional: all qpid brokers end up in the joining state:
[root@dhcp-lab-216 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
[root@dhcp-x-216 ~]# clustat
Cluster Status for dtests_ha @ Wed Jul 9 14:38:44 2014
Member Status: Quorate
 Member Name                          ID   Status
 ------ ----                          ---- ------
 192.168.6.60                         1    Online, Local, rgmanager
 192.168.6.61                         2    Online, rgmanager
 192.168.6.62                         3    Online, rgmanager

 Service Name               Owner (Last)               State
 ------- ----               ----- ------               -----
 service:qpidd_1            192.168.6.60               started
 service:qpidd_2            192.168.6.61               started
 service:qpidd_3            192.168.6.62               started
 service:qpidd_primary      (192.168.6.62)             stopped
[root@dhcp-x-165 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
[root@dhcp-x-218 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
I believe the key to hitting the issue is to kill the newly promoted primary soon
after it starts appearing in the starting/started state in clustat.
My current understanding is that with a 3 node cluster, HA should tolerate
failures applied to a single node at a time. This is what the testing
scenario does:
A B C (nodes)
pri bck bck
kill
bck pri bck
kill
bck bck pri
kill
...
pri bck bck
kill
bck bck bck
It looks to me that there is a short window while the new primary is being
promoted during which killing that newly promoted primary leaves the promotion
procedure stuck with all brokers joining.
I haven't seen such behavior in the past; either we are now more sensitive to
this case (after the -STOP case fixes) or turning durability on sharply raises
the probability.
Version-Release number of selected component (if applicable):
# rpm -qa | grep qpid | sort
perl-qpid-0.22-13.el6.i686
perl-qpid-debuginfo-0.22-13.el6.i686
python-qpid-0.22-15.el6.noarch
python-qpid-proton-doc-0.5-9.el6.noarch
python-qpid-qmf-0.22-33.el6.i686
qpid-cpp-client-0.22-42.el6.i686
qpid-cpp-client-devel-0.22-42.el6.i686
qpid-cpp-client-devel-docs-0.22-42.el6.noarch
qpid-cpp-client-rdma-0.22-42.el6.i686
qpid-cpp-debuginfo-0.22-42.el6.i686
qpid-cpp-server-0.22-42.el6.i686
qpid-cpp-server-devel-0.22-42.el6.i686
qpid-cpp-server-ha-0.22-42.el6.i686
qpid-cpp-server-linearstore-0.22-42.el6.i686
qpid-cpp-server-rdma-0.22-42.el6.i686
qpid-cpp-server-xml-0.22-42.el6.i686
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-2.el6.i686
qpid-proton-c-devel-0.7-2.el6.i686
qpid-proton-c-devel-doc-0.5-9.el6.noarch
qpid-proton-debuginfo-0.7-2.el6.i686
qpid-qmf-0.22-33.el6.i686
qpid-qmf-debuginfo-0.22-33.el6.i686
qpid-qmf-devel-0.22-33.el6.i686
qpid-snmpd-1.0.0-16.el6.i686
qpid-snmpd-debuginfo-1.0.0-16.el6.i686
qpid-tests-0.22-15.el6.noarch
qpid-tools-0.22-13.el6.noarch
ruby-qpid-qmf-0.22-33.el6.i686
How reproducible:
Rarely; timing is the key.
Steps to Reproduce:
1. Configure a 3 node cluster.
2. Start the whole cluster up.
3. Run a transactional sender against a durable queue address with durable messages and reconnect enabled.
4. Repeatedly kill the primary broker once it is promoted (see the sketch after this list).
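A rough sketch of the failure insertion loop of step 4, meant to run locally on each node; it assumes a promoted primary reports itself as "active" in qpid-ha status and that the sleep at the end keeps the insertion period T2 larger than the promotion time T1 (the status word, poll interval and T2 value are assumptions, not the exact harness):

# Failure insertion sketch: kill the local qpidd as soon as it is promoted
# to primary, then wait T2 (> T1) before allowing the next kill.
# Run one copy on each cluster node; rgmanager restarts the killed broker.
# The "active" status word, poll interval and T2 value are assumptions.
import subprocess
import time

POLL = 1   # seconds between status polls
T2 = 60    # failure insertion period, chosen so that T2 > T1

def local_status():
    try:
        out = subprocess.check_output(["qpid-ha", "status"],
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        return "unknown"                       # broker down or restarting
    parts = out.split()
    return parts[-1] if parts else "unknown"   # e.g. "joining", "ready", "active"

while True:
    if local_status() == "active":             # this node was just promoted
        subprocess.call("kill -9 `pidof qpidd`", shell=True)
        time.sleep(T2)                         # no cumulative failures: wait > T1
    else:
        time.sleep(POLL)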
Actual results:
After a few kills the cluster ends up non-functional with all brokers in the
joining state. A single isolated failure injected into a newly promoted broker
is thus enough to bring qpid HA down.
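For completeness, a small check along these lines (parsing the same qpid-ha status --all output quoted above; the broker address passed with -b is an illustrative assumption) flags the stuck all-joining state:

# Detect the failure signature: every broker in the cluster reports "joining".
# The broker address given with -b is an illustrative assumption.
import subprocess

out = subprocess.check_output(
    ["qpid-ha", "status", "--all", "-b", "192.168.6.60:5672"],
    universal_newlines=True)

# Each line looks like "192.168.6.60:5672 joining"
states = [line.split()[-1] for line in out.splitlines() if line.strip()]
if states and all(s == "joining" for s in states):
    print("cluster stuck: all %d brokers are joining" % len(states))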
Expected results:
Qpid HA should tolerate a single failure at a time.
Additional info:
Details on failure insertion:
* kill -9 `pidof qpidd` is the failure action.
* Let T1 be the duration from failure insertion until the new primary is ready
to serve.
* The failure insertion period T2 > T1, i.e. no cumulative failures are
inserted while HA is still working through the new primary promotion.
-> This fact (in my view) shows that there is a real issue.