Alan Conway created QPID-5904:
---------------------------------

             Summary: qpid HA cluster may end up in joining state after HA primary is killed
                 Key: QPID-5904
                 URL: https://issues.apache.org/jira/browse/QPID-5904
             Project: Qpid
          Issue Type: Bug
          Components: C++ Clustering
    Affects Versions: 0.28
            Reporter: Alan Conway
            Assignee: Alan Conway


 Frantisek Reznicek 2014-07-09 08:59:30 EDT

Description of problem:

A qpid HA cluster may end up in the joining state after the HA primary is killed.

Test scenario:
Start with a 3-node qpid HA cluster, all three nodes operational.
Then a sender is executed, sending to a queue (purely transactional, with durable
messages and a durable queue address).
During that process the primary broker is killed multiple times.
After the N'th primary broker kill the cluster is no longer functional, as all
qpid brokers end up in the joining state:

[root@dhcp-lab-216 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining
[root@dhcp-x-216 ~]# clustat
Cluster Status for dtests_ha @ Wed Jul  9 14:38:44 2014
Member Status: Quorate

 Member Name                                   ID   Status
 ------ ----                                   ---- ------
 192.168.6.60                                      1 Online, Local, rgmanager
 192.168.6.61                                      2 Online, rgmanager
 192.168.6.62                                      3 Online, rgmanager

 Service Name                         Owner (Last)                         State
 ------- ----                         ----- ------                         -----
 service:qpidd_1                      192.168.6.60                         started
 service:qpidd_2                      192.168.6.61                         started
 service:qpidd_3                      192.168.6.62                         started
 service:qpidd_primary                (192.168.6.62)                       stopped


[root@dhcp-x-165 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining

[root@dhcp-x-218 ~]# qpid-ha status --all
192.168.6.60:5672 joining
192.168.6.61:5672 joining
192.168.6.62:5672 joining


I believe the key to hitting the issue is to kill the newly promoted primary soon
after it starts appearing in the starting/started state in clustat.
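
For clarity, here is a minimal per-node sketch of that kill timing, watching
qpid-ha instead of clustat for simplicity and assuming qpid-ha prints the
broker's HA state as a single word (e.g. "active" once promoted); the broker
address, polling interval and use of pidof are illustrative, not the exact test
harness:

#!/usr/bin/env python
# Sketch of the failure action: wait until the local broker is promoted to
# primary, then kill -9 it immediately to hit the narrow promotion window.
# Assumes qpid-ha prints the broker's HA state and that qpidd runs locally.
import subprocess
import time

BROKER = "localhost:5672"   # local broker to watch (placeholder)

def ha_state():
    # e.g. "joining", "catchup", "ready", "active" (assumed output format)
    out = subprocess.check_output(["qpid-ha", "status", "-b", BROKER],
                                  universal_newlines=True)
    return out.strip().split()[-1]

while ha_state() != "active":   # not yet promoted to primary
    time.sleep(0.1)             # poll quickly; timing is the key

pid = subprocess.check_output(["pidof", "qpidd"],
                              universal_newlines=True).split()[0]
subprocess.call(["kill", "-9", pid])   # the failure action from this report

Repeating this once per promotion keeps the failure insertion period above the
promotion time, i.e. T2 > T1 as described under Additional info below.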

My current understanding is that with a 3-node cluster, any failure applied to a
single node at a time should be handled by HA. This is what the testing scenario
does:
A    B    C (nodes)
pri  bck  bck
kill 
bck  pri  bck
     kill
bck  bck  pri
          kill
...
pri  bck  bck
kill
bck  bck  bck


It looks to me that there is a short window during promotion of the new primary
in which a kill (of that primary newbie) causes the promotion procedure to get
stuck with all brokers in joining.

I haven't seen such behavior in the past; either we are now more sensitive to
this case (after the -STOP case fixes) or enabling durability sharply raises the
probability.


Version-Release number of selected component (if applicable):
# rpm -qa | grep qpid | sort
perl-qpid-0.22-13.el6.i686
perl-qpid-debuginfo-0.22-13.el6.i686
python-qpid-0.22-15.el6.noarch
python-qpid-proton-doc-0.5-9.el6.noarch
python-qpid-qmf-0.22-33.el6.i686
qpid-cpp-client-0.22-42.el6.i686
qpid-cpp-client-devel-0.22-42.el6.i686
qpid-cpp-client-devel-docs-0.22-42.el6.noarch
qpid-cpp-client-rdma-0.22-42.el6.i686
qpid-cpp-debuginfo-0.22-42.el6.i686
qpid-cpp-server-0.22-42.el6.i686
qpid-cpp-server-devel-0.22-42.el6.i686
qpid-cpp-server-ha-0.22-42.el6.i686
qpid-cpp-server-linearstore-0.22-42.el6.i686
qpid-cpp-server-rdma-0.22-42.el6.i686
qpid-cpp-server-xml-0.22-42.el6.i686
qpid-java-client-0.22-6.el6.noarch
qpid-java-common-0.22-6.el6.noarch
qpid-java-example-0.22-6.el6.noarch
qpid-jca-0.22-2.el6.noarch
qpid-jca-xarecovery-0.22-2.el6.noarch
qpid-jca-zip-0.22-2.el6.noarch
qpid-proton-c-0.7-2.el6.i686
qpid-proton-c-devel-0.7-2.el6.i686
qpid-proton-c-devel-doc-0.5-9.el6.noarch
qpid-proton-debuginfo-0.7-2.el6.i686
qpid-qmf-0.22-33.el6.i686
qpid-qmf-debuginfo-0.22-33.el6.i686
qpid-qmf-devel-0.22-33.el6.i686
qpid-snmpd-1.0.0-16.el6.i686
qpid-snmpd-debuginfo-1.0.0-16.el6.i686
qpid-tests-0.22-15.el6.noarch
qpid-tools-0.22-13.el6.noarch
ruby-qpid-qmf-0.22-33.el6.i686


How reproducible:
Rarely; timing is the key.

Steps to Reproduce:
1. Have a configured 3-node cluster.
2. Start the whole cluster up.
3. Execute a transactional sender to a durable queue address with durable
messages and reconnect enabled (see the sketch after this list).
4. Repeatedly kill the primary broker once it is promoted.
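
As a rough illustration of step 3, here is a minimal sender sketch using the
python-qpid messaging API from the package list above; the broker address,
queue name, message count and transaction batch size are placeholders, and
fail-over/transaction-abort handling is omitted for brevity:

#!/usr/bin/env python
# Minimal transactional, durable sender sketch (python-qpid / qpid.messaging).
from qpid.messaging import Connection, Message

conn = Connection("192.168.6.60:5672", reconnect=True)  # retry across fail-overs
conn.open()
try:
    session = conn.session(transactional=True)
    # Durable queue address: the queue survives broker restarts / fail-overs.
    sender = session.sender("test_queue; {create: always, node: {durable: True}}")
    for i in range(10000):
        sender.send(Message(content="msg-%d" % i, durable=True))  # durable message
        if i % 100 == 99:
            session.commit()      # commit in transactional batches
    session.commit()
finally:
    conn.close()

With reconnect enabled the sender keeps retrying across the primary kills
performed in step 4.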


Actual results:
  After a few kills the cluster ends up non-functional, with all brokers in
joining. A single, isolated failure applied to a newly promoted broker is enough
to bring qpid HA down.

Expected results:
  Qpid HA should tolerate a single failure at a time.

Additional info:
  Details on failure insertion:
    * kill -9 `pidof qpidd` is the failure action
    * Let T1 be the duration between failure insertion and the new primary being
ready to serve
    * The failure insertion period T2 > T1, i.e. no cumulative failures are
inserted while HA is going through the new primary promotion
      -> this fact (in my view) shows that there is a real issue



