Alan Conway created QPID-5974:
---------------------------------

             Summary: HA qpid-txtest2 can bring down a cluster (JERR_MAP_LOCKED)
                 Key: QPID-5974
                 URL: https://issues.apache.org/jira/browse/QPID-5974
             Project: Qpid
          Issue Type: Bug
          Components: C++ Clustering
    Affects Versions: 0.28
            Reporter: Alan Conway
            Assignee: Alan Conway


Description of problem:

qpid-txtest2 AMQP0-10 transactional & durable transfer operation can bring down 
whole qpid HA.

It was multiple times seen that qpid-txtest2 transfer operation suddenly failed:

######################### init queues and add messages
[root@mrg-qe-10 clients]# txtest 192.168.6.251 1 0 0 500 10 10 1000 1
######################### start stansfers (no kill)
[root@mrg-qe-10 clients]# txtest 192.168.6.251 0 1 0 500 10 10 1000 1
Transfering from tx-test2-1 to tx-test2-2
Transfering from tx-test2-2 to tx-test2-3
Transfering from tx-test2-3 to tx-test2-4
Transfering from tx-test2-4 to tx-test2-5
Transfering from tx-test2-5 to tx-test2-6
Transfering from tx-test2-6 to tx-test2-7
Transfering from tx-test2-7 to tx-test2-8
Transfering from tx-test2-8 to tx-test2-9
Transfering from tx-test2-9 to tx-test2-10
Transfering from tx-test2-10 to tx-test2-1
Transaction 10 of 500 committed successfully
Transaction 10 of 500 committed successfully
Transaction 10 of 500 committed successfully
Transaction 10 of 500 committed successfully
2014-07-24 08:10:39 [Client] warning Connection 
[192.168.6.1:58590-192.168.6.251:5672] closed
2014-07-24 08:10:39 [Client] warning Connection 
[192.168.6.1:58586-192.168.6.251:5672] closed
2014-07-24 08:10:39 [Client] warning Connection 
[192.168.6.1:58584-192.168.6.251:5672] closed
2014-07-24 08:10:39 [Unspecified] error Transfer tx-test2-3 to tx-test2-2 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:39 [Unspecified] error Transfer tx-test2-7 to tx-test2-6 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58587-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-4 to tx-test2-3 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58589-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-6 to tx-test2-5 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58592-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-9 to tx-test2-8 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58594-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-1 to tx-test2-10 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58593-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-10 to tx-test2-9 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58585-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-2 to tx-test2-1 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58591-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-8 to tx-test2-7 
interrupted: Failed to connect (reconnect disabled)
2014-07-24 08:10:40 [Client] warning Connection 
[192.168.6.1:58588-192.168.6.251:5672] closed
2014-07-24 08:10:40 [Unspecified] error Transfer tx-test2-5 to tx-test2-4 
interrupted: Failed to connect (reconnect disabled)
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Transaction aborted due to transport failure
Failed to connect (reconnect disabled)
Transfer agents failed
[root@mrg-qe-10 clients]# 
[root@mrg-qe-10 clients]# echo $?
2

Surprisingly although no failure was inserted HA qpid cluster ends up in 
following state:

[root@dhcp-lab-A ~]# clustat
Cluster Status for dtests_ha @ Thu Jul 24 14:17:57 2014
Member Status: Quorate

 Member Name                                      ID   Status
 ------ ----                                      ---- ------
 192.168.6.246                                        1 Online, Local, rgmanager
 192.168.6.247                                        2 Online, rgmanager
 192.168.6.248                                        3 Online, rgmanager

 Service Name                            Owner (Last)                           
 State         
 ------- ----                            ----- ------                           
 -----         
 service:qpidd_1                         192.168.6.246                          
 started       
 service:qpidd_2                         192.168.6.247                          
 started       
 service:qpidd_3                         192.168.6.248                          
 started       
 service:qpidd_primary                   (192.168.6.248)                        
 stopped       
[root@dhcp-lab-A ~]# pidof qpidd
26137
[root@dhcp-lab-A ~]# qpid-ha status --all
192.168.6.246:5672 joining
192.168.6.247:5672 joining
192.168.6.248:5672 joining

so brokers are up all in joining, HA not available.


Version-Release number of selected component (if applicable):


How reproducible:
10-50% (to be precised)

Steps to Reproduce:
1. start-up the 3-node HA cluster with virt-ip
2. init the queues and data
   txtest 192.168.6.251 1 0 0 500 10 10 1000 1
3. transfer among queues
   txtest 192.168.6.251 0 1 0 500 10 10 1000 1


Actual results:
  qpid-txtest2 failing.

Expected results:
  qpid-txtest2 not failing.

Additional info:

function txtest()
{
  ./qpid_txtest2 -b $1 --init $2 --transfer $3 --check $4 \
                 --tx-count $5 --queues $6 --messages-per-tx $7 
--total-messages $8 --durable $9
}

qpidd.log inspection highlighted unexpected linearstore critical exceptions:

[root@dhcp-lab-A ~]# grep -E 'error|critical' ~qpidd/qpidd.log
2014-07-24 14:10:33 [Protocol] error Connection 
qpid.192.168.6.246:5672-192.168.6.247:34210 timed out: closing


[root@dhcp-lab-B ~]# grep -E 'error|critical' ~qpidd/qpidd.log
2014-07-24 14:10:23 [HA] critical Shutting down: Backup of tx-test2-1: 
Replication failed: Queue tx-test2-1: async_dequeue() failed: jexception 0x0b02 
wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked by a pending 
transaction. (drid=0x6da3) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)
2014-07-24 14:10:23 [Protocol] error Connection 
qpid.ha.link.09e80392-0c79-4239-a1d0-ea5b53c71bd9 closed by error: Backup of 
tx-test2-1: Replication failed: Queue tx-test2-1: async_dequeue() failed: 
jexception 0x0b02 wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked 
by a pending transaction. (drid=0x6da3) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)(501)
2014-07-24 14:10:24 [Broker] error Could not find dequeued message on commit
2014-07-24 14:10:24 [HA] error Backup of transaction 00648954: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 2f556197: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 5bd58ffe: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 5d34703c: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 7e93a7ea: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction e8856f6f: Destroyed 
prematurely, rollback
2014-07-24 14:10:38 [HA] critical Shutting down: Backup of tx-test2-1: 
Replication failed: Queue tx-test2-1: async_dequeue() failed: jexception 0x0b02 
wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked by a pending 
transaction. (drid=0x7a42) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)
2014-07-24 14:10:38 [Protocol] error Connection 
qpid.ha.link.0fc6bd3c-48c2-4b27-9db3-2742b3ddc835 closed by error: Backup of 
tx-test2-1: Replication failed: Queue tx-test2-1: async_dequeue() failed: 
jexception 0x0b02 wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked 
by a pending transaction. (drid=0x7a42) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)(501)
2014-07-24 14:10:38 [Broker] error Could not find dequeued message on commit
2014-07-24 14:10:38 [HA] critical Shutting down: Backup of tx-test2-10: 
Replication failed: Queue tx-test2-10: async_dequeue() failed: jexception 
0x0b02 wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked by a 
pending transaction. (drid=0x7a43) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)
2014-07-24 14:10:38 [Protocol] error Connection 
qpid.ha.link.0fc6bd3c-48c2-4b27-9db3-2742b3ddc835 closed by error: Backup of 
tx-test2-10: Replication failed: Queue tx-test2-10: async_dequeue() failed: 
jexception 0x0b02 wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked 
by a pending transaction. (drid=0x7a43) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)(501)
2014-07-24 14:10:38 [Broker] error Could not find dequeued message on commit
2014-07-24 14:10:40 [HA] critical Shutting down: Backup of tx-test2-7: 
Replication failed: Queue tx-test2-7: async_dequeue() failed: jexception 0x0b02 
wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked by a pending 
transaction. (drid=0x7a49) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)
2014-07-24 14:10:40 [Protocol] error Connection 
qpid.ha.link.0fc6bd3c-48c2-4b27-9db3-2742b3ddc835 closed by error: Backup of 
tx-test2-7: Replication failed: Queue tx-test2-7: async_dequeue() failed: 
jexception 0x0b02 wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked 
by a pending transaction. (drid=0x7a49) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)(501)
2014-07-24 14:10:40 [Broker] error Could not find dequeued message on commit
2014-07-24 14:11:10 [HA] error Backup: Joining active cluster, cannot be 
promoted.


[root@dhcp-lab-C ~]# grep -E 'error|critical' ~qpidd/qpidd.log
2014-07-24 14:10:23 [HA] critical Shutting down: Backup of tx-test2-1: 
Replication failed: Queue tx-test2-1: async_dequeue() failed: jexception 0x0b02 
wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked by a pending 
transaction. (drid=0x53a3) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)
2014-07-24 14:10:23 [Protocol] error Connection 
qpid.ha.link.1bb57f0a-48db-460c-9260-0f5b353e4bd1 closed by error: Backup of 
tx-test2-1: Replication failed: Queue tx-test2-1: async_dequeue() failed: 
jexception 0x0b02 wmgr::dequeue_check() threw JERR_MAP_LOCKED: Record ID locked 
by a pending transaction. (drid=0x53a3) 
(/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1268)
 (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/ha/QueueReplicator.cpp:315)(501)
2014-07-24 14:10:24 [Broker] error Could not find dequeued message on commit
2014-07-24 14:10:24 [HA] error Backup of transaction 00648954: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 2f556197: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 5bd58ffe: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 5d34703c: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction 7e93a7ea: Destroyed 
prematurely, rollback
2014-07-24 14:10:24 [HA] error Backup of transaction e8856f6f: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction 243b4279: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction 4f4a25df: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction 80cbe9af: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction a3ed917a: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction b7a4b9a0: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction b9ba9995: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction cbd0d6bf: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction e127288a: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction eb43e683: Destroyed 
prematurely, rollback
2014-07-24 14:10:35 [HA] error Backup of transaction f29196c1: Destroyed 
prematurely, rollback
2014-07-24 14:10:53 [HA] error Backup: Still catching up, cannot be promoted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to