Lorenz Quack created QPID-7685:
----------------------------------

             Summary: [Java Broker, BDB] AsyncRecovery and Queue#enqueue can 
contend for a BDB Lock potentially bringing down the broker
                 Key: QPID-7685
                 URL: https://issues.apache.org/jira/browse/QPID-7685
             Project: Qpid
          Issue Type: Bug
    Affects Versions: qpid-java-6.1.1, qpid-java-6.0.6, 0.32, qpid-java-7.0
            Reporter: Lorenz Quack
            Priority: Critical


This is the scenario:
BDB message store.
AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being recovered.
An transacted application sends a message to Q1 but does not commit the 
transaction. This ends up creating an entry in the {{DeliveryDb}} through the 
following call chain:
> AMQChannel#deliverCurrentMessageIfComplete
--> AbstractExchange#send
----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage, 
ServerTransaction.Action)
------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
--------> AbstractBDBMessageStore#enqueueMessage
----------> getDeliveryDb().put
The transaction will hold a write lock on this new DB entry until the 
transaction is committed.
If it now happens that this entry appears immediately after the last entry of 
Q2 in the database the recovery code will try to read that entry in 
{{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource, 
MessageInstanceHandler)}}:
{{cursor.getNext(key, value, LockMode.DEFAULT)}}
Acquisition of the read lock will timeout after {{je.lock.timeout}} (default: 
500ms) and after exhaustion of the retries (hard coded to 5) a 
{{StoreException}} will be thrown bringing down the broker with the following 
stack trace:
{noformat}
########################################################################
#
# Unhandled Exception 
com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread 
pool-2-thread-1
#
# Exiting
#
########################################################################
com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.qpid.server.store.StoreException: Cannot visit messages
        at 
com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
        at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
        at 
org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
        at 
org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
        at 
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
        at 
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
        at 
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
        at 
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
        at 
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        ... 3 more
Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired. 
Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited 
for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c type=READ 
grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798 endTime=1488294680298
Owners: [<LockInfo locker="879351915 
243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
Waiters: []

        at 
com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
        at 
com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
        at 
com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
        at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
        at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
        at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
        at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
        at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
        at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
        at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
        at 
com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
        at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
        at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
        at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
        at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
        at com.sleepycat.je.Cursor.search(Cursor.java:2553)
        at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
        at 
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
        ... 8 more
{noformat}

I believe the call to {{cursor.getSearchKeyRange(key,value, LockMode.DEFAULT)}} 
has the same problem just under different circumstances.

I have not checked whether other similar defects exist.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to