[jira] [Commented] (QPID-7685) [Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock potentially bringing down the broker

Lorenz Quack (JIRA) Tue, 28 Feb 2017 07:22:20 -0800

    [ 
https://issues.apache.org/jira/browse/QPID-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888196#comment-15888196
 ]


Lorenz Quack commented on QPID-7685:
------------------------------------

This can be reproduced by
* setting a conditional (to only stop on one queue) breakpoint BP1 in 
{{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource, 
MessageInstanceHandler)}} before the creation of the cursor. This breakpoint 
must only halt the single thread.
* setting a unconditional breakpoint BP2 in a publishing application after 
sending but before committing the transaction.
* when the broker starts it one recovery thread will hit BP1. Let the rest of 
the broker start up.
* start the publishing application until it hits BP2.
* resume execution of BP1 and watch it time out and the broker crash after 5 to 
10 seconds.


> [Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock 
> potentially bringing down the broker
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: QPID-7685
>                 URL: https://issues.apache.org/jira/browse/QPID-7685
>             Project: Qpid
>          Issue Type: Bug
>    Affects Versions: 0.32, qpid-java-6.0.6, qpid-java-6.1.1, qpid-java-7.0
>            Reporter: Lorenz Quack
>            Priority: Critical
>
> This is the scenario:
> BDB message store.
> AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being 
> recovered.
> An transacted application sends a message to Q1 but does not commit the 
> transaction. This ends up creating an entry in the {{DeliveryDb}} through the 
> following call chain:
> > AMQChannel#deliverCurrentMessageIfComplete
> --> AbstractExchange#send
> ----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage, 
> ServerTransaction.Action)
> ------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
> --------> AbstractBDBMessageStore#enqueueMessage
> ----------> getDeliveryDb().put
> The transaction will hold a write lock on this new DB entry until the 
> transaction is committed.
> If it now happens that this entry appears immediately after the last entry of 
> Q2 in the database the recovery code will try to read that entry in 
> {{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource, 
> MessageInstanceHandler)}}:
> {{cursor.getNext(key, value, LockMode.DEFAULT)}}
> Acquisition of the read lock will timeout after {{je.lock.timeout}} (default: 
> 500ms) and after exhaustion of the retries (hard coded to 5) a 
> {{StoreException}} will be thrown bringing down the broker with the following 
> stack trace:
> {noformat}
> ########################################################################
> #
> # Unhandled Exception 
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread 
> pool-2-thread-1
> #
> # Exiting
> #
> ########################################################################
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.qpid.server.store.StoreException: Cannot visit messages
>       at 
> com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
>       at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
>       at 
> org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
>       at 
> org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
>       at 
> org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
>       at 
> org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
>       at 
> org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
>       at 
> org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
>       at 
> org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       ... 3 more
> Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired. 
> Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited 
> for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c 
> type=READ grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798 
> endTime=1488294680298
> Owners: [<LockInfo locker="879351915 
> 243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
> Waiters: []
>       at 
> com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
>       at 
> com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
>       at 
> com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
>       at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
>       at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
>       at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
>       at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
>       at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
>       at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
>       at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
>       at 
> com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
>       at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
>       at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
>       at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
>       at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
>       at com.sleepycat.je.Cursor.search(Cursor.java:2553)
>       at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
>       at 
> org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
>       ... 8 more
> {noformat}
> I believe the call to {{cursor.getSearchKeyRange(key,value, 
> LockMode.DEFAULT)}} has the same problem just under different circumstances.
> I have not checked whether other similar defects exist.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (QPID-7685) [Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock potentially bringing down the broker

Reply via email to