[
https://issues.apache.org/jira/browse/QPID-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lorenz Quack updated QPID-7685:
-------------------------------
Description:
This is the scenario:
BDB message store.
AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being recovered.
An transacted application sends a message to Q1 but does not commit the
transaction. This ends up creating an entry in the {{DeliveryDb}} through the
following call chain:
> AMQChannel#deliverCurrentMessageIfComplete
--> AbstractExchange#send
----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage,
ServerTransaction.Action)
------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
--------> AbstractBDBMessageStore#enqueueMessage
----------> getDeliveryDb().put
The transaction will hold a write lock on this new DB entry until the
transaction is committed.
If it now happens that this entry appears immediately after the last entry of
Q2 in the database the recovery code will try to read that entry in
{{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource,
MessageInstanceHandler)}}:
{{cursor.getNext(key, value, LockMode.DEFAULT)}}
Acquisition of the read lock will timeout after {{je.lock.timeout}} (default:
500ms) and after exhaustion of the retries (hard coded to 5) a
{{StoreException}} will be thrown bringing down the broker with the following
stack trace (from a broker derived from 0.32):
{noformat}
########################################################################
#
# Unhandled Exception
com.google.common.util.concurrent.UncheckedExecutionException:
org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread
pool-2-thread-1
#
# Exiting
#
########################################################################
com.google.common.util.concurrent.UncheckedExecutionException:
org.apache.qpid.server.store.StoreException: Cannot visit messages
at
com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
at
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
at
org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
at
org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
at
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
at
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
at
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
at
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
at
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired.
Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited
for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c type=READ
grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798 endTime=1488294680298
Owners: [<LockInfo locker="879351915
243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
Waiters: []
at
com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
at
com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
at
com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
at
com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
at com.sleepycat.je.Cursor.search(Cursor.java:2553)
at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
at
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
... 8 more
{noformat}
I believe the call to {{cursor.getSearchKeyRange(key,value, LockMode.DEFAULT)}}
has the same problem just under different circumstances.
I have not checked whether other similar defects exist.
was:
This is the scenario:
BDB message store.
AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being recovered.
An transacted application sends a message to Q1 but does not commit the
transaction. This ends up creating an entry in the {{DeliveryDb}} through the
following call chain:
> AMQChannel#deliverCurrentMessageIfComplete
--> AbstractExchange#send
----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage,
ServerTransaction.Action)
------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
--------> AbstractBDBMessageStore#enqueueMessage
----------> getDeliveryDb().put
The transaction will hold a write lock on this new DB entry until the
transaction is committed.
If it now happens that this entry appears immediately after the last entry of
Q2 in the database the recovery code will try to read that entry in
{{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource,
MessageInstanceHandler)}}:
{{cursor.getNext(key, value, LockMode.DEFAULT)}}
Acquisition of the read lock will timeout after {{je.lock.timeout}} (default:
500ms) and after exhaustion of the retries (hard coded to 5) a
{{StoreException}} will be thrown bringing down the broker with the following
stack trace:
{noformat}
########################################################################
#
# Unhandled Exception
com.google.common.util.concurrent.UncheckedExecutionException:
org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread
pool-2-thread-1
#
# Exiting
#
########################################################################
com.google.common.util.concurrent.UncheckedExecutionException:
org.apache.qpid.server.store.StoreException: Cannot visit messages
at
com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
at
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
at
org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
at
org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
at
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
at
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
at
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
at
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
at
org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired.
Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited
for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c type=READ
grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798 endTime=1488294680298
Owners: [<LockInfo locker="879351915
243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
Waiters: []
at
com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
at
com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
at
com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
at
com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
at com.sleepycat.je.Cursor.search(Cursor.java:2553)
at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
at
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
... 8 more
{noformat}
I believe the call to {{cursor.getSearchKeyRange(key,value, LockMode.DEFAULT)}}
has the same problem just under different circumstances.
I have not checked whether other similar defects exist.
> [Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock
> potentially bringing down the broker
> ------------------------------------------------------------------------------------------------------------------
>
> Key: QPID-7685
> URL: https://issues.apache.org/jira/browse/QPID-7685
> Project: Qpid
> Issue Type: Bug
> Affects Versions: 0.32, qpid-java-6.0.6, qpid-java-6.1.1, qpid-java-7.0
> Reporter: Lorenz Quack
> Priority: Critical
>
> This is the scenario:
> BDB message store.
> AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being
> recovered.
> An transacted application sends a message to Q1 but does not commit the
> transaction. This ends up creating an entry in the {{DeliveryDb}} through the
> following call chain:
> > AMQChannel#deliverCurrentMessageIfComplete
> --> AbstractExchange#send
> ----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage,
> ServerTransaction.Action)
> ------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
> --------> AbstractBDBMessageStore#enqueueMessage
> ----------> getDeliveryDb().put
> The transaction will hold a write lock on this new DB entry until the
> transaction is committed.
> If it now happens that this entry appears immediately after the last entry of
> Q2 in the database the recovery code will try to read that entry in
> {{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource,
> MessageInstanceHandler)}}:
> {{cursor.getNext(key, value, LockMode.DEFAULT)}}
> Acquisition of the read lock will timeout after {{je.lock.timeout}} (default:
> 500ms) and after exhaustion of the retries (hard coded to 5) a
> {{StoreException}} will be thrown bringing down the broker with the following
> stack trace (from a broker derived from 0.32):
> {noformat}
> ########################################################################
> #
> # Unhandled Exception
> com.google.common.util.concurrent.UncheckedExecutionException:
> org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread
> pool-2-thread-1
> #
> # Exiting
> #
> ########################################################################
> com.google.common.util.concurrent.UncheckedExecutionException:
> org.apache.qpid.server.store.StoreException: Cannot visit messages
> at
> com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
> at
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
> at
> org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
> at
> org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
> at
> org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
> at
> org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
> at
> org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
> at
> org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
> at
> org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> ... 3 more
> Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired.
> Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited
> for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c
> type=READ grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798
> endTime=1488294680298
> Owners: [<LockInfo locker="879351915
> 243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
> Waiters: []
> at
> com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
> at
> com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
> at
> com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
> at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
> at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
> at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
> at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
> at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
> at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
> at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
> at
> com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
> at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
> at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
> at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
> at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
> at com.sleepycat.je.Cursor.search(Cursor.java:2553)
> at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
> at
> org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
> ... 8 more
> {noformat}
> I believe the call to {{cursor.getSearchKeyRange(key,value,
> LockMode.DEFAULT)}} has the same problem just under different circumstances.
> I have not checked whether other similar defects exist.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]