[
https://issues.apache.org/jira/browse/AMQ-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Bish updated AMQ-6005:
------------------------------
Fix Version/s: 5.13.1
> Slave broker startup corrupts shared PList storage
> --------------------------------------------------
>
> Key: AMQ-6005
> URL: https://issues.apache.org/jira/browse/AMQ-6005
> Project: ActiveMQ
> Issue Type: Bug
> Components: KahaDB
> Affects Versions: 5.7.0, 5.10.0
> Environment: RHLinux6
> Reporter: Volker Kleinschmidt
> Assignee: Gary Tully
> Fix For: 5.13.1, 5.14.0
>
>
> h4. Background
> When multiple JVMs run AMQ in a master/slave configuration with the broker
> data directory on a shared filesystem (as is required e.g. for shared KahaDB
> persistence), and when high message volume or slow consumers push the broker's
> memory needs past the configured memory usage limit, AMQ overflows pending
> messages to a PList store inside the "tmp_storage" subdirectory of that shared
> broker directory.
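> For illustration, a minimal sketch of such a shared-store setup, assuming the
> standard BrokerService API (the path and memory limit below are placeholders,
> not values taken from this report):
> {noformat}
> import org.apache.activemq.broker.BrokerService;
>
> public class SharedStoreBrokerSketch {
>     public static void main(String[] args) throws Exception {
>         BrokerService broker = new BrokerService();
>         broker.setBrokerName("sharedBroker");
>         // Shared filesystem location; master and slave JVMs both point here.
>         // tmp_storage is created as a subdirectory of this data directory.
>         broker.setDataDirectory("/mnt/nfs/activemq-data");
>         // Once the memory limit is exceeded, pending messages spill over to
>         // the PList store under tmp_storage.
>         broker.getSystemUsage().getMemoryUsage().setLimit(64L * 1024 * 1024);
>         broker.start();
>         broker.waitUntilStopped();
>     }
> }
> {noformat}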
> h4. Issue
> We frequently observed this tmpDB PList store getting corrupted with "stale
> NFS file handle" errors for tmpDB.data, tmpDB.redo, and some journal files,
> all of which suddenly went missing from the tmp_storage folder. This puts the
> entire broker into a bad state from which it cannot recover. Only restarting
> the service (which causes a slave broker to take over and loses the
> not-yet-delivered messages) restores a working state.
> h4. Symptoms
> Stack trace:
> {noformat}
> ...
> Caused by: java.io.IOException: Stale file handle
> at java.io.RandomAccessFile.readBytes0(Native Method)
> at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
> at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
> at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
> at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
> at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
> at org.apache.kahadb.page.Transaction$2.readPage(Transaction.java:446)
> at org.apache.kahadb.page.Transaction$2.<init>(Transaction.java:437)
> at org.apache.kahadb.page.Transaction.openInputStream(Transaction.java:434)
> at org.apache.kahadb.page.Transaction.load(Transaction.java:410)
> at org.apache.kahadb.page.Transaction.load(Transaction.java:367)
> at org.apache.kahadb.index.ListIndex.loadNode(ListIndex.java:306)
> at org.apache.kahadb.index.ListIndex.getHead(ListIndex.java:99)
> at org.apache.kahadb.index.ListIndex.iterator(ListIndex.java:284)
> at org.apache.activemq.store.kahadb.plist.PList$PListIterator.<init>(PList.java:199)
> at org.apache.activemq.store.kahadb.plist.PList.iterator(PList.java:189)
> at org.apache.activemq.broker.region.cursors.FilePendingMessageCursor$DiskIterator.<init>(FilePendingMessageCursor.java:496)
> {noformat}
> h4. Cause
> During broker startup, the BrokerService.startPersistenceAdapter() method is
> called, which via doStartPersistenceAdapter() and getProducerSystemUsage()
> invokes getSystemUsage(), which in turn calls getTempDataStore(); that method
> summarily cleans out the existing contents of the tmp_storage directory.
> All of this happens *before* the broker lock is obtained in the
> PersistenceAdapter.start() call at the end of doStartPersistenceAdapter().
> So a JVM that does not become the master (because a broker is already running)
> and instead runs in slave mode, waiting to obtain the broker lock, interferes
> with and corrupts the running master's tmp_storage and thus breaks the broker.
> That is a critical bug. The slave has no business starting the persistence
> adapter and cleaning out data: it has not acquired the lock yet, so it is not
> allowed to do any work, period.
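> A simplified, hypothetical sketch of this ordering problem (the method names
> mirror the call chain above, but the bodies are illustrative, not the actual
> ActiveMQ code):
> {noformat}
> import java.io.File;
>
> public class StartupOrderSketch {
>     // Assumed shared location; in the broker this is <dataDirectory>/tmp_storage.
>     private final File tmpStorage = new File("/mnt/nfs/activemq-data/tmp_storage");
>
>     public void doStartPersistenceAdapter() throws Exception {
>         getTempDataStore();          // cleans tmp_storage first -- even in a slave JVM
>         startAdapterAndAcquireLock();// only here does a slave block on the broker lock
>     }
>
>     private void getTempDataStore() {
>         // The cleanup that wipes a running master's PList files
>         // (tmpDB.data, tmpDB.redo, journal files) out from under it.
>         File[] files = tmpStorage.listFiles();
>         if (files != null) {
>             for (File f : files) {
>                 f.delete();
>             }
>         }
>     }
>
>     private void startAdapterAndAcquireLock() {
>         // PersistenceAdapter.start(): blocks until the shared file lock is obtained.
>     }
> }
> {noformat}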
> h4. Workaround
> As a workaround, specify an unshared local directory as the broker's temp
> directory, even if the main broker directory is shared. Since broker startup
> clears tmp_storage anyway, there is no advantage to keeping it in a shared
> location: the next broker that starts up after a broker failure will never
> reuse that data.
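> A minimal sketch of that workaround, assuming the BrokerService
> setTmpDataDirectory property (the paths are placeholders):
> {noformat}
> import java.io.File;
> import org.apache.activemq.broker.BrokerService;
>
> public class LocalTempDirWorkaround {
>     public static void main(String[] args) throws Exception {
>         BrokerService broker = new BrokerService();
>         // Shared directory: still used for the store and the master/slave lock.
>         broker.setDataDirectory("/mnt/nfs/activemq-data");
>         // Unshared, node-local temp directory: a starting slave can no longer
>         // wipe the master's PList files.
>         broker.setTmpDataDirectory(new File("/var/local/activemq/tmp_storage"));
>         broker.start();
>         broker.waitUntilStopped();
>     }
> }
> {noformat}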
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)