[
https://issues.apache.org/jira/browse/AMQ-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Bish updated AMQ-6005:
------------------------------
Fix Version/s: 5.13.1
> Slave broker startup corrupts shared PList storage
> --------------------------------------------------
>
> Key: AMQ-6005
> URL: https://issues.apache.org/jira/browse/AMQ-6005
> Project: ActiveMQ
> Issue Type: Bug
> Components: KahaDB
> Affects Versions: 5.7.0, 5.10.0
> Environment: RHLinux6
> Reporter: Volker Kleinschmidt
> Assignee: Gary Tully
> Fix For: 5.13.1, 5.14.0
>
>
> h4. Background
> When multiple JVMs run AMQ in a master/slave configuration with the broker
> data directory on a shared filesystem (as is required e.g. for shared KahaDB
> persistence), and when high message volume or slow consumers push the broker's
> memory needs past the configured memory usage limit, AMQ overflows pending
> messages to a PList store inside the "tmp_storage" subdirectory of that shared
> broker directory.
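> For illustration, a minimal sketch of such a shared-store setup, assuming the
> standard BrokerService API (the path and memory limit below are placeholders,
> not values taken from this report):
> {noformat}
> import org.apache.activemq.broker.BrokerService;
>
> public class SharedStoreBrokerSketch {
>     public static void main(String[] args) throws Exception {
>         BrokerService broker = new BrokerService();
>         broker.setBrokerName("sharedBroker");
>         // Shared filesystem location; master and slave JVMs both point here.
>         // tmp_storage is created as a subdirectory of this data directory.
>         broker.setDataDirectory("/mnt/nfs/activemq-data");
>         // Once the memory limit is exceeded, pending messages spill over to
>         // the PList store under tmp_storage.
>         broker.getSystemUsage().getMemoryUsage().setLimit(64L * 1024 * 1024);
>         broker.start();
>         broker.waitUntilStopped();
>     }
> }
> {noformat}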
> h4. Issue
> We frequently observed this tmpDB PList store getting corrupted with "stale
> NFS file handle" errors for tmpDB.data, tmpDB.redo, and some journal files,
> all of which suddenly went missing from the tmp_storage folder. This puts the
> entire broker into a bad state from which it cannot recover. Only restarting
> the service (which causes a slave broker to take over and loses the
> not-yet-delivered messages) restores a working state.
> h4. Symptoms
> Stack trace:
> {noformat}
> ...
> Caused by: java.io.IOException: Stale file handle
> at java.io.RandomAccessFile.readBytes0(Native Method)
> at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
> at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
> at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
> at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
> at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
> at org.apache.kahadb.page.Transaction$2.readPage(Transaction.java:446)
> at org.apache.kahadb.page.Transaction$2.<init>(Transaction.java:437)
> at org.apache.kahadb.page.Transaction.openInputStream(Transaction.java:434)
> at org.apache.kahadb.page.Transaction.load(Transaction.java:410)
> at org.apache.kahadb.page.Transaction.load(Transaction.java:367)
> at org.apache.kahadb.index.ListIndex.loadNode(ListIndex.java:306)
> at org.apache.kahadb.index.ListIndex.getHead(ListIndex.java:99)
> at org.apache.kahadb.index.ListIndex.iterator(ListIndex.java:284)
> at org.apache.activemq.store.kahadb.plist.PList$PListIterator.<init>(PList.java:199)
> at org.apache.activemq.store.kahadb.plist.PList.iterator(PList.java:189)
> at org.apache.activemq.broker.region.cursors.FilePendingMessageCursor$DiskIterator.<init>(FilePendingMessageCursor.java:496)
> {noformat}
> h4. Cause
> During broker startup, the BrokerService.startPersistenceAdapter() method is
> called, which via doStartPersistenceAdapter() and getProducerSystemUsage()
> invokes getSystemUsage(), which in turn calls getTempDataStore(); that method
> summarily cleans out the existing contents of the tmp_storage directory.
> All of this happens *before* the broker lock is obtained in the
> PersistenceAdapter.start() call at the end of doStartPersistenceAdapter().
> So a JVM that does not become the master (because a broker is already running)
> and instead runs in slave mode, waiting to obtain the broker lock, interferes
> with and corrupts the running master's tmp_storage and thus breaks the broker.
> That is a critical bug. The slave has no business starting the persistence
> adapter and cleaning out data: it has not acquired the lock yet, so it is not
> allowed to do any work, period.
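> A simplified, hypothetical sketch of this ordering problem (the method names
> mirror the call chain above, but the bodies are illustrative, not the actual
> ActiveMQ code):
> {noformat}
> import java.io.File;
>
> public class StartupOrderSketch {
>     // Assumed shared location; in the broker this is <dataDirectory>/tmp_storage.
>     private final File tmpStorage = new File("/mnt/nfs/activemq-data/tmp_storage");
>
>     public void doStartPersistenceAdapter() throws Exception {
>         getTempDataStore();          // cleans tmp_storage first -- even in a slave JVM
>         startAdapterAndAcquireLock();// only here does a slave block on the broker lock
>     }
>
>     private void getTempDataStore() {
>         // The cleanup that wipes a running master's PList files
>         // (tmpDB.data, tmpDB.redo, journal files) out from under it.
>         File[] files = tmpStorage.listFiles();
>         if (files != null) {
>             for (File f : files) {
>                 f.delete();
>             }
>         }
>     }
>
>     private void startAdapterAndAcquireLock() {
>         // PersistenceAdapter.start(): blocks until the shared file lock is obtained.
>     }
> }
> {noformat}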
> h4. Workaround
> As a workaround, specify an unshared local directory as the broker's temp
> directory, even if the main broker directory is shared. Since broker startup
> clears tmp_storage anyway, there is no advantage to keeping it in a shared
> location: the next broker that starts up after a broker failure will never
> reuse that data.
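> A minimal sketch of that workaround, assuming the BrokerService
> setTmpDataDirectory property (the paths are placeholders):
> {noformat}
> import java.io.File;
> import org.apache.activemq.broker.BrokerService;
>
> public class LocalTempDirWorkaround {
>     public static void main(String[] args) throws Exception {
>         BrokerService broker = new BrokerService();
>         // Shared directory: still used for the store and the master/slave lock.
>         broker.setDataDirectory("/mnt/nfs/activemq-data");
>         // Unshared, node-local temp directory: a starting slave can no longer
>         // wipe the master's PList files.
>         broker.setTmpDataDirectory(new File("/var/local/activemq/tmp_storage"));
>         broker.start();
>         broker.waitUntilStopped();
>     }
> }
> {noformat}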
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)