[
https://issues.apache.org/jira/browse/AMQ-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Volker Kleinschmidt updated AMQ-6005:
-------------------------------------
Description:
h4. Background
When multiple JVMs run AMQ in a master/slave configuration with the broker
directory on a shared filesystem (as is required e.g. for kahaPersistence),
and when high message volume or slow consumers push the broker's memory needs
past the configured memory usage limit, AMQ overflows asynchronous messages
to a PList store inside the "tmp_storage" subdirectory of that shared broker
directory.
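For illustration, here is a minimal embedded-broker sketch of the kind of
setup that hits this; the broker name, path, and limit are made up rather
than taken from our actual configuration:
{code:java}
import org.apache.activemq.broker.BrokerService;

public class SharedStoreBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("broker1");
        // Shared filesystem location: a second JVM pointed at the same
        // directory starts as a slave, waiting for the store lock.
        broker.setDataDirectory("/mnt/nfs/activemq-data");
        // Deliberately low memory limit: once exceeded, pending messages are
        // spooled to a PList store under <dataDirectory>/tmp_storage.
        broker.getSystemUsage().getMemoryUsage().setLimit(64L * 1024 * 1024);
        broker.start();
        broker.waitUntilStopped();
    }
}
{code}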
h4. Issue
We frequently observed this tmpDB store getting corrupted with "stale NFS
filehandle" errors for tmpDB.data, tmpDB.redo, and some journal files, all of
which suddenly went missing from the tmp_storage folder. This puts the entire
broker into a bad state from which it cannot recover. Only restarting the
service restores a working state, at the cost of a slave taking over and the
still-undelivered messages being lost.
h4. Symptoms
Stack trace:
{noformat}
...
Caused by: java.io.IOException: Stale file handle
at java.io.RandomAccessFile.readBytes0(Native Method)
at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
at org.apache.kahadb.page.Transaction$2.readPage(Transaction.java:446)
at org.apache.kahadb.page.Transaction$2.<init>(Transaction.java:437)
at org.apache.kahadb.page.Transaction.openInputStream(Transaction.java:434)
at org.apache.kahadb.page.Transaction.load(Transaction.java:410)
at org.apache.kahadb.page.Transaction.load(Transaction.java:367)
at org.apache.kahadb.index.ListIndex.loadNode(ListIndex.java:306)
at org.apache.kahadb.index.ListIndex.getHead(ListIndex.java:99)
at org.apache.kahadb.index.ListIndex.iterator(ListIndex.java:284)
at org.apache.activemq.store.kahadb.plist.PList$PListIterator.<init>(PList.java:199)
at org.apache.activemq.store.kahadb.plist.PList.iterator(PList.java:189)
at org.apache.activemq.broker.region.cursors.FilePendingMessageCursor$DiskIterator.<init>(FilePendingMessageCursor.java:496)
{noformat}
h4. Cause
During broker startup, the BrokerService.startPersistenceAdapter() method is
called. Via doStartPersistenceAdapter() and getProducerSystemUsage() it
invokes getSystemUsage(), which calls getTempDataStore(), and that method
summarily cleans out the existing contents of the tmp_storage directory.
All of this happens *before* the broker lock is obtained in the
PersistenceAdapter.start() call at the end of doStartPersistenceAdapter().
So a JVM that does not get to be the master (because there already is one)
and runs in slave mode, waiting to obtain the broker lock, interferes with
and corrupts the running master's tmp_storage, and thus breaks the broker.
That is a critical bug: the slave has no business starting the persistence
adapter and cleaning out data, because it has not obtained the lock yet and
therefore is not allowed to do any work, period.
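The faulty ordering, condensed into a runnable sketch (this illustrates the
sequence described above, it is not the actual BrokerService code;
acquireBrokerLock() is a hypothetical stand-in for the real lock
acquisition):
{code:java}
import java.io.File;

public class SlaveStartupOrder {
    public static void main(String[] args) {
        File tmpStorage = new File("/mnt/nfs/activemq-data/tmp_storage");

        // (1) startPersistenceAdapter() -> doStartPersistenceAdapter() ->
        //     getProducerSystemUsage() -> getSystemUsage() -> getTempDataStore()
        //     wipes the temp store -- even in a JVM that is only a slave.
        File[] files = tmpStorage.listFiles();
        if (files != null) {
            for (File f : files) {
                // The running master's tmpDB.data, tmpDB.redo, and journal
                // files vanish here, producing the stale-handle errors above.
                f.delete();
            }
        }

        // (2) Only now, in PersistenceAdapter.start() at the end of
        //     doStartPersistenceAdapter(), is the broker lock requested.
        //     A slave blocks here -- but the damage is already done.
        acquireBrokerLock(new File("/mnt/nfs/activemq-data/lock"));
    }

    private static void acquireBrokerLock(File lockFile) {
        // Placeholder: a real slave blocks here until the master releases
        // the lock file.
    }
}
{code}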
h4. Workaround
As a workaround, an unshared local directory needs to be specified as the
broker's temp directory (tmpDataDirectory), even if the main broker directory
is shared. There is no advantage to keeping this data in a shared location
anyway: broker startup clears tmp_storage out, so the next broker that starts
up after a failure never reuses that data.
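A minimal sketch of the workaround configuration (paths again made up): the
KahaDB store stays on the shared filesystem so failover still works, while
the temp store moves to node-local disk.
{code:java}
import java.io.File;
import org.apache.activemq.broker.BrokerService;

public class LocalTempDirBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        // Main data directory stays shared for master/slave failover.
        broker.setDataDirectory("/mnt/nfs/activemq-data");
        // Workaround: unshared local temp store, so a starting slave can
        // only ever wipe its own node-local tmp_storage.
        broker.setTmpDataDirectory(new File("/var/local/activemq/tmp_storage"));
        broker.start();
        broker.waitUntilStopped();
    }
}
{code}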
was:
h4. Background
When multiple JVMs run AMQ in a master/slave configuration with the broker
directory in a shared filesystem location (as is required e.g. for
kahaPersistence), and when due to high message volume or slow producers the
broker's memory needs exceed the configured memory usage limit, AMQ will
overflow asynchronous messages to a PList store inside the "tmp_storage"
subdirectory of said shared broker directory.
h4. Issue
We frequently observed this tmpDB store getting corrupted with "stale NFS
filehandle" errors for tmpDB.data, tmpDB.redo, and some journal files, all of
which suddenly went missing from the tmp_storage folder. This puts the entire
broker into a bad state from which it cannot recover. Only restarting the
service (which causes a broker slave to take over and loses the yet-undelivered
messages) gets a working state back.
h4. Symptoms
Stack trace:
{noformat}
...
Caused by: java.io.IOException: Stale file handle
at java.io.RandomAccessFile.readBytes0(Native Method)
at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
at org.apache.kahadb.page.Transaction$2.readPage(Transaction.java:446)
at org.apache.kahadb.page.Transaction$2.<init>(Transaction.java:437)
at org.apache.kahadb.page.Transaction.openInputStream(Transaction.java:434)
at org.apache.kahadb.page.Transaction.load(Transaction.java:410)
at org.apache.kahadb.page.Transaction.load(Transaction.java:367)
at org.apache.kahadb.index.ListIndex.loadNode(ListIndex.java:306)
at org.apache.kahadb.index.ListIndex.getHead(ListIndex.java:99)
at org.apache.kahadb.index.ListIndex.iterator(ListIndex.java:284)
at org.apache.activemq.store.kahadb.plist.PList$PListIterator.<init>(PList.java:199)
at org.apache.activemq.store.kahadb.plist.PList.iterator(PList.java:189)
at org.apache.activemq.broker.region.cursors.FilePendingMessageCursor$DiskIterator.<init>(FilePendingMessageCursor.java:496)
{noformat}
h4. Cause
During BrokerThread startup, the BrokerService.startPersistenceAdapter() method
is called, which eventually invokes getSystemUsage(), that calls
getTempDataStore(), and that summarily cleans out the existing contents of the
tmp_storage directory. All of this happens before the broker lock is obtained
in the startBroker() method. So a JVM that doesn't get to be the broker
(because there already is one) and runs in slave mode (waiting to obtain the
broker lock) interferes with and corrupts the running broker's tmp_storage and
thus breaks the broker. That's a critical bug. The slave has no business
starting up the persistence adapter and cleaning out data as it hasn't gotten
the lock yet, so isn't allowed to do any work, period.
h4. Workaround
As workaround, an unshared local directory needs to be specified as
tempDirectory for the broker, even if the main broker directory is shared.
Also, since broker startup will clear the tmp_storage out anyway, there really
is no advantage to having this in a shared location - since the next broker
that starts up after a broker failure will never re-use that data anyway.
> Slave broker startup corrupts shared PList storage
> --------------------------------------------------
>
> Key: AMQ-6005
> URL: https://issues.apache.org/jira/browse/AMQ-6005
> Project: ActiveMQ
> Issue Type: Bug
> Components: KahaDB
> Affects Versions: 5.7.0, 5.10.0
> Environment: RHLinux6
> Reporter: Volker Kleinschmidt
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)