Volker Kleinschmidt created AMQ-6005:
----------------------------------------
Summary: Slave broker startup corrupts shared PList storage
Key: AMQ-6005
URL: https://issues.apache.org/jira/browse/AMQ-6005
Project: ActiveMQ
Issue Type: Bug
Components: KahaDB
Affects Versions: 5.10.0, 5.7.0
Environment: RHLinux6
Reporter: Volker Kleinschmidt
h4. Background
When multiple JVMs run AMQ in a master/slave configuration with the broker
directory in a shared filesystem location (as is required e.g. for
kahaPersistence), and when the broker's memory needs exceed the configured
memory usage limit due to high message volume or slow consumers, AMQ will
overflow asynchronous messages to a PList store inside the "tmp_storage"
subdirectory of that shared broker directory.
h4. Issue
We frequently observed this tmpDB store getting corrupted with "Stale file
handle" errors for tmpDB.data, tmpDB.redo, and some journal files, all of
which suddenly went missing from the tmp_storage folder. This puts the entire
broker into a bad state from which it cannot recover. Only restarting the
service (which causes a slave broker to take over and loses the
yet-undelivered messages) restores a working state.
h4. Symptoms
Stack trace:
{noformat}
...
Caused by: java.io.IOException: Stale file handle
at java.io.RandomAccessFile.readBytes0(Native Method)
at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
at org.apache.kahadb.page.Transaction$2.readPage(Transaction.java:446)
at org.apache.kahadb.page.Transaction$2.<init>(Transaction.java:437)
at org.apache.kahadb.page.Transaction.openInputStream(Transaction.java:434)
at org.apache.kahadb.page.Transaction.load(Transaction.java:410)
at org.apache.kahadb.page.Transaction.load(Transaction.java:367)
at org.apache.kahadb.index.ListIndex.loadNode(ListIndex.java:306)
at org.apache.kahadb.index.ListIndex.getHead(ListIndex.java:99)
at org.apache.kahadb.index.ListIndex.iterator(ListIndex.java:284)
at org.apache.activemq.store.kahadb.plist.PList$PListIterator.<init>(PList.java:199)
at org.apache.activemq.store.kahadb.plist.PList.iterator(PList.java:189)
at org.apache.activemq.broker.region.cursors.FilePendingMessageCursor$DiskIterator.<init>(FilePendingMessageCursor.java:496)
{noformat}
h4. Cause
During broker startup, the BrokerService.startPersistenceAdapter() method is
called, which eventually invokes getSystemUsage(), which calls
getTempDataStore(), which summarily cleans out the existing contents of the
tmp_storage directory. All of this happens before the broker lock is obtained
in the startBroker() method. So a JVM that doesn't get to be the broker
(because one is already running) and runs in slave mode (waiting to obtain
the broker lock) interferes with and corrupts the running broker's
tmp_storage, breaking the broker. That's a critical bug. The slave has no
business starting up the persistence adapter and cleaning out data: it hasn't
obtained the lock yet, so it isn't allowed to do any work, period.
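The ordering problem can be illustrated with a small, self-contained simulation (this is NOT ActiveMQ source; the method name wipeTempStorage and the file layout are illustrative stand-ins for what getTempDataStore() does on startup):

{noformat}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class SlaveWipeDemo {

    // Stand-in for getTempDataStore(): clears tmp_storage on startup.
    static void wipeTempStorage(File tmpStorage) {
        File[] contents = tmpStorage.listFiles();
        if (contents != null) {
            for (File f : contents) {
                f.delete();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File sharedBrokerDir = Files.createTempDirectory("broker").toFile();
        File tmpStorage = new File(sharedBrokerDir, "tmp_storage");
        tmpStorage.mkdirs();

        // The master is already up and has spilled messages to its PList store.
        new File(tmpStorage, "tmpDB.data").createNewFile();
        new File(tmpStorage, "tmpDB.redo").createNewFile();

        // A second JVM starts: persistence-adapter init runs FIRST and
        // wipes the shared tmp_storage...
        wipeTempStorage(tmpStorage);
        // ...and only afterwards would it block waiting on the broker lock.

        // The master's live PList files are now gone out from under it.
        System.out.println(new File(tmpStorage, "tmpDB.data").exists());
    }
}
{noformat}

Running this prints false: the file the "master" depends on has been deleted by the would-be slave before the lock was ever contended, which is exactly the corruption observed above.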
h4. Workaround
As a workaround, an unshared local directory must be specified as the
broker's temp directory, even if the main broker directory is shared. And
since broker startup clears tmp_storage out anyway, there is no advantage to
keeping it in a shared location: the next broker that starts up after a
broker failure will never reuse that data.
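A minimal configuration sketch of the workaround, assuming the XBean attribute tmpDataDirectory (which maps to BrokerService.setTmpDataDirectory()); the paths and broker name here are illustrative, not from the original report:

{noformat}
<!-- dataDirectory stays on the shared filesystem for master/slave failover;
     tmpDataDirectory points at an unshared, node-local path. -->
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="sharedBroker"
        dataDirectory="/shared/activemq-data"
        tmpDataDirectory="/var/local/activemq/tmp_storage">
    ...
</broker>
{noformat}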
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)