ActiveMQ broker processing slows with consumption from large store
------------------------------------------------------------------
Key: AMQ-3028
URL: https://issues.apache.org/activemq/browse/AMQ-3028
Project: ActiveMQ
Issue Type: Bug
Components: Broker
Affects Versions: 5.4.1
Environment: CentOS 5.5, Sun JDK 1.6.0_21-b06 64 bit, ActiveMQ 5.4.1,
AMD Athlon(tm) II X2 B22, local disk
Reporter: Arthur Naseef
Priority: Critical
This problem occurred during scalability tests. I have tested a workaround that
appears to work. I will gladly submit a fix, but would like some guidance on
the most appropriate solution.
Here's the summary. Many more details are available upon request.
Root cause:
- Believed to be simultaneous access to LRUCache objects which are not
thread-safe (PageFile's pageCache)
Workaround:
- Synchronize the LRUCache on all access methods (get, put, remove)
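The workaround can be sketched roughly as follows. This is a minimal
illustration, not the actual org.apache.kahadb.util.LRUCache source: the class
name SynchronizedLRUCache and the field maxCacheSize are hypothetical, and the
sketch assumes the cache is built on an access-ordered java.util.LinkedHashMap.
In an access-ordered LinkedHashMap even get() reorders entries, so reads need
the same lock as writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the workaround: an LRU cache whose get, put and
// remove methods are synchronized on the cache instance itself.
class SynchronizedLRUCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    // Assumed name for the configured limit (10000 by default in the report).
    private final int maxCacheSize;

    SynchronizedLRUCache(int maxCacheSize) {
        // accessOrder = true: iteration order is least recently used first.
        super(16, 0.75f, true);
        this.maxCacheSize = maxCacheSize;
    }

    @Override
    public synchronized V get(Object key) {
        // get() relinks the entry in access order, so it must hold the lock.
        return super.get(key);
    }

    @Override
    public synchronized V put(K key, V value) {
        return super.put(key, value);
    }

    @Override
    public synchronized V remove(Object key) {
        return super.remove(key);
    }

    @Override
    public synchronized int size() {
        return super.size();
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called from inside put() while the monitor is already held.
        return size() > maxCacheSize;
    }
}
```

Only the methods named in the workaround (plus size) are guarded here; other
inherited Map methods, such as putAll and the iterators, remain unsynchronized
in this sketch.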
The symptoms are as follows:
1. Message rates stay fairly constant up to a point, then degrade rather
quickly
2. After a while (about 15 minutes), the message rates drop to the floor,
with many seconds in which 0 records pass
3. Using VisualVM or JConsole, note that memory use grows continuously
4. When message rates drop to the floor, the VM is spending the vast majority
of its time performing garbage collection
5. Heap dumps show that LRUCache objects (the pageCache members of PageFile
instances) far exceed their configured limits.
The default limit of 10000 was used, yet a size of over 170,000 entries was
reached.
6. No producer flow control occurred (did not see the flow control log
message)
Test scenario used to reproduce:
- Fast producers (limited to <= 1000 msgs/sec)
-- using transactions
-- 10 msg per transaction
-- message content size 177 bytes
- Slow consumers (limited to <= 10 msg/sec)
-- auto-acknowledge mode; not transacted
- 10 Queues
-- 1 producer per queue
-- 1 consumer per queue
- Producers, consumers, and broker running on different systems in some test
runs, and all on the same system in others.
Note that disk space was not an issue - there was always plenty of disk space
available.
One other interesting note: once a large database of records was stored in
KahaDB, this problem still occurred when running only consumers.
This issue sounds like it may be related to 1764 and 2721. The root cause
sounds the same as 2290 - unsynchronized access to LRUCache.
The most straightforward solution is to modify all LRUCache objects
(org.apache.kahadb.util.LRUCache, org.apache.activemq.util.LRUCache, ...) to be
concurrent. Another is to create concurrent versions (perhaps
ConcurrentLRUCache) and make use of those at least in PageFile.pageCache.
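As a rough illustration of the second option, a separate concurrent class
could wrap a private map instead of extending it, so that callers cannot reach
any unsynchronized method. This is only a sketch under assumptions: the name
ConcurrentLRUCache is the one floated above, while the constructor argument
maxCacheSize, the lock field, and the choice of an access-ordered
java.util.LinkedHashMap as backing store are hypothetical and not taken from
the ActiveMQ sources.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a thread-safe LRU cache built by composition.
// Every operation on the backing map goes through a single private lock.
class ConcurrentLRUCache<K, V> {
    private final Object lock = new Object();
    private final LinkedHashMap<K, V> entries;

    ConcurrentLRUCache(final int maxCacheSize) {
        // accessOrder = true keeps entries ordered least recently used first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            private static final long serialVersionUID = 1L;

            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Invoked from put() below, while the lock is held.
                return size() > maxCacheSize;
            }
        };
    }

    V get(K key) {
        synchronized (lock) {
            return entries.get(key);
        }
    }

    V put(K key, V value) {
        synchronized (lock) {
            return entries.put(key, value);
        }
    }

    V remove(K key) {
        synchronized (lock) {
            return entries.remove(key);
        }
    }

    int size() {
        synchronized (lock) {
            return entries.size();
        }
    }
}
```

Because the backing map is never exposed, the cache cannot grow past
maxCacheSize through an unguarded code path. The trade-off is that existing
callers relying on the full Map interface, such as PageFile.pageCache, would
need to be adapted.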