[
https://issues.apache.org/jira/browse/ARTEMIS-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059883#comment-18059883
]
Frederik Fournes commented on ARTEMIS-5907:
-------------------------------------------
Dear [~tabish],
Thank you for your answer. I will forward this request to the mailing list. This
ticket can be closed.
Best regards,
Frederik Fournes
> OOME caused by accumulation of PageTransactionInfoImpl and JournalRecord
> objects in paged queue with no consumer
> ----------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-5907
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5907
> Project: Artemis
> Issue Type: Bug
> Affects Versions: 2.39.0
> Reporter: Frederik Fournes
> Priority: Minor
>
> Hi all,
> we are running Apache ActiveMQ Artemis 2.39.0 on OpenJDK 17 in a Kubernetes
> environment and experienced an OutOfMemoryError on our production broker. We
> are assuming that this could be a bug. We have been investigating the root
> cause and would appreciate the community's input on our findings and open
> questions.
> h3. Environment
> - Artemis version: 2.39.0
> - Java: OpenJDK 17 (G1 GC)
> - JVM: -Xms4g -Xmx9g
> - global-max-size: 800M
> h3. Situation
> Our setup uses a software with an internal Artemis broker per endpoint. The
> broker handles message routing between a Business Application (BA), the
> endpoint itself, and a central broker.
> A feature called "AMQP Send Handler" writes a SendEvent into the queue
> `ecp.endpoint.send.event` for every message the endpoint sends. This handler
> was enabled, but no consumer was ever connected to this queue.
> Over approximately 1.5 years of continuous operation, this queue accumulated
> 22,240,016 messages with 0 consumers and 0 acknowledgements.
> h3. The OOME
> The JVM heap showed a sawtooth pattern consistently reaching ~95%, with the
> GC managing to recover each time. Eventually a single spike pushed usage to
> ~99.9% and triggered the OutOfMemoryError.
> h3. Heap analysis
> We ran `jcmd 1 GC.class_histogram` on the production broker and found the
> following top heap consumers:
> ||Class||Instances||Bytes||
> |PageTransactionInfoImpl|22,132,340|1,062,352,320 (~1 GB)|
> |ConcurrentHashMap$Node|22,160,468|709,134,976 (~676 MB)|
> |JournalRecord|22,256,890|534,165,360 (~509 MB)|
> |Long|22,132,609|531,182,616 (~506 MB)|
> The instance counts correlate almost exactly with the 22M stuck messages in
> `ecp.endpoint.send.event`. These four object types alone consumed
> approximately 2.8 GB of heap.
> All other objects (AMQPStandardMessage, MessageReferenceImpl, etc.) had
> normal counts (~130K instances), consistent with the actively processed
> queues.
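As a cross-check of the figures above, the histogram can be parsed and summed programmatically. The sketch below reproduces the four rows from the production histogram as sample input; the column layout is assumed to match OpenJDK 17's `jcmd <pid> GC.class_histogram` output, and the fully qualified class names are taken from the Artemis codebase:

```python
# Sketch: parse `jcmd <pid> GC.class_histogram` output and sum the heap held
# by the classes implicated in the report. SAMPLE mirrors the production
# figures; the column layout follows OpenJDK 17's histogram format.
SAMPLE = """\
 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:      22132340     1062352320  org.apache.activemq.artemis.core.paging.impl.PageTransactionInfoImpl
   2:      22160468      709134976  java.util.concurrent.ConcurrentHashMap$Node (java.base)
   3:      22256890      534165360  org.apache.activemq.artemis.core.journal.impl.JournalRecord
   4:      22132609      531182616  java.lang.Long (java.base)
"""

def parse_histogram(text):
    """Return {class_name: (instances, bytes)} from histogram text."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with "<rank>:"; header and separator lines do not.
        if len(parts) >= 4 and parts[0].endswith(":"):
            rows[parts[3]] = (int(parts[1]), int(parts[2]))
    return rows

rows = parse_histogram(SAMPLE)
total = sum(size for _, size in rows.values())
print(f"total: {total / 1e9:.1f} GB")  # -> total: 2.8 GB
for name, (count, size) in rows.items():
    # Per-instance shallow size; retained size (referenced Longs, map nodes)
    # is larger, which is why the four classes must be summed together.
    print(f"{name.rsplit('.', 1)[-1]}: {size // count} bytes/instance")
```

Run against the sample, the four classes sum to ~2.8 GB, matching the figure in the report; per-instance shallow sizes come out to 48 bytes for PageTransactionInfoImpl and 24-32 bytes for the others.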
> h3. Resolution
> We purged the 22M messages from the queue using `removeMessages` with a low
> flushLimit. The heap usage dropped significantly after the purge. We also
> disabled the Send Handler to prevent re-accumulation.
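For reference, the purge was done through the management API (`QueueControl.removeMessages(int flushLimit, String filter)`). The sketch below only builds the Jolokia request for that operation; the broker name (`amq`), console URL, and credentials in the comment are assumptions to adapt to the actual setup:

```python
import json

# Sketch: build the Jolokia "exec" payload that invokes
# QueueControl.removeMessages with a small flushLimit and an empty filter
# (i.e. all messages). Broker name "amq" is an assumption.
BROKER = "amq"
QUEUE = "ecp.endpoint.send.event"

mbean = (
    f'org.apache.activemq.artemis:broker="{BROKER}",'
    f'component=addresses,address="{QUEUE}",'
    f'subcomponent=queues,routing-type="anycast",queue="{QUEUE}"'
)
payload = json.dumps({
    "type": "exec",
    "mbean": mbean,
    "operation": "removeMessages(int,java.lang.String)",
    "arguments": [100, ""],  # flushLimit=100, empty filter = all messages
})
print(payload)
# Send it with e.g. (URL and credentials are placeholders):
#   curl -u admin:admin -H 'Origin: http://localhost' \
#        -d "$PAYLOAD" http://localhost:8161/console/jolokia
```

A low flushLimit keeps each journal transaction small, which matters when deleting tens of millions of records.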
> h3. Reproduction attempt (ACCE environment)
> We attempted to reproduce this on a test broker with the same Artemis version
> and identical broker.xml configuration, but with -Xmx 1G. We sent >10M
> messages to the same queue (no consumer). However, the heap histogram showed
> a very different picture:
> ||Class||PROD (22M msgs)||ACCE (10M+ msgs)||
> |PageTransactionInfoImpl|22,132,340|188,264|
> |JournalRecord|22,256,890|315,352|
> |MessageReferenceImpl|124,518|127,035|
> Despite having millions of paged messages, the ACCE broker only held ~188K
> PageTransactionInfoImpl objects in heap (vs. 22M in PROD). JVM usage stayed
> stable around 50%.
> h3. Questions for the community
> 1. Can someone confirm that Artemis keeps a PageTransactionInfoImpl and
> JournalRecord in heap for each paged message as long as the message is not
> consumed/acknowledged? Is this by design?
> 2. Why is there such a large discrepancy between PROD and ACCE? Both have the
> same broker configuration, both had millions of paged messages with 0
> consumers. Our hypothesis is that the long-running production environment
> (1.5 years, continuous message flow across other queues) leads to journal
> fragmentation/accumulation that prevents journal compaction from cleaning up
> the PageTransactionInfoImpl records, whereas in the short-lived test scenario
> the compaction process works efficiently. Is this plausible?
> 3. Shouldn't the paging mechanism prevent exactly this scenario, i.e. the
> heap filling up because of a large number of unconsumed messages?
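Regarding question 3: paging moves message bodies to disk, but it does not by itself bound the per-message bookkeeping described above. Since Artemis 2.28 a page-store limit can also be configured so the address stops accepting messages once the page files themselves grow too large. A minimal broker.xml sketch, assuming the element names from the Artemis address-settings documentation and a per-queue match (the match value and limits are illustrative, not taken from the actual configuration):

```xml
<address-settings>
   <address-setting match="ecp.endpoint.send.event">
      <!-- 104857600 bytes = 100 MiB in memory before paging kicks in -->
      <max-size-bytes>104857600</max-size-bytes>
      <address-full-policy>PAGE</address-full-policy>
      <!-- since 2.28: cap the page store itself (5368709120 bytes = 5 GiB)
           and DROP further messages once the cap is reached -->
      <page-limit-bytes>5368709120</page-limit-bytes>
      <page-full-policy>DROP</page-full-policy>
   </address-setting>
</address-settings>
```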
> Thanks in advance for any insights.
> Best regards
> Frederik
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]