[
https://issues.apache.org/jira/browse/ARTEMIS-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890139#comment-17890139
]
Justin Bertram edited comment on ARTEMIS-4928 at 2/6/25 5:36 PM:
-----------------------------------------------------------------
A bit of an update. We have continued testing with a single broker (2.36.0) in
the JVM. The incidence of the underlying problem is now much less frequent.
It takes us about 3x the number of runs of our performance tests before we see
a lock up now, vs. when we had dual brokers in the JVM. Ultimately it looks
like the journal operations just get "stuck" and the callbacks don't fire
indicating they have been processed. Here is an example trace to the GC root
of a message ID (de-dup ID):
!image-2024-10-16-16-08-29-866.png|width=575,height=175!
Interestingly, in this particular instance, I managed to persuade the sender to
tear down the session and resend (I ran {{tcpkill}} on the socket, and the
sender is designed to reconnect). Unfortunately, the resend was de-duplicated
by Artemis via the de-duplication cache, so it didn't resolve the problem, and
I had to restart the JVM containing the broker; everything then unjammed and
continued. As a work-around I've been considering detecting very slow acks on
sends and, if a send is deemed too slow and likely stuck, tearing down the
session, re-establishing it, and re-sending. Clearly that would require turning
off the de-duplication cache. I'd need to think harder about the consequences
of that, but we already do our own de-duplication against a database when
processing the messages, so superficially it shouldn't break anything; we
would, however, have to take over de-duplicating the contents of the
confirmation window that Artemis currently absorbs.
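To make the slow-ack idea concrete, here is a minimal stdlib-only sketch of the watchdog part. The {{AckWatchdog}} class and its method names are hypothetical (not Artemis API); in real code {{onAcknowledged()}} would be called from {{SendAcknowledgementHandler.sendAcknowledged(Message)}}, and a {{false}} return from {{awaitAck}} would trigger the session teardown and resend:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: waits for the send-ack callback; if it does not
// arrive within the deadline, the caller can treat the send as stuck and
// tear the session down to resend.
class AckWatchdog {
    private final CountDownLatch acked = new CountDownLatch(1);

    // Would be invoked from SendAcknowledgementHandler.sendAcknowledged(...)
    // when the broker confirms the send.
    void onAcknowledged() {
        acked.countDown();
    }

    // Returns true if the ack arrived in time, false if the send looks stuck.
    boolean awaitAck(long timeout, TimeUnit unit) throws InterruptedException {
        return acked.await(timeout, unit);
    }
}
```

If {{awaitAck}} returns false, the resend only helps when broker-side duplicate detection won't silently drop it, which is exactly the problem observed above; either the de-duplication cache is off, or the resend must carry a fresh duplicate-detection ID.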
Any further thoughts or suggestions?
I still don't have a self-contained reproduction that I could share.
> SendAcknowledgementHandler not getting called
> ---------------------------------------------
>
> Key: ARTEMIS-4928
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4928
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.32.0, 2.35.0, 2.36.0
> Environment: The environment is Linux based, with Azul Java 17. I
> can update with more precise details if needed.
> Artemis version is 2.32.0. However, Artemis broker and the application (and
> thus client producer) are in the same JVM with socket transports.
> We do not see any exceptions in our logs.
>
> Reporter: Rick Parker
> Priority: Critical
> Attachments: image-2024-07-17-13-41-55-900.png,
> image-2024-07-17-13-49-53-962.png, image-2024-08-12-17-44-03-698.png,
> image-2024-08-12-17-45-55-812.png, image-2024-08-12-18-36-35-508.png,
> image-2024-08-12-18-37-10-723.png, image-2024-08-12-18-37-44-279.png,
> image-2024-09-04-15-09-31-595.png, image-2024-09-04-15-10-04-610.png,
> image-2024-10-16-16-08-29-866.png
>
>
> We have been using ArtemisMQ since 2016, and recently upgraded from 2.19.1
> on JDK8 to 2.32.0 on JDK17. Since that upgrade we occasionally experience
> what looks like a failure to acknowledge the sending of a message by a (CORE)
> producer, and it brings our application to a halt.
> When I say occasionally, we have a nightly performance test of our
> application that sends about 20-30 million messages from the one producer.
> This failure to acknowledge the send so far has happened twice in the space
> of about a month, which means it is happening approximately every 250-400
> million messages, or perhaps more. This also means we don't currently have a
> self-contained reproduction of the problem. We are starting to think about
> how we might reproduce it more frequently, if possible, since we have now
> seen it twice and have gained a tiny bit more understanding.
> The symptom is a failure to be called back from the send, and inspecting a
> heap dump I _think_ confirms that the producer is sitting on a send - but I
> am not an expert on the internal workings of Artemis and many apologies in
> advance if I either mislead or point fingers inappropriately.
> We will try upgrading to the latest 2.35.0 (as at the time of writing) to see
> if it goes away, although the fixed issues don't immediately suggest that it
> will be solved.
> The API from which we do not get called back is:
> {{org.apache.activemq.artemis.api.core.client.ClientProducer.send(SimpleString
> address, Message message, SendAcknowledgementHandler handler)}}
> Can a misbehaving handler/callback somehow cause this? E.g. what happens if
> it throws an exception? (We are not seeing one bubble up anywhere, but have
> not ruled it out.)
> I have a screenshot of what looks like an interesting part of the heap dump -
> the {{{}ChannelImpl{}}}. To my eyes the {{firstStoredCommandID}} value looks
> out of sync with the content ({{{}correlationID{}}} of message) of the
> {{resendCache}} which is lagging behind for some reason. 8,815,497 is the
> message that has not had the handler called. But like I say, I'm looking at
> all this for the first time with little understanding.
> !image-2024-07-17-13-41-55-900.png!
> It also looks like the same message is still present in the broker data
> structures / heap dump, along with 8,815,495
> !image-2024-07-17-13-49-53-962.png!
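On the misbehaving-handler question raised in the issue: I can't say from the source whether Artemis swallows an exception thrown by the handler, but a defensive wrapper is cheap to add on the application side while investigating. A stdlib-only sketch, where our own {{Handler}} interface and {{SafeHandler}} class are hypothetical stand-ins for {{SendAcknowledgementHandler}}:

```java
// Hypothetical stand-in for
// org.apache.activemq.artemis.api.core.client.SendAcknowledgementHandler.
interface Handler {
    void sendAcknowledged(Object message);
}

// Wraps a handler so an exception thrown by the callback is recorded
// instead of propagating into the client's confirmation-handling thread.
class SafeHandler implements Handler {
    private final Handler delegate;
    private volatile Throwable lastError; // exposed for monitoring/alerting

    SafeHandler(Handler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void sendAcknowledged(Object message) {
        try {
            delegate.sendAcknowledged(message);
        } catch (Throwable t) {
            lastError = t; // record rather than letting it escape silently
        }
    }

    Throwable lastError() {
        return lastError;
    }
}
```

Passing the wrapped handler into {{ClientProducer.send(...)}} at least rules out the "handler threw and the failure was swallowed" theory, since any callback exception becomes visible via {{lastError()}}.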
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact