[
https://issues.apache.org/jira/browse/ARTEMIS-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905123#comment-17905123
]
Rick Parker commented on ARTEMIS-4928:
--------------------------------------
Some updates:
# Upgraded to 2.38 and it was still happening
# As previously mentioned, dropping from dual broker to single broker in the
same JVM reduced the frequency of the problem by roughly 3x
# We still don't know how to reproduce this outside of performance testing our
whole application deployment
# We have attempted the workaround I previously speculated might work and have
yet to see it fail. Since dropping to a single broker, we have done 5x the
number of test runs it was taking to reproduce the issue, without any
failures - i.e. 15x the number of runs it previously took to experience a
failure when we first started investigating. Those runs continue.
# The workaround is (or seems to be) to turn off the ID de-duplication buffer
in Artemis (we have our own anyway) and to resend any producer send that is not
ack'd (via the send callback) within an expected timeframe. This was very
simple for us to implement - we are already waiting on a Future, so we added a
timeout and treat a TimeoutException as the trigger for a resend (a rough
sketch follows below the list).
# It seems a stuck send does not block the broker or subsequent send traffic
at all, whether from the same producer or otherwise.
The latency spike of the timeout and resend is obviously a problem in some
scenarios, but far less of a problem than a stuck message.
> SendAcknowledgementHandler not getting called
> ---------------------------------------------
>
> Key: ARTEMIS-4928
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4928
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.32.0, 2.35.0, 2.36.0
> Environment: The environment is Linux based, with Azul Java 17. I
> can update with more precise details if needed.
> Artemis version is 2.32.0. However, Artemis broker and the application (and
> thus client producer) are in the same JVM with socket transports.
> We do not see any exceptions in our logs.
>
> Reporter: Rick Parker
> Priority: Critical
> Attachments: image-2024-07-17-13-41-55-900.png,
> image-2024-07-17-13-49-53-962.png, image-2024-08-12-17-44-03-698.png,
> image-2024-08-12-17-45-55-812.png, image-2024-08-12-18-36-35-508.png,
> image-2024-08-12-18-37-10-723.png, image-2024-08-12-18-37-44-279.png,
> image-2024-09-04-15-09-31-595.png, image-2024-09-04-15-10-04-610.png,
> image-2024-10-16-16-08-29-866.png
>
>
> We have been using ActiveMQ Artemis since 2016, and recently upgraded from
> 2.19.1 on JDK 8 to 2.32.0 on JDK 17. Since that upgrade we have occasionally
> experienced what looks like a failure to acknowledge the sending of a message
> by a (CORE) producer, and it brings our application to a halt.
> When I say occasionally, we have a nightly performance test of our
> application that sends about 20-30 million messages from the one producer.
> This failure to acknowledge the send so far has happened twice in the space
> of about a month, which means it is happening approximately every 250-400
> million messages or perhaps more. This also means we don't currently have a
> self-contained reproduction of the problem. We are starting to think about
> how we might reproduce it more frequently, if possible, since we have now
> seen it twice and have gained a tiny bit more understanding.
> The symptom is that we never get called back from the send, and inspecting a
> heap dump, I _think_, confirms that the producer is sitting on a send - but I
> am not an expert on the internal workings of Artemis and many apologies in
> advance if I either mislead or point fingers inappropriately.
> We will try upgrading to the latest 2.35.0 (as of the time of writing) to see
> if it goes away - however, the fixed issues don't immediately suggest that it
> might be solved.
> The API from which we do not get called back is:
> {{org.apache.activemq.artemis.api.core.client.ClientProducer.send(SimpleString address, Message message, SendAcknowledgementHandler handler)}}
> Can a misbehaving handler/callback somehow cause this? E.g. what happens if
> it throws an exception? (We are not seeing one bubble up anywhere, but we
> have not ruled it out.)
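> A simplified illustration of the call pattern (the handler class and its body
> below are placeholders, not the real code), with the handler guarding its own
> body so that an exception thrown inside the callback is at least visible
> rather than silently lost:
> {code:java}
> import org.apache.activemq.artemis.api.core.Message;
> import org.apache.activemq.artemis.api.core.client.SendAcknowledgementHandler;
>
> final class DefensiveAckHandler implements SendAcknowledgementHandler {
>    @Override
>    public void sendAcknowledged(Message message) {
>       try {
>          // application bookkeeping for the acknowledged send goes here
>       } catch (RuntimeException e) {
>          // an exception escaping this callback would be easy to miss otherwise
>          System.err.println("sendAcknowledged failed: " + e);
>       }
>    }
> }
>
> // usage: producer.send(address, message, new DefensiveAckHandler());
> {code}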
> I have a screenshot of what looks like an interesting part of the heap dump -
> the {{ChannelImpl}}. To my eyes the {{firstStoredCommandID}} value looks out
> of sync with the content ({{correlationID}} of the message) of the
> {{resendCache}}, which is lagging behind for some reason. 8,815,497 is the
> message that has not had the handler called. But like I say, I'm looking at
> all this for the first time with little understanding.
> !image-2024-07-17-13-41-55-900.png!
> It also looks like the same message is still present in the broker data
> structures / heap dump, along with 8,815,495
> !image-2024-07-17-13-49-53-962.png!