[ https://issues.apache.org/jira/browse/ARTEMIS-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin Bertram updated ARTEMIS-5735:
------------------------------------
Description:
h3. *Problem Description*
Under heavy load, queues get stuck and their consumers stop receiving
messages.
The cause appears to be a single Orphaned consumer that still holds messages in
transit, so no further messages are processed by the other consumers.
This is how it looks on our web console:
!image-2025-10-31-09-08-39-117.png|width=717,height=157!
This seems to be related to issue ARTEMIS-4476.
The added visibility from that issue helped us understand the problem, but it
did not fix it.
----
h3. *Technical Information*
After some investigation, I found that the issue is a race condition
between closing a connection and creating a new session.
I have created an example application that reproduces one part of the issue.
It shows that {{ServerSessionImpl.connectionFailed}} can miss the connection
failure event: the connection is closed before the session has been registered
in the connection's {{AbstractRemotingConnection.failureListeners}}.
In such cases an Orphaned session and consumer block the queue.
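To make the missed event concrete, here is a minimal single-threaded sketch. {{FakeConnection}}, {{FakeSession}} and both registration methods are hypothetical stand-ins, not Artemis classes: the connection fails and notifies its listeners before the session registers, so the session is never told and stays open. The guarded variant shows one common pattern (re-check the failed flag after registering); it is an assumption for illustration, not the fix adopted in Artemis.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class MissedFailureEventDemo {

    /** Hypothetical stand-in for a session that reacts to connection failure. */
    static final class FakeSession {
        volatile boolean closed;

        // Idempotent: with the guarded registration it may be called twice.
        void connectionFailed() {
            closed = true;
        }
    }

    /** Hypothetical stand-in for a connection that notifies its failure listeners. */
    static final class FakeConnection {
        final List<FakeSession> failureListeners = new CopyOnWriteArrayList<>();
        volatile boolean failed;

        void fail() {
            failed = true;
            for (FakeSession s : failureListeners) {
                s.connectionFailed();
            }
        }

        // Naive registration: a listener added after fail() is never notified.
        void addFailureListenerNaive(FakeSession s) {
            failureListeners.add(s);
        }

        // Guarded registration (assumed pattern): add first, then re-check the flag,
        // so a listener that registers after the failure still gets notified.
        void addFailureListenerGuarded(FakeSession s) {
            failureListeners.add(s);
            if (failed) {
                s.connectionFailed();
            }
        }
    }

    /** Returns true if the session ends up open on a failed connection. */
    static boolean orphaned(boolean guarded) {
        FakeConnection conn = new FakeConnection();
        FakeSession session = new FakeSession();
        conn.fail(); // the TTL check closes the connection first...
        if (guarded) {
            conn.addFailureListenerGuarded(session); // ...then the session registers
        } else {
            conn.addFailureListenerNaive(session);
        }
        return conn.failed && !session.closed;
    }

    public static void main(String[] args) {
        System.out.println("naive orphaned:   " + orphaned(false));
        System.out.println("guarded orphaned: " + orphaned(true));
    }
}
```

The naive run reports an orphaned session and the guarded run does not. Because both paths use volatile or concurrent operations, either the failing thread sees the new listener or the registering thread sees the flag; the cost is a possible double notification, which is why {{connectionFailed}} must be idempotent here.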
In more severe cases (as in our production environment), the session is
registered in {{failureListeners}} but has not yet been added to the sessions
map. The session is closed first and only afterwards added to the sessions map,
where it stays: it cannot be removed by manually closing the session from the
web console.
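The severe case can be sketched the same way. Everything below ({{StaleSessionEntryDemo}}, {{FakeSession}}, {{close}}, {{registerNaive}}, {{registerGuarded}}) is hypothetical and only mirrors the ordering described above: the close path runs before {{sessions.put}}, so its {{remove}} is a no-op and the later {{put}} leaves a closed session in the map. The guarded variant, which undoes the {{put}} if the session was closed meanwhile, is an assumed pattern and not necessarily how Artemis addresses it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleSessionEntryDemo {

    /** Hypothetical stand-in for a server session. */
    static final class FakeSession {
        final String name;
        volatile boolean closed;

        FakeSession(String name) {
            this.name = name;
        }
    }

    // Hypothetical stand-in for the server's sessions map.
    static final Map<String, FakeSession> sessions = new ConcurrentHashMap<>();

    // Close path (e.g. triggered by a failure listener): mark closed, then unregister.
    static void close(FakeSession s) {
        s.closed = true;
        sessions.remove(s.name, s); // no-op if put() has not happened yet
    }

    // Naive ordering: put unconditionally after construction.
    static void registerNaive(FakeSession s) {
        sessions.put(s.name, s);
    }

    // Guarded ordering (assumed pattern): put, then undo if closed concurrently.
    static void registerGuarded(FakeSession s) {
        sessions.put(s.name, s);
        if (s.closed) {
            sessions.remove(s.name, s);
        }
    }

    /** Returns true if a closed session is left behind in the map. */
    static boolean staleEntryLeft(boolean guarded) {
        sessions.clear();
        FakeSession s = new FakeSession("session-1");
        close(s); // the failure listener fires before the put...
        if (guarded) {
            registerGuarded(s);
        } else {
            registerNaive(s); // ...so the closed session is registered afterwards
        }
        FakeSession entry = sessions.get("session-1");
        return entry != null && entry.closed;
    }

    public static void main(String[] args) {
        System.out.println("naive stale entry:   " + staleEntryLeft(false));
        System.out.println("guarded stale entry: " + staleEntryLeft(true));
    }
}
```

The naive ordering leaves a closed entry that no later close call will ever remove, which matches the symptom of a session that cannot be closed from the web console.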
The code involved in the race condition is in {{ActiveMQServerImpl}}:
{code:java}
ServerSessionImpl session = new ServerSessionImpl(...);
sessions.put(name, session);
{code}
and in {{AbstractRemotingConnection.callFailureListeners}}.
The attached project [^artemis_consumer_issue.tar] is what I used to reproduce
the simpler case: an Orphaned connection that can still be closed manually.
The {{README.md}} file in the project explains how to reproduce the issue.
This is the general process:
# Modify the broker code to delay notifying the {{failureListeners}} by
adding a {{Thread.sleep()}}.
# Configure a very short TTL in {{broker.xml}} to force the
{{FailureCheckAndFlushThread}} to close the connections.
# Send some messages to a queue while it has no consumers.
# Start the consumers (8 in this case).
# The 8 consumers take longer to handle the messages than the configured TTL.
# The connection is created and then terminated very shortly afterwards.
This triggers re-creation of the sessions and consumers while the
{{failureListeners}} are still blocked in the sleep.
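The timeline of these steps can be approximated with two threads. This is a hypothetical model, not the attached project: it assumes the failure path copies the listener list before the injected {{Thread.sleep()}} and then closes only the sessions in that copy. A {{CountDownLatch}} forces the interleaving that the sleep makes likely, so a session re-created during the sleep is never closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class DelayedFailureListenersDemo {

    /** Hypothetical stand-in for a session registered as a failure listener. */
    static final class FakeSession {
        volatile boolean closed;
    }

    // Hypothetical stand-in for the connection's failure listener list.
    static final List<FakeSession> failureListeners = new CopyOnWriteArrayList<>();

    /** Returns how many registered sessions are still open after the failure path ends. */
    static int run(long delayMillis) throws InterruptedException {
        failureListeners.clear();
        failureListeners.add(new FakeSession()); // session that existed before the failure

        CountDownLatch snapshotTaken = new CountDownLatch(1);

        Thread failureThread = new Thread(() -> {
            // Assumption: the failure path snapshots the listeners, then is delayed
            // (this mimics the Thread.sleep() injected for the reproduction).
            List<FakeSession> snapshot = new ArrayList<>(failureListeners);
            snapshotTaken.countDown();
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Only sessions present in the snapshot are closed.
            for (FakeSession s : snapshot) {
                s.closed = true;
            }
        });
        failureThread.start();

        // The client re-creates a session while the failure thread is still sleeping.
        snapshotTaken.await(5, TimeUnit.SECONDS);
        failureListeners.add(new FakeSession());

        failureThread.join();
        int open = 0;
        for (FakeSession s : failureListeners) {
            if (!s.closed) {
                open++;
            }
        }
        return open;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sessions left open: " + run(200));
    }
}
```

The original session is closed, but the re-created one is left open on a dead connection, which corresponds to the Orphaned consumer described above. In the real broker the sleep only widens a window that exists without it.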
was:
h3. *Problem Description*
Under heavy load, queues get stuck and their consumers stop receiving
messages.
The cause appears to be a single Orphaned consumer that still holds messages in
transit, so no further messages are processed by the other consumers.
This is how it looks on our web console:
!image-2025-10-31-09-08-39-117.png|width=717,height=157!
This seems to be related to issue ARTEMIS-4476.
The added visibility from that issue helped us understand the problem, but it
did not fix it.
----
h3. *Technical Information*
After some investigation, I found that the issue is a race condition
between closing a connection and creating a new session.
I have created an example application that reproduces one part of the issue.
It shows that {{ServerSessionImpl.connectionFailed}} can miss the connection
failure event: the connection is closed before the session has been registered
in the connection's {{AbstractRemotingConnection.failureListeners}}.
In such cases an Orphaned session and consumer block the queue.
In more severe cases (as in our production environment), the session is
registered in {{failureListeners}} but has not yet been added to the sessions
map. The session is closed first and only afterwards added to the sessions map,
where it stays: it cannot be removed by manually closing the session from the
web console.
The code involved in the race condition is in {{ActiveMQServerImpl}}:
{code:java}
ServerSessionImpl session = new ServerSessionImpl(...);
sessions.put(name, session);
{code}
and in {{AbstractRemotingConnection.callFailureListeners}}.
The attached project [^artemis_consumer_issue.tar] is what I used to reproduce
the simpler case: an Orphaned connection that can still be closed manually.
The *README.md* file in the project explains how to reproduce the issue.
This is the general process:
# Modify the broker code to delay notifying the *failureListeners* by adding
a sleep.
# Configure a very short TTL in the broker.xml to force the
*FailureCheckAndFlushThread* to close the connections.
# Send some messages to a queue while it has no consumers.
# Start the consumers (8 in this case).
# The 8 consumers take longer to handle the messages than the configured TTL.
# The connection is created and then terminated very shortly afterwards.
This triggers re-creation of the sessions and consumers while the
*failureListeners* are still blocked in the sleep.
> Queue is stuck with Orphaned consumer and no longer consumes messages
> ---------------------------------------------------------------------
>
> Key: ARTEMIS-5735
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5735
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.35.0
> Reporter: Michael Shemesh
> Priority: Major
> Attachments: artemis_consumer_issue.tar,
> image-2025-10-31-09-08-39-117.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)