Markus Meierhofer created ARTEMIS-3174:
------------------------------------------
Summary: ServerSessionImpl doesn't set new connection when
transferring session
Key: ARTEMIS-3174
URL: https://issues.apache.org/jira/browse/ARTEMIS-3174
Project: ActiveMQ Artemis
Issue Type: Bug
Components: Broker
Affects Versions: 2.17.0, 2.16.0
Environment: We are using CORE connections between the broker and our
clients.
Our clients are using the JMS abstraction layer.
The clients are using the following options:
* confirmationWindowSize of 1024*1024 (=1MiB)
* connectionTTL of 60 seconds
* reconnectAttempts and initialConnectAttempts set to -1 (infinite retry)
We found the issue with Artemis version 2.16.0, but I checked latest master and
the bug still exists there.
Reporter: Markus Meierhofer
h3. How the issue was discovered
Because of a different bug in the broker where sometimes, closing a connection
didn't clean up the connected sessions
(https://issues.apache.org/jira/browse/ARTEMIS-2870), we implemented a
"monitor" process running alongside the broker which, via JMX remote
management, checks the existing connection IDs and compares them with the
connection IDs it retrieved from existing consumers/producers. When a
connectionID of a consumer/producer isn't contained in the list of active
connection IDs, the monitor process closed the session attached to the
consumer/producer.
At a customer site, we had the issue that our monitor process seemingly closed
sessions of alive connections and not closed ones. This was due to the fact
that, after a connection failover by the client, the consumer/producer JSON
object retrieved via JMX still reported to be using the old connection ID
although the consumer/producer was already using the new connection (when
checking the logs of the artemis client after failover).
The scenario is described down below. It shows the logs of the artemis client
on the one hand and of our "monitor" process connected to the broker via JMX on
the other hand, which every 10 seconds prints all connection IDs and the
consumer/producerInfo retrieved.
At 10:36:45, the artemis client recognized a connection failure:
{noformat}
[WARN 2021-03-09 10:36:51,522 l-threads)
org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to
fms/10.221.5.49:61616 has been detected: AMQ219014: Timed out after waiting
4,000 ms for response when sending packet 71
[code=CONNECTION_TIMEDOUT]{noformat}
On the monitor process, you can see that there are 3 connection IDs (b20b08e0
is the old connection of the client which failed):
{noformat}
[DEBUG 2021-03-09 10:36:45,858 cheduler-1
fms.jmx.ClearDeadBrokerSessionsScheduler]: Found connectionIDs: [b20b08e0,
8080f955, 5dc2dff9]
Found consumers:
...,{"consumerID":0,"connectionID":"b20b08e0","sessionID":"237f52c1-8039-11eb-b5ae-2acf578dd699","queueName":"incubed.queue.robot_1000.robot_command","browseOnly":false,"creationTime":1615226846214,"deliveringCount":0},...
{noformat}
At 10:37:09 (25 seconds later), the client is able to successfully fail over.
It seems that he is doing session reattachment instead of session recreate:
{noformat}
[ERROR 2021-03-09 10:37:09,307 ead-289224
nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved
exception!
{noformat}
The monitor process now shows 4 connections (4cc50064 is the new connection of
the client). Also note that there are still consumers (and their sessions)
which report the old connection ID. But these sessions are actually used by the
client (he has successfully transferred the sessions). We conclude this from
the fact that there are NO consumers/producers with the new connection ID, but
the client is able to communicate fully with the broker after successful
failover. Also, the session IDs do not change (which indicates a session
transfer):
{noformat}
[DEBUG 2021-03-09 10:37:15,900 cheduler-9
fms.jmx.ClearDeadBrokerSessionsScheduler]: Found connectionIDs: [4cc50064,
b20b08e0, 8080f955, 5dc2dff9]
[DEBUG 2021-03-09 10:37:15,901 cheduler-9
fms.jmx.ClearDeadBrokerSessionsScheduler]: Found
consumers:...,{"consumerID":0,"connectionID":"b20b08e0","sessionID":"237f52c1-8039-11eb-b5ae-2acf578dd699","queueName":"incubed.queue.robot_1000.robot_command","browseOnly":false,"creationTime":1615226846214,"deliveringCount":0},...
{noformat}
~60 seconds after the failure of the old connection, the broker closes the
connection. Afterwards, the monitor process only shows 3 connection (b20b08e0
was closed). But you can see that there are still consumers reporting the old
connection ID (these are consumers of sessions that have actually been
transferred to the new connection):
{noformat}
[DEBUG 2021-03-09 10:37:55,973 heduler-10
fms.jmx.ClearDeadBrokerSessionsScheduler]: Found connectionIDs: [4cc50064,
8080f955, 5dc2dff9]
[DEBUG 2021-03-09 10:37:55,973 heduler-10
fms.jmx.ClearDeadBrokerSessionsScheduler]: Found consumers:
...,{"consumerID":0,"connectionID":"b20b08e0","sessionID":"237f52c1-8039-11eb-b5ae-2acf578dd699","queueName":"incubed.queue.robot_1000.robot_command","browseOnly":false,"creationTime":1615226846214,"deliveringCount":0},...
{noformat}
Afterwards, our monitor process goes on closing the sessions of the consumers
with the old connection ID, although they were actually used by the new
connection. This effectively breaks the connection of the artemis client.
h3. The bug
After looking through the code, we found a problem during session reattachment:
In "internaltransferConnection" of "ServerSessionPacketHandler", the session is
transferred from the old to the new connection. It seems mostly correct as the
new connection is set both on the channel and on the ServerSessionPacketHandler
itself. But on the "session" member variable (an instance of
ServerSessionImpl), the new connection is not set, although ServerSessionImpl
has a field "remotingConnection". In fact, the remotingConnectionField is only
set once in the constructor, which cannot be correct if the ServerSessionImpl
object is reused from the old to the new connection.
Among other usages, the remotingConnection field of ServerSessionImpl is used
to retrieve the connectionID of the session's connection, which in the end is
used by the JMX server control (ActiveMQServerControlImpl) in order to build a
JSON representation of a consumer.
h3. A simple solution
The most direct solution would be to set the new connection on the
remotingConnection field of ServerSessionImpl. I will soon post a pull request
here containing this change.
There are some considerations though:
- Do we need to care about locking when setting the field?
- The "remotingConnection" field is currently final in ServerSessionImpl, and
always was since the initial HornetQ donation, I'm not sure what the intention
behind it was. But on the other hand, the session reattachment also worked the
same way, so the bug of keeping the old connection on the ServerSessionImpl
also existed back then.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)