[jira] [Created] (ARTEMIS-5086) Cluster connection randomly fails and stop message redistribution

Jean-Pascal Briquet (Jira) Thu, 03 Oct 2024 08:01:10 -0700

Jean-Pascal Briquet created ARTEMIS-5086:
--------------------------------------------


             Summary: Cluster connection randomly fails and stop message 
redistribution
                 Key: ARTEMIS-5086
                 URL: https://issues.apache.org/jira/browse/ARTEMIS-5086
             Project: ActiveMQ Artemis
          Issue Type: Bug
          Components: Broker, Clustering
    Affects Versions: 2.36.0, 2.35.0, 2.30.0
            Reporter: Jean-Pascal Briquet
         Attachments: cluster-connections-stop.log

*Context:*
In a cluster of 3 primary/backup pairs, the cluster connection occasionally 
fails and does not recover automatically.
The problem occurs at random, roughly once every few weeks.
When cluster connectivity is degraded, the message flow between brokers stops 
and message redistribution is interrupted.
Not all cluster nodes are necessarily affected: some keep full cluster 
connectivity, others are partially affected, and some lose all connectivity.
No errors are visible in the logs when the issue occurs.

 

*Workaround:*
An operator has to stop and start the cluster connection through JMX 
management.
This means that message redistribution can stay interrupted for a potentially 
long time, until the cluster connection is manually restarted.
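
For reference, the manual stop/start could also be scripted with the standard 
javax.management API. This is only a sketch under assumptions: the class and 
method names are hypothetical, the ObjectName pattern is what I would expect 
for a cluster-connection MBean and should be checked against what JConsole 
shows for your broker, and a secured broker will also need credentials passed 
to the connector.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Sketch of the manual workaround. Names and the ObjectName pattern are assumptions. */
final class ClusterConnectionRestart {

    /** Builds the assumed MBean name of a cluster connection; verify it in JConsole. */
    static ObjectName objectNameFor(String brokerName, String clusterConnectionName) {
        try {
            return new ObjectName("org.apache.activemq.artemis:broker="
                    + ObjectName.quote(brokerName)
                    + ",component=cluster-connections,cluster-connection="
                    + ObjectName.quote(clusterConnectionName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Invokes "stop" then "start", mirroring what the operator does by hand. */
    static void restart(MBeanServerConnection connection, ObjectName name) throws Exception {
        connection.invoke(name, "stop", null, null);
        connection.invoke(name, "start", null, null);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: <jmx-service-url> <broker-name> <cluster-connection-name>");
            return;
        }
        JMXServiceURL url = new JMXServiceURL(args[0]);
        // The connector is closed automatically when the block exits.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            restart(connector.getMBeanServerConnection(), objectNameFor(args[1], args[2]));
        }
    }
}
```

This only automates the workaround; it does not explain why the connection 
stays down.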

 

*How to recognize the problem:*
The cluster-connections JMX panel indicates that:
- the cluster connection is started
- the topology is correct and contains all nodes (3 members, 6 nodes)
- the nodes field is either empty or contains only one entry (instead of two 
when everything works). In my opinion, this is the main indicator: when 
everything works, nodes should equal the number of members in the topology 
minus 1
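
The "nodes = members - 1" rule could be turned into a simple monitoring check. 
A minimal sketch, assuming the two counts have already been read from the 
cluster-connection MBean; the class, enum, and method names are hypothetical 
and not part of Artemis.

```java
/** Hypothetical helper, not part of Artemis: classifies health from two counts. */
final class ClusterConnectivityCheck {

    enum Status { HEALTHY, PARTIAL, DISCONNECTED }

    /**
     * @param topologyMembers number of members listed in the topology (3 in this cluster)
     * @param connectedNodes  number of entries in the "nodes" field
     */
    static Status classify(int topologyMembers, int connectedNodes) {
        if (topologyMembers < 1 || connectedNodes < 0) {
            throw new IllegalArgumentException("counts out of range");
        }
        int expected = topologyMembers - 1; // every member except the local one
        if (connectedNodes >= expected) {
            return Status.HEALTHY;
        }
        return connectedNodes == 0 ? Status.DISCONNECTED : Status.PARTIAL;
    }

    public static void main(String[] args) {
        // The three states observed in this report, for a 3-member topology.
        for (int nodes = 0; nodes <= 2; nodes++) {
            System.out.println("members=3 nodes=" + nodes + " -> " + classify(3, nodes));
        }
    }
}
```

Since no errors are logged when the issue occurs, alerting on anything other 
than HEALTHY would at least shorten the time before an operator restarts the 
cluster connection.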

 

*Consequences:*
- Messages are stuck in $.artemis.internal.sf.artemis-cluster.* queues until 
the cluster connection is restarted.
- Messages are stuck in notif.* queues until the cluster connection is restarted
- Consumers are starved as message redistribution is broken

 

*Potential trigger?*
I have observed this issue several times over the past months, but 
unfortunately, I don't have a reproduction case.
I would have preferred something more predictable but it seems to be a random 
problem.

When the issue occurred this week, I noticed a strange coincidence: we had 
deployed a configuration change (addition of 10 new addresses) at the same time 
on two different clusters.
Configuration refresh is enabled, and during the upgrade process we touch 
broker.xml to trigger the config reload (so 6 * 2 nodes had their configuration 
reloaded).
On both clusters, one node had correct cluster connectivity (nodes=2), one node 
had only one connection (nodes=1), and one node had no connections at all 
(nodes=0).
Maybe I'm wrong, but the fact that it happened on two clusters after the same 
operation makes me think the two may be related.


Please note that most of the time the config reload is working very well and it 
does not impact cluster-connections.

 

*Investigation:*
Since I don't have a clear reproduction scenario, I checked the code to 
understand when the "ClusterConnectionImpl.getNodes()" could return an empty 
list.
It seems that nodes are not listed when:
- record list is empty, or
- record list has elements but session is null, or
- record list has elements but forward connection is null
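
The three conditions above can be illustrated with a simplified, self-contained 
model. This is not the Artemis source: Bridge, Session, and Connection below 
are hypothetical stand-ins, and the node names are made up. It only shows how 
records without a forward connection would silently drop out of the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of the conditions listed above; NOT the Artemis implementation. */
final class GetNodesModel {

    // Hypothetical stand-ins for the real connection, session, and bridge types.
    static final class Connection {
        final String remoteAddress;
        Connection(String remoteAddress) { this.remoteAddress = remoteAddress; }
    }

    static final class Session {
        final Connection connection; // may be null
        Session(Connection connection) { this.connection = connection; }
    }

    static final class Bridge {
        final Session session; // may be null
        Bridge(Session session) { this.session = session; }

        /** Null when there is no session, or the session has no connection. */
        Connection forwardingConnection() {
            return session == null ? null : session.connection;
        }
    }

    /** Lists only the records whose bridge has a non-null forward connection. */
    static Map<String, String> getNodes(Map<String, Bridge> records) {
        Map<String, String> nodes = new LinkedHashMap<>();
        for (Map.Entry<String, Bridge> entry : records.entrySet()) {
            Connection forward = entry.getValue().forwardingConnection();
            if (forward != null) { // records without a forward connection are skipped
                nodes.put(entry.getKey(), forward.remoteAddress);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Bridge> records = new LinkedHashMap<>();
        records.put("node-b", new Bridge(new Session(null))); // session present, no connection
        records.put("node-c", new Bridge(null));              // no session at all
        System.out.println(getNodes(records).size() + " of " + records.size() + " records listed");
    }
}
```

In this model, a record list with two entries can still produce an empty nodes 
map, with no exception or log entry along the way.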

During the last incident, we enabled TRACE level on packages 
"org.apache.activemq.artemis.core.server.cluster" and 
"org.apache.activemq.artemis.core.client".


When we performed the stop operation on the cluster connection, the traces 
indicated that:
- the record list had two entries (2 bridges, which is good)
- the session had a value (not sure about sessionConsumer)
- the forward connection is the last element that could be null
These stop traces are attached (cluster-connections-stop.log) if you want to 
review them.

Based on that, I believe the list was empty because: "forward connection was 
null".

getNodes() contains a specific null check for the forward connection, so it 
seems that this null state can occur occasionally. When could it happen?


I would expect the bridge auto-reconnection logic to restore the connection, 
but it does not seem to detect the problem, as the connection never recovers.

 

Sorry if this is a bit vague. If you have tips for further investigation, I 
would be happy to try them and provide more information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

