[jira] [Commented] (ARTEMIS-5086) Cluster connection randomly fails and stop message redistribution

Jira Tue, 12 May 2026 00:52:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080246#comment-18080246
 ]


Marcin Wołoszczuk commented on ARTEMIS-5086:
--------------------------------------------

[~jpbriquet] did you eventually manage to solve this issue? I ran into it 
recently and didn't get any useful support from the team on the Slack channel. 
I tried what you suggested and disabled auto queue deletion, and I also 
statically define addresses and queues. This seems to have solved the issue 
with messages not being redistributed. I validated this by inspecting 
Bindings / Remote Bindings via JMX in the web console. With auto 
created/deleted queues, Remote Bindings used to go blank or show only the 
local node. Now it retains all 3 nodes.

> Cluster connection randomly fails and stop message redistribution
> -----------------------------------------------------------------
>
>                 Key: ARTEMIS-5086
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-5086
>             Project: Artemis
>          Issue Type: Bug
>          Components: Broker, Clustering
>    Affects Versions: 2.30.0, 2.35.0, 2.36.0
>            Reporter: Jean-Pascal Briquet
>            Priority: Major
>         Attachments: address-settings.xml, cluster-connections-stop.log, 
> image-2024-10-08-14-26-51-937.png, image-2024-11-21-11-04-58-242.png, 
> image-2024-11-21-11-08-16-869.png, message-events-during-incident-1.log, 
> pr21-broker.xml
>
>
> h3. *Context*
> In a cluster of 3 primary/backup pairs, it can happen that the cluster 
> connection randomly fails and does not automatically recover.
> The problem occurs at irregular intervals; it can happen once every few 
> weeks.
> When cluster-connectivity is degraded, it stops the message flow between 
> brokers and interrupts the message redistribution.
> Not all cluster nodes may be affected, some may still maintain 
> cluster-connectivity, while others are partially affected, and some can lose 
> all connectivity.
> There are no errors visible in logs when the issue occurs.
> h3. *Workaround*
> +Disable auto-deletion+
> Set config-delete-addresses and config-delete-queues to OFF in address 
> settings configuration.
> Remove unneeded queues via JMX or through the administration console, until a 
> correction is available.
>  
> +Restarting cluster-connections+
> The flow can be restored if an operator stops and starts the 
> cluster-connection via JMX management.
> This means that the message redistribution can be interrupted for a 
> potentially long time until it is manually restarted.
> h3. *How to recognize the problem*
> The cluster-connections JMX panel indicates that:
>  - cluster-connectivity is started
>  - topology is correct and contains all nodes (3 members, 6 nodes)
>  - the nodes field is either empty or contains only one entry (instead of 
> two when everything works). In my opinion, this is the main indicator: when 
> it works well, nodes should be equal to "members in topology - 1"
> The following log appears and is related to the deletion of the 
> cluster-connection queue during a configuration hot-reload :
>  
> {code:java}
> AMQ224077: Undeploying queue $.artemis.internal.sf.......... {code}
>  
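
As a rough illustration of the "members in topology - 1" rule above, the check 
could be sketched as follows. The class and method names are hypothetical, and 
the two counts are assumed to have been read from the cluster-connections JMX 
panel beforehand; nothing here is an Artemis API.

```java
final class ClusterConnectionCheck {

    // Expected size of the "nodes" field: every topology member except the local one.
    static int expectedNodes(int membersInTopology) {
        return Math.max(0, membersInTopology - 1);
    }

    // True when fewer nodes are reported than the topology implies.
    static boolean isDegraded(int membersInTopology, int nodesReported) {
        return nodesReported < expectedNodes(membersInTopology);
    }

    public static void main(String[] args) {
        // Node counts reported for primary-1, primary-2 and primary-3 during the incident.
        int[] nodesReported = {0, 2, 1};
        for (int i = 0; i < nodesReported.length; i++) {
            System.out.println("primary-" + (i + 1) + " degraded: "
                    + isDegraded(3, nodesReported[i]));
        }
    }
}
```

With 3 members in the topology, only a broker reporting 2 nodes would be 
treated as healthy by this check.
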
> h3. *Consequences*
>  - Messages are stuck in {{$.artemis.internal.sf.artemis-cluster.*}} queues 
> until the cluster connection is restarted.
>  - Messages are stuck in {{notif.*}} queues until the cluster connection is 
> restarted
>  - Consumers are starved as message redistribution is broken
>  - Stuck messages are lost when the cluster-connection is restarted
>  
> h3. *Reproduction scenarios*
> +Configuration+
>  - Artemis cluster with node-1, node-2 and node-3
>  - Configuration reload enabled
> {code:java}
> <configuration-file-refresh-period>5000</configuration-file-refresh-period>{code}
>  - Address-settings containing automatically removable queue configuration
> {code:java}
>   <address-setting match="queue.#">
>     <config-delete-addresses>FORCE</config-delete-addresses>
>     <config-delete-queues>FORCE</config-delete-queues>
>   </address-setting>{code}
>  - Addresses defined
> {code:java}
>   <addresses xmlns="urn:activemq:core">
>     <address name="queue.A">
>         <anycast>
>             <queue name="queue.A"/>
>         </anycast>
>     </address>
>     <address name="queue.B">
>         <anycast>
>             <queue name="queue.B"/>
>         </anycast>
>     </address>
>   </addresses>{code}
>  
> *+Cluster-connection broken reproduction scenario :+*
>  * Start the Artemis cluster with node-1, node-2 and node-3
>  * Remove the "queue.B" address and associated queue from configuration file.
>  * Touch broker.xml (if config is managed externally) to trigger reload
>  * Once the configuration is reloaded:
>  ** $.artemis.internal.sf queues are removed
>  ** cluster-connection bridges are disconnected and enter an inconsistent 
> state (refer to the investigation section below for details).
> +Logs:+
> {code:java}
> 2024-11-25 08:14:43,772 INFO  [org.apache.activemq.artemis.core.server] 
> AMQ221056: Reloading configuration: addresses
> 2024-11-25 08:14:43,773 INFO  [org.apache.activemq.artemis.core.server] 
> AMQ224077: Undeploying queue 
> $.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f
> 2024-11-25 08:14:43,786 INFO  [org.apache.activemq.artemis.core.server] 
> AMQ224077: Undeploying queue 
> $.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015
> 2024-11-25 08:14:43,796 INFO  [org.apache.activemq.artemis.core.server] 
> AMQ224077: Undeploying queue queue.B
> 2024-11-25 08:14:43,802 INFO  [org.apache.activemq.artemis.core.server] 
> AMQ224076: Undeploying address queue.B{code}
>  
> +*Message loss reproduction scenario:*+
>  * Create a consumer connected to queue.A on node-2
>  * Follow the same steps as in the "Cluster-connection broken" case
>  * New messages sent toward queue.A on node-1 are immediately redistributed 
> and acknowledged.
>  * These messages never arrive at node-2 and are not in the local node-1 
> queues, leading to message loss.
>  
> h3. *Root cause analysis*
> The configuration reload removes the $.artemis.internal.sf queues by mistake 
> when an address is deleted from the configuration.
> This action triggers the disconnect() method on the cluster connection 
> bridge, leaving it in an inconsistent state and causing message loss.
> Call stack:
> {code:java}
> ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration
> -> ActiveMQServerImpl.listQueues(address)
> ---> PostOfficeImpl.listQueuesForAddress{code}
> listQueuesForAddress returns queues based on local bindings AND remote 
> bindings:
> {code:java}
> 0 = {QueueImpl@10860} 
> "QueueImpl[name=$.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015,
>  postOffice=PostOfficeImpl 
> [server=ActiveMQServerImpl::name=artemis-dc1-primary-1], temp=false]@77dd8218"
> 1 = {QueueImpl@12358} "QueueImpl[name=queue.B, postOffice=PostOfficeImpl 
> [server=ActiveMQServerImpl::name=artemis-dc1-primary-1], temp=false]@7d207349"
> 2 = {QueueImpl@12359} 
> "QueueImpl[name=$.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f,
>  postOffice=PostOfficeImpl 
> [server=ActiveMQServerImpl::name=artemis-dc1-primary-1], 
> temp=false]@6e1ab32f"{code}
> *Possible fixes:*
> +Filter SNF queues+
>  - Cluster snf queues should be filtered from listQueuesForAddress results. 
> This will prevent the configuration reload process from removing these queues.
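
A minimal sketch of the filtering idea above, working on queue names only. The 
prefix is taken from the log lines in this report; the class and method names 
are hypothetical and this is not the actual Artemis change.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SnfQueueFilter {

    // Prefix of the cluster store-and-forward queues as seen in the AMQ224077 log lines.
    static final String SNF_PREFIX = "$.artemis.internal.sf.";

    static boolean isStoreAndForwardQueue(String queueName) {
        return queueName != null && queueName.startsWith(SNF_PREFIX);
    }

    // Keeps only the queues that the configuration reload may consider for undeployment.
    static List<String> undeployCandidates(List<String> queuesForAddress) {
        return queuesForAddress.stream()
                .filter(name -> !isStoreAndForwardQueue(name))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Names as returned for the deleted address in the debugger output above.
        List<String> listed = List.of(
                "$.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015",
                "queue.B",
                "$.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f");
        System.out.println(undeployCandidates(listed));
    }
}
```
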
> +Bridge robustness+
> The "disconnect()" method sets the bridge into an inconsistent state.
> Ideally, it should properly update the bridge state to stopped and trigger a 
> reconnection attempt to restore bridge behaviour.
>  
> h3. *Investigation*
> h4. Investigation (2024-10-03)
> Since I don't have a clear reproduction scenario, I checked the code to 
> understand when the {{ClusterConnectionImpl.getNodes()}} could return an 
> empty list.
> It seems that nodes are not listed when:
>  - record list is empty, or
>  - record list has elements but session is null, or
>  - record list has elements but forward connection is null
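
The three conditions above can be modelled with a simplified sketch. This is 
not the Artemis source: BridgeRecord and listNodes are hypothetical stand-ins, 
assuming a node is listed only when its record has both a session and a 
forward connection.

```java
import java.util.ArrayList;
import java.util.List;

final class NodeListingSketch {

    // Hypothetical stand-in for a cluster-connection record; not the Artemis class.
    static final class BridgeRecord {
        final String nodeId;
        final Object session;           // null when the bridge has no session
        final Object forwardConnection; // null when there is no forward connection

        BridgeRecord(String nodeId, Object session, Object forwardConnection) {
            this.nodeId = nodeId;
            this.session = session;
            this.forwardConnection = forwardConnection;
        }
    }

    // Lists a node only when its record has a session and a forward connection.
    static List<String> listNodes(List<BridgeRecord> records) {
        List<String> nodes = new ArrayList<>();
        for (BridgeRecord record : records) {
            if (record.session != null && record.forwardConnection != null) {
                nodes.add(record.nodeId);
            }
        }
        return nodes;
    }
}
```

Under this model an empty result does not imply an empty record list, which 
matches the traces showing two records while no nodes were reported.
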
> During the last incident, we have enabled TRACE level on:
>  * {{org.apache.activemq.artemis.core.server.cluster}}
>  * {{org.apache.activemq.artemis.core.client}}
> When we performed the stop operation on cluster-connections the traces 
> indicated that:
>  - record list had two entries (2 bridges, which is good)
>  - {{session}} had a value (not sure about {{{}sessionConsumer{}}})
>  - forward connection is the last element that could be null
> These stop traces are provided in attachment, if you want to review them.
> Based on that, I believe the list was empty because: "forward connection was 
> null".
> The {{getNodes}} method contains a specific null check for the forward 
> connection, so it seems that this null state can occur occasionally. When 
> could it happen?
> I would expect the bridge auto-reconnection logic to restore the connection, 
> but it does not seem to detect the problem, as it never recovers.
> Sorry it is a bit vague, but if you have tips for further investigation, I 
> would be happy to try and provide more information.
>  
> *Investigation Update (2024-11-21)*
> The problem occurred again, and I now have several heap dumps of Artemis 
> nodes. Within these heap dumps, I have seen that:
> +ClusterConnectionImpl State looks good:+
>  * ClusterConnectionImpl.status is started
>  * ClusterConnection.records contains 2 entries
> +Record details state is valid:+
> !image-2024-11-21-11-04-58-242.png|width=655,height=261!
> +*Bridge state does not look good*: it is started but it has no session+
> !image-2024-11-21-11-08-16-869.png|width=1006,height=864!
> The internal producer is marked as closed and its session is stopped.
>  
> Upon reviewing the Artemis code, I found that the sessionConsumer and the 
> session can be cleared, without the state being altered, by the 
> BridgeImpl.disconnect() method.
> This method can be invoked following the deletion of the 
> {{$.artemis.internal.sf.artemis-cluster.}} queue performed in 
> ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration()
>  
> {code:java}
> AMQ224077: Undeploying queue $.artemis.internal.sf.......... {code}
>  
> I now suspect that the cluster connection failure is triggered by an 
> automatic configuration refresh event.
> Each configuration refresh seems to have a chance of deleting the 
> "$.artemis.internal" queues.
> It is not systematic: during the past weeks we had 40 successful config 
> refreshes without the queue being removed.
> Could the AddressSettings or AddressInfo be corrupted, causing this queue to 
> be flagged as removable within 
> ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration() method ?
>  
> *Investigation update (2024-11-25)*
> Added a root cause analysis, reproduction scenario and workaround.
>  
> *Grafana visualisation of the depth of notif.* queue when the incident 
> occurred:*
>  * primary-1 had 0 cluster-connection nodes
>  * primary-2 had 2 cluster-connection nodes
>  * primary-3 had 1 cluster-connection node
> !image-2024-10-08-14-26-51-937.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

