[
https://issues.apache.org/jira/browse/ARTEMIS-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080246#comment-18080246
]
Marcin Wołoszczuk commented on ARTEMIS-5086:
--------------------------------------------
[~jpbriquet] did you manage to solve this issue in the end? I ran into it
recently and didn't get any useful support from the team on the Slack channel.
I tried what you suggested and disabled auto queue deletion, and I also define
addresses + queues statically. This seems to have solved the issue with
messages not being redistributed. I validated this by inspecting Bindings /
Remote Bindings via JMX in the web console. For auto created/deleted queues the
list used to go blank or show only the local node. Now it retains all 3 nodes.
> Cluster connection randomly fails and stop message redistribution
> -----------------------------------------------------------------
>
> Key: ARTEMIS-5086
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5086
> Project: Artemis
> Issue Type: Bug
> Components: Broker, Clustering
> Affects Versions: 2.30.0, 2.35.0, 2.36.0
> Reporter: Jean-Pascal Briquet
> Priority: Major
> Attachments: address-settings.xml, cluster-connections-stop.log,
> image-2024-10-08-14-26-51-937.png, image-2024-11-21-11-04-58-242.png,
> image-2024-11-21-11-08-16-869.png, message-events-during-incident-1.log,
> pr21-broker.xml
>
>
> h3. *Context*
> In a cluster of 3 primary/backup pairs, it can happen that the cluster
> connection randomly fails and does not automatically recover.
> The frequency of the problem is random; it can happen once every few
> weeks.
> When cluster-connectivity is degraded, it stops the message flow between
> brokers and interrupts the message redistribution.
> Not all cluster nodes may be affected, some may still maintain
> cluster-connectivity, while others are partially affected, and some can lose
> all connectivity.
> There are no errors visible in logs when the issue occurs.
> h3. *Workaround*
> +Disable auto-deletion+
> Set config-delete-addresses and config-delete-queues to OFF in address
> settings configuration.
> Remove unneeded queues via JMX or through the administration console, until a
> correction is available.
>
> +Restarting cluster-connections+
> The flow can be restored if an operator stops and starts the
> cluster-connection through JMX management.
> This means that the message redistribution can be interrupted for a
> potentially long time until it is manually restarted.
> h3. *How to recognize the problem*
> The cluster-connections JMX panel indicates that:
> - cluster-connectivity is started
> - topology is correct and contains all nodes (3 members, 6 nodes)
> - the nodes field is either empty, or contains only one entry (instead of two
> when everything works). In my opinion, this is the main indicator: when it
> works well, nodes should be equal to "members in topology - 1"
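The "members in topology - 1" indicator above can be sketched as a small check. This is only an illustration: the class and method names are hypothetical, and it assumes the member count and the nodes map have already been read from the cluster-connection JMX panel by whatever means is available.

```java
import java.util.Map;

// Hypothetical helper, not part of Artemis: encodes the indicator
// "nodes should be equal to members in topology - 1".
final class ClusterNodesCheck {

    private ClusterNodesCheck() {
    }

    // topologyMembers: number of members reported in the topology (3 in this cluster).
    // nodes: the content of the cluster-connection "nodes" field, keyed by node id.
    static boolean isDegraded(int topologyMembers, Map<String, String> nodes) {
        // Every member except the local one should appear in "nodes".
        int expectedPeers = Math.max(0, topologyMembers - 1);
        return nodes.size() < expectedPeers;
    }
}
```

With 3 members, an empty map or a map with a single entry is reported as degraded, while a map with two entries is considered healthy.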
> The following log appears and is related to the deletion of the
> cluster-connection queue during a configuration hot-reload :
>
> {code:java}
> AMQ224077: Undeploying queue $.artemis.internal.sf.......... {code}
>
> h3. *Consequences*
> - Messages are stuck in {{$.artemis.internal.sf.artemis-cluster.*}} queues
> until the cluster connection is restarted.
> - Messages are stuck in {{notif.*}} queues until the cluster connection is
> restarted
> - Consumers are starved as message redistribution is broken
> - Stuck messages are lost when the cluster-connection is restarted
>
> h3. *Reproduction scenarios*
> +Configuration+
> - Artemis cluster with node-1, node-2 and node-3
> - Configuration reload enabled
> {code:xml}
> <configuration-file-refresh-period>5000</configuration-file-refresh-period>{code}
> - Address-settings containing automatically removable queue configuration
> {code:xml}
> <address-setting match="queue.#">
>   <config-delete-addresses>FORCE</config-delete-addresses>
>   <config-delete-queues>FORCE</config-delete-queues>
> </address-setting>{code}
> - Addresses defined
> {code:xml}
> <addresses xmlns="urn:activemq:core">
>   <address name="queue.A">
>     <anycast>
>       <queue name="queue.A"/>
>     </anycast>
>   </address>
>   <address name="queue.B">
>     <anycast>
>       <queue name="queue.B"/>
>     </anycast>
>   </address>
> </addresses>{code}
>
> *+Cluster-connection broken reproduction scenario :+*
> * Start the Artemis cluster with node-1, node-2 and node-3
> * Remove the "queue.B" address and associated queue from configuration file.
> * Touch broker.xml (if config is managed externally) to trigger reload
> * Once the configuration is reloaded:
> ** $.artemis.internal.sf queues are removed
> ** cluster-connection bridges are disconnected and enter an inconsistent
> state (refer to the investigation section below for details).
> +Logs:+
> {code:java}
> 2024-11-25 08:14:43,772 INFO [org.apache.activemq.artemis.core.server]
> AMQ221056: Reloading configuration: addresses
> 2024-11-25 08:14:43,773 INFO [org.apache.activemq.artemis.core.server]
> AMQ224077: Undeploying queue
> $.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f
> 2024-11-25 08:14:43,786 INFO [org.apache.activemq.artemis.core.server]
> AMQ224077: Undeploying queue
> $.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015
> 2024-11-25 08:14:43,796 INFO [org.apache.activemq.artemis.core.server]
> AMQ224077: Undeploying queue queue.B
> 2024-11-25 08:14:43,802 INFO [org.apache.activemq.artemis.core.server]
> AMQ224076: Undeploying address queue.B{code}
>
> +*Message loss reproduction scenario:*+
> * Create a consumer connected to queue.A on node-2
> * Follow the same steps as in the "Cluster-connection broken" case
> * New messages sent toward queue.A on node-1 are immediately redistributed
> and acknowledged.
> * These messages never arrive at node-2 and are not in the local node-1
> queues, leading to message loss.
>
> h3. *Root cause analysis*
> The configuration reload removes $.artemis.internal.sf queues by mistake
> when an address is deleted from the configuration.
> This action triggers the disconnect() method on the cluster connection
> bridge, leaving it in an inconsistent state and causing message loss.
> Call stack:
> {code:java}
> ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration
> -> ActiveMQServerImpl.listQueues(address)
> ---> PostOfficeImpl.listQueuesForAddress{code}
> listQueuesForAddress returns queues based on local bindings AND remote
> bindings :
> {code:java}
> 0 = {QueueImpl@10860}
> "QueueImpl[name=$.artemis.internal.sf.artemis-cluster.30699af5-a828-11ef-bbe0-0242ac140015,
> postOffice=PostOfficeImpl
> [server=ActiveMQServerImpl::name=artemis-dc1-primary-1], temp=false]@77dd8218"
> 1 = {QueueImpl@12358} "QueueImpl[name=queue.B, postOffice=PostOfficeImpl
> [server=ActiveMQServerImpl::name=artemis-dc1-primary-1], temp=false]@7d207349"
> 2 = {QueueImpl@12359}
> "QueueImpl[name=$.artemis.internal.sf.artemis-cluster.2fe52843-a828-11ef-8368-0242ac14001f,
> postOffice=PostOfficeImpl
> [server=ActiveMQServerImpl::name=artemis-dc1-primary-1],
> temp=false]@6e1ab32f"{code}
> *Possible fixes:*
> +Filter SNF queues+
> - Cluster snf queues should be filtered from listQueuesForAddress results.
> This will prevent the configuration reload process from removing these queues.
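The filtering idea above can be sketched as follows. This is not the actual Artemis fix: the class and method names are hypothetical, the sketch works on plain queue names rather than the broker's queue objects, and it assumes that the "$.artemis.internal.sf." prefix seen in the logs is enough to recognize a store-and-forward queue.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: keep cluster store-and-forward (SNF) queues out of the
// list of queues that a configuration reload is allowed to undeploy.
final class SnfQueueFilter {

    // Prefix taken from the "AMQ224077: Undeploying queue" log lines.
    static final String SNF_PREFIX = "$.artemis.internal.sf.";

    private SnfQueueFilter() {
    }

    static boolean isStoreAndForward(String queueName) {
        return queueName.startsWith(SNF_PREFIX);
    }

    // Returns the queue names for an address without the internal SNF queues.
    static List<String> removableQueues(List<String> queuesForAddress) {
        return queuesForAddress.stream()
                .filter(name -> !isStoreAndForward(name))
                .collect(Collectors.toList());
    }
}
```

Applied to the three queues listed in the debugger output above, only queue.B would remain as a candidate for undeployment.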
> +Bridge robustness+
> The "disconnect()" method sets the bridge into an inconsistent state.
> Ideally, it should properly update the bridge state to stopped and trigger a
> reconnection attempt to restore bridge behaviour.
>
> h3. *Investigation*
> h4. Investigation (2024-10-03)
> Since I don't have a clear reproduction scenario, I checked the code to
> understand when the {{ClusterConnectionImpl.getNodes()}} could return an
> empty list.
> It seems that nodes are not listed when:
> - record list is empty, or
> - record list has elements but session is null, or
> - record list has elements but forward connection is null
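The three conditions above can be restated as a simplified model. This is not the Artemis source: BridgeRecord and its fields are hypothetical stand-ins for the real record, session and forward connection objects, reduced to booleans for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Simplified, hypothetical model of when a node shows up in the "nodes" list.
final class NodesListingModel {

    // Stand-in for one cluster-connection record and its bridge state.
    static final class BridgeRecord {
        final String nodeId;
        final boolean hasSession;
        final boolean hasForwardConnection;

        BridgeRecord(String nodeId, boolean hasSession, boolean hasForwardConnection) {
            this.nodeId = nodeId;
            this.hasSession = hasSession;
            this.hasForwardConnection = hasForwardConnection;
        }
    }

    private NodesListingModel() {
    }

    // A node is listed only if its record has both a session and a forward
    // connection; an empty record list therefore yields an empty result.
    static List<String> listedNodes(List<BridgeRecord> records) {
        return records.stream()
                .filter(r -> r.hasSession && r.hasForwardConnection)
                .map(r -> r.nodeId)
                .collect(Collectors.toList());
    }
}
```

In this model, two records that both have a session but no forward connection produce an empty list, which matches the state suspected in this investigation.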
> During the last incident, we enabled TRACE level on:
> * {{org.apache.activemq.artemis.core.server.cluster}}
> * {{org.apache.activemq.artemis.core.client}}
> When we performed the stop operation on cluster-connections the traces
> indicated that:
> - record list had two entries (2 bridges, which is good)
> - {{session}} had a value (not sure about {{{}sessionConsumer{}}})
> - forward connection is the last element that could be null
> These stop traces are attached, if you want to review them.
> Based on that, I believe the list was empty because: "forward connection was
> null".
> The {{getNodes}} method contains a specific null check for the forward
> connection, so it seems that this null state can occur occasionally. When
> could it happen?
> I would expect the bridge auto-reconnection logic to restore the connection,
> but it does not seem to detect the problem, as it never recovers.
> Sorry it is a bit vague, but if you have tips for further investigation, I
> would be happy to try and provide more information.
>
> *Investigation Update (2024-11-21)*
> The problem occurred again, and I now have several heap dumps of Artemis
> nodes. Within these heap dumps, I have seen that:
> +ClusterConnectionImpl State looks good:+
> * ClusterConnectionImpl.status is started
> * ClusterConnection.records contains 2 entries
> +Record details state is valid:+
> !image-2024-11-21-11-04-58-242.png|width=655,height=261!
> +*Bridge state does not look good*: it is started but it has no session+
> !image-2024-11-21-11-08-16-869.png|width=1006,height=864!
> The internal producer is marked as closed and its session is stopped.
>
> Upon reviewing the Artemis code, I found that clearing the sessionConsumer
> and the session without altering the state can be triggered by the
> BridgeImpl.disconnect() method.
> This method can be invoked following the deletion of the
> {{$.artemis.internal.sf.artemis-cluster.}} queue performed in
> ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration()
>
> {code:java}
> AMQ224077: Undeploying queue $.artemis.internal.sf.......... {code}
>
> I now suspect that the cluster connection failing is triggered by a
> configuration automatic refresh event.
> Each configuration refresh seems to have a chance of deleting the
> "$.artemis.internal" queues.
> It is not systematic: during the past weeks we had 40 successful config
> refreshes without the queue being removed.
> Could the AddressSettings or AddressInfo be corrupted, causing this queue to
> be flagged as removable within
> ActiveMQServerImpl.undeployAddressesAndQueueNotInConfiguration() method ?
>
> *Investigation update (2024-11-25)*
> Added a root cause analysis, reproduction scenario and workaround.
>
> *Grafana visualisation of the depth of notif.* queues when the incident
> occurred:*
> * primary-1 had 0 cluster-connection nodes
> * primary-2 had 2 cluster-connection nodes
> * primary-3 had 1 cluster-connection node
> !image-2024-10-08-14-26-51-937.png!
>
>