[
https://issues.apache.org/jira/browse/ARTEMIS-4305?focusedWorklogId=931028&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-931028
]
ASF GitHub Bot logged work on ARTEMIS-4305:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 20/Aug/24 20:46
Start Date: 20/Aug/24 20:46
Worklog Time Spent: 10m
Work Description: jbertram commented on PR #4899:
URL: https://github.com/apache/activemq-artemis/pull/4899#issuecomment-2299734904
I added your test to the branch with my fix, and I can see my fix detecting
a problem and closing the connection, but the test still fails, and I still see
messages like this:
```
WARN [org.apache.activemq.artemis.core.server] AMQ222139:
MessageFlowRecordImpl [nodeID=13207315-5f2f-11ef-b63b-5c80b6f32172,
connector=TransportConfiguration(name=netty-connector,
factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=localhost,
queueName=$.artemis.internal.sf.my-cluster.13207315-5f2f-11ef-b63b-5c80b6f32172,
queue=QueueImpl[name=$.artemis.internal.sf.my-cluster.13207315-5f2f-11ef-b63b-5c80b6f32172,
postOffice=PostOfficeImpl [server=ActiveMQServerImpl::name=localhost],
temp=false]@3ca984c7, isClosed=false, reset=true]::Remote queue binding
exampleQueue2dc45dca-5f2f-11ef-b5ea-5c80b6f32172 has already been bound in the
post office. Most likely cause for this is you have a loop in your cluster due
to cluster max-hops being too large or you have multiple cluster connections to
the same nodes using overlapping addresses
```
Is this the kind of message you see in your K8s cluster when this problem
occurs and is that what you were referring to in the Jira when you said this?
> This messes things up with the cluster, the old message flow record is
invalid.
I reproduced this with a very simple manual test with 2 clustered nodes with
persistence disabled. When I kill one node and restart it I see the `AMQ222139`
message on the _other_ node. However, I resolved this by simply changing the
configuration on the `cluster-connection` using:
```
<reconnect-attempts>0</reconnect-attempts>
```
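For context, here is a minimal sketch of where that element might sit in `broker.xml`. The names `my-cluster` and `netty-connector` are taken from the log output above; the surrounding elements are assumptions for illustration, not a copy of the configuration used by the test:
```
<core xmlns="urn:activemq:core">
   <!-- zero-persistence setup, as in the scenario described here -->
   <persistence-enabled>false</persistence-enabled>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>netty-connector</connector-ref>
         <!-- 0 = do not try to reconnect the cluster bridge after a failure -->
         <reconnect-attempts>0</reconnect-attempts>
         <!-- discovery-group-ref or static-connectors omitted; use whatever
              the existing cluster-connection already defines -->
      </cluster-connection>
   </cluster-connections>
</core>
```
With `0`, the bridge to a failed node should be torn down rather than retried, so a restarted node with a new node ID is handled as a fresh cluster member.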
I then cherry-picked your `ZeroPersistenceSymmetricalClusterTest` test to
the `main` branch. The test fails by default, but when I change the various
`broker.xml` files used by that test to use `0` `reconnect-attempts` the test
passes. Also, given the fact that persistence is disabled this is the
configuration I would recommend. Have you considered this configuration change
in your environment? It seems this would resolve your problem with no code
changes necessary.
Issue Time Tracking
-------------------
Worklog Id: (was: 931028)
Time Spent: 2h (was: 1h 50m)
> Zero persistence does not work in kubernetes
> --------------------------------------------
>
> Key: ARTEMIS-4305
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4305
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Reporter: Ivan Iliev
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> In a cluster deployed in kubernetes, when a node is destroyed it terminates
> the process and shuts down the network before the process has a chance to
> close connections. Then a new node might be brought up, reusing the old
> node’s ip. If this happens before the connection ttl, from artemis’ point of
> view, it looks as if the connection came back. Yet it is actually not
> the same, the peer has a new node id, etc. This messes things up with the
> cluster, the old message flow record is invalid.
> One way to fix it could be if the {{Ping}} messages which are typically used
> to detect dead connections could use some sort of connection id to match that
> the other side is really the one which it is supposed to be.