[
https://issues.apache.org/jira/browse/ARTEMIS-4305?focusedWorklogId=931117&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-931117
]
ASF GitHub Bot logged work on ARTEMIS-4305:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 21/Aug/24 10:31
Start Date: 21/Aug/24 10:31
Worklog Time Spent: 10m
Work Description: iiliev2 commented on PR #4899:
URL:
https://github.com/apache/activemq-artemis/pull/4899#issuecomment-2301713891
> Is this the kind of message ... you were referring to in the Jira
No, `MessageFlowRecordImpl` becoming bad means that the peer broker holds on
to an instance of it that should have been discarded. We have a workaround
hack in our code that detects when a peer broker is down or its identity has
changed, and forces the local broker to evict its message flow record for that
peer. Something like:
```java
// Workaround: Topology#removeMember is not public API, so we look it up and
// call it via reflection to force eviction of the stale topology entry.
this.removeMemberMethod = Topology.class
        .getDeclaredMethod("removeMember", long.class, String.class, boolean.class);
this.removeMemberMethod.setAccessible(true);
...
// A later unique event ID is passed, presumably so the removal is not
// ignored as out of date.
removeMemberMethod.invoke(topology,
        topologyMember.getUniqueEventID() + 1000, topologyMemberId);
```
> I reproduced this with a very simple manual test
I assume you mean you reproduced the WARN message. This is not what we are
trying to fix.
> I would recommend for you ... to use `0 reconnect-attempts` ... given the
fact that persistence is disabled
We were already running with that earlier. It resulted in other bugs with the
topology. I can't remember the exact details anymore (it was a couple of years
ago), but sometimes when a peer came back after the reconnect attempts had
stopped, it would no longer be re-admitted to the cluster.
We don't want to go back to a configuration we already tried and know didn't
work.
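For reference, the setting in question lives in the `cluster-connection` block of broker.xml. A minimal fragment of what we had been running (the cluster name is illustrative; the element names are the real Artemis ones):

```xml
<cluster-connections>
   <cluster-connection name="my-cluster">
      <!-- 0 = do not retry a broken cluster bridge; drop the peer instead -->
      <reconnect-attempts>0</reconnect-attempts>
   </cluster-connection>
</cluster-connections>
```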
> I added your test to the branch with my fix ... but the test still fails
Do you mean it is a false negative (the test failed but shouldn't have)? How do
you verify that the topology recovers correctly?
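To make the question concrete, a recovery check could be sketched roughly like this (hypothetical helper, not taken from the actual test): poll a topology snapshot until it contains exactly the expected node IDs, i.e. the restarted peer's new node ID is present and the stale entry for the old node ID has been evicted.

```java
import java.util.Set;
import java.util.function.Supplier;

final class TopologyCheck {
    /**
     * Polls a topology snapshot until it contains exactly the expected node
     * IDs (new peer present, stale peer evicted) or the timeout elapses.
     * Returns true when the topology matched before the deadline.
     */
    static boolean awaitMembers(Supplier<Set<String>> topologySnapshot,
                                Set<String> expected,
                                long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (topologySnapshot.get().equals(expected)) {
                return true;
            }
            Thread.sleep(50); // back off briefly between polls
        }
        return topologySnapshot.get().equals(expected);
    }
}
```

A test that only waits for the WARN message would not distinguish the two cases; a check like the above fails precisely when the stale entry is still in the topology.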
Issue Time Tracking
-------------------
Worklog Id: (was: 931117)
Time Spent: 2h 10m (was: 2h)
> Zero persistence does not work in kubernetes
> --------------------------------------------
>
> Key: ARTEMIS-4305
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4305
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Reporter: Ivan Iliev
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> In a cluster deployed in kubernetes, when a node is destroyed it terminates
> the process and shuts down the network before the process has a chance to
> close connections. Then a new node might be brought up, reusing the old
> node's IP. If this happens before the connection TTL expires, from Artemis'
> point of view it looks as if the connection came back. Yet it is actually not
> the same peer: it has a new node ID, etc. This messes things up in the
> cluster, because the old message flow record is invalid.
> One way to fix it could be if the {{Ping}} messages which are typically used
> to detect dead connections carried some sort of connection ID, so the
> receiver could verify that the other side really is the peer it is supposed
> to be.
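The identity check proposed in the issue description could be sketched roughly like this (all names are illustrative; this is not the actual Artemis `Ping` implementation): the ping carries the sender's node ID and a connection ID assigned at connect time, and the receiver compares them against what it recorded when the connection was opened.

```java
// Hypothetical sketch of a ping that carries the sender's identity, so the
// receiver can detect that "the same connection" now belongs to a different
// broker (e.g. a new pod that reused the old pod's IP).
final class IdentifiedPing {
    final String nodeId;     // cluster node ID of the sender
    final long connectionId; // ID assigned when the connection was opened

    IdentifiedPing(String nodeId, long connectionId) {
        this.nodeId = nodeId;
        this.connectionId = connectionId;
    }

    /** True when the ping matches the identity recorded at connect time. */
    boolean matches(String expectedNodeId, long expectedConnectionId) {
        return nodeId.equals(expectedNodeId)
            && connectionId == expectedConnectionId;
    }
}
```

On a mismatch, the receiver would treat the connection as dead immediately (instead of waiting out the TTL) and evict the corresponding message flow record.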
--
This message was sent by Atlassian Jira
(v8.20.10#820010)