[ 
https://issues.apache.org/jira/browse/ARTEMIS-4305?focusedWorklogId=931117&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-931117
 ]

ASF GitHub Bot logged work on ARTEMIS-4305:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Aug/24 10:31
            Start Date: 21/Aug/24 10:31
    Worklog Time Spent: 10m 
      Work Description: iiliev2 commented on PR #4899:
URL: 
https://github.com/apache/activemq-artemis/pull/4899#issuecomment-2301713891

   > Is this the kind of message ... you were referring to in the Jira
   
   No, `MessageFlowRecordImpl` becoming bad means that the peer broker holds on 
to an instance of it which should have been discarded. We have a workaround 
hack in our code that detects when a peer broker is down or its identity has 
changed, and forces the local broker to evict its message flow record for that 
peer. Something like:
   ```java
    // Workaround: use reflection to force the local broker to evict the stale
    // topology entry for the dead/replaced peer.
    this.removeMemberMethod = Topology.class
            .getDeclaredMethod("removeMember", long.class, String.class, boolean.class);
    this.removeMemberMethod.setAccessible(true);
    ...
    // Pass an event id newer than the member's current one so the removal wins.
    // The final boolean argument is assumed here to match the looked-up
    // three-parameter signature.
    removeMemberMethod.invoke(topology,
            topologyMember.getUniqueEventID() + 1000, topologyMemberId, false);
   ```
   > I reproduced this with a very simple manual test
   
   I assume you mean you reproduced the WARN message. This is not what we are 
trying to fix.
   > I would recommend for you ... to use `0 reconnect-attempts` ... given the 
fact that persistence is disabled
   
   We were already running with that earlier. It resulted in other bugs with 
the topology. I can't remember the exact details anymore (it was a couple of 
years ago), but sometimes when a peer came back after the reconnect attempts 
stopped, it would no longer be re-admitted to the cluster.
   We don't want to go back to a configuration we have already tried and know 
didn't work.
   > I added your test to the branch with my fix ... but the test still fails
   
   Do you mean it is a false negative (the test failed but shouldn't have)? 
How do you verify that the topology recovers correctly?
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 931117)
    Time Spent: 2h 10m  (was: 2h)

> Zero persistence does not work in kubernetes
> --------------------------------------------
>
>                 Key: ARTEMIS-4305
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4305
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Ivan Iliev
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In a cluster deployed in kubernetes, when a node is destroyed it terminates 
> the process and shuts down the network before the process has a chance to 
> close its connections. A new node might then be brought up, reusing the old 
> node’s ip. If this happens before the connection ttl expires, from artemis’ 
> point of view it looks as if the connection came back. Yet it is not actually 
> the same peer: it has a new node id, etc. This messes up the cluster, because 
> the old message flow record is now invalid.
> One way to fix this could be for the {{Ping}} messages, which are typically 
> used to detect dead connections, to carry some sort of connection id so that 
> the receiver can verify the other side is really the node it is supposed to be.
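
The identity check proposed above can be sketched as follows. This is not the
Artemis wire format or API; the `Ping` and `Connection` types and their fields
are hypothetical, and only illustrate a keep-alive that carries the sender's
node id so the receiver can tell a reused ip apart from the original peer:

```java
import java.util.Objects;

// Hypothetical sketch: a keep-alive packet that proves identity, not just liveness.
public class PingIdentityCheck {

    // What a ping could carry in addition to acting as a liveness probe.
    record Ping(String senderNodeId) {}

    // Per-connection state remembered from the initial handshake.
    static final class Connection {
        private final String expectedNodeId;

        Connection(String expectedNodeId) {
            this.expectedNodeId = expectedNodeId;
        }

        // True if the ping comes from the node we originally handshaked with.
        // False means a different node (e.g. a new kubernetes pod reusing the
        // old ip) is answering, so the connection and its message flow record
        // should be evicted rather than treated as "the connection came back".
        boolean acceptPing(Ping ping) {
            return Objects.equals(expectedNodeId, ping.senderNodeId());
        }
    }

    public static void main(String[] args) {
        Connection conn = new Connection("node-A");
        System.out.println(conn.acceptPing(new Ping("node-A"))); // original peer
        System.out.println(conn.acceptPing(new Ping("node-B"))); // ip reused by new node
    }
}
```

With such a check, a ping answered by a node with a different id would fail
before the connection ttl expires, instead of silently reviving a stale
message flow record.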



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact

