[ 
https://issues.apache.org/jira/browse/ARTEMIS-4305?focusedWorklogId=931587&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-931587
 ]

ASF GitHub Bot logged work on ARTEMIS-4305:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Aug/24 21:33
            Start Date: 23/Aug/24 21:33
    Worklog Time Spent: 10m 
      Work Description: jbertram commented on PR #4899:
URL: 
https://github.com/apache/activemq-artemis/pull/4899#issuecomment-2307838297

   I spoke at length with Ivan on Slack yesterday about this issue, and I 
wanted to summarize my thoughts here for posterity's sake.
   
   The _real_ issue that this PR is attempting to address is related to a quirk 
(for lack of a better word) in how TCP can sometimes work in a containerized 
environment. In short, it's possible for a TCP connection to be closed on one 
side without the other side receiving the appropriate `RST`. This is described 
in more detail 
[here](https://blog.box.com/container-networking-mystery-missing-rsts).
   
   The use-case here involves a cluster of embedded brokers, each running on a 
separate K8s pod without persistence. If K8s restarts one of those pods with 
the same IP address, the "same" broker rejoins the cluster with a new identity 
(since the node ID is not persisted and reused). When this TCP "quirk" occurs, 
the other clustered brokers don't clean up their cluster connection; their TCP 
connection to the old broker apparently just becomes a connection to the new 
broker. This, of course, breaks things in weird ways with the internal state 
(e.g. `MessageFlowRecord`) of the cluster connection.
   
   That said, the test on this PR **is not testing this use-case**. Instead it 
is testing a _similar but technically different_ use-case where one node in a 
cluster is restarted with a new identity (i.e. node ID) on the same IP address. 
If the other nodes have `reconnect-attempts` > `0` on their 
`cluster-connection` they will reconnect to this node and things will break 
internally due to mismatched state. The solution to this problem is simply to 
use `0` for `reconnect-attempts`. No code changes are required. This is, of 
course, what I've recommended in previous comments. It took a conversation in 
Slack to clarify the actual use-case.
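   
   For illustration, a `cluster-connection` with reconnection disabled might 
look roughly like the sketch below in `broker.xml`. The cluster name, connector 
name, and discovery group name are placeholders assumed for the example; they 
are not taken from this PR or from the reporter's deployment, and an embedded 
broker may set the equivalent value programmatically instead.
   
   ```xml
   <cluster-connections>
      <!-- "my-cluster", "netty-connector" and "my-discovery-group" are
           placeholder names assumed for this sketch -->
      <cluster-connection name="my-cluster">
         <connector-ref>netty-connector</connector-ref>
         <!-- 0 = do not try to reconnect to a cluster node once its
              connection has failed -->
         <reconnect-attempts>0</reconnect-attempts>
         <discovery-group-ref discovery-group-name="my-discovery-group"/>
      </cluster-connection>
   </cluster-connections>
   ```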
   
   Ultimately, I do not believe the test on this PR is valid for the real 
problem that needs to be solved (i.e. the missing TCP `RST`). Furthermore, I'm 
not sure attempting to fix this problem at the level of the broker is 
appropriate. It seems to me that solving it at the network or infrastructure 
level would be more effective and wouldn't burden _every other user_ with a 
protocol change for this edge case.
   
   At this point I'm closing this PR as I don't see a future for it here.




Issue Time Tracking
-------------------

    Worklog Id:     (was: 931587)
    Time Spent: 3h 10m  (was: 3h)

> Zero persistence does not work in kubernetes
> --------------------------------------------
>
>                 Key: ARTEMIS-4305
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4305
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Ivan Iliev
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> In a cluster deployed in Kubernetes, when a node is destroyed, Kubernetes 
> terminates the process and shuts down the network before the process has a 
> chance to close its connections. A new node might then be brought up, reusing 
> the old node’s IP. If this happens before the connection TTL expires, then 
> from Artemis’ point of view it looks as if the connection came back. Yet it 
> is not actually the same connection: the peer has a new node ID, etc. This 
> confuses the cluster because the old message flow record is now invalid.
> One way to fix it could be for the {{Ping}} messages, which are typically 
> used to detect dead connections, to carry some sort of connection ID so that 
> each side can verify the peer is really the one it is supposed to be.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
