jbertram commented on PR #4899: URL: https://github.com/apache/activemq-artemis/pull/4899#issuecomment-2307838297
I spoke at length with Ivan on Slack yesterday about this issue, and I wanted to summarize my thoughts here for posterity's sake. The _real_ issue that this PR is attempting to address is related to a quirk (for lack of a better word) in how TCP can sometimes work in a containerized environment. In short, it's possible for a TCP connection to be closed on one side without the other side receiving the appropriate `RST`. This is described in more detail [here](https://blog.box.com/container-networking-mystery-missing-rsts). The use-case here involves a cluster of embedded brokers each running on separate K8s pods without persistence. If one of those pods is restarted by K8s using the same IP address, then the "same" broker rejoins the cluster with a new identity (since the node ID is not persisted and reused), and when this TCP "quirk" happens the other clustered brokers don't clean up their cluster connection. Their TCP connection to the old broker apparently just becomes a connection to the new broker. This, of course, breaks things in weird ways with the internal state (e.g. `MessageFlowRecord`) of the cluster connection.

That said, the test on this PR **is not testing this use-case**. Instead it is testing a _similar but technically different_ use-case where one node in a cluster is restarted with a new identity (i.e. node ID) on the same IP address. If the other nodes have `reconnect-attempts` > `0` on their `cluster-connection`, they will reconnect to this node and things will break internally due to mismatched state.

The solution to this problem is simply to use `0` for `reconnect-attempts`. No code changes are required. This is, of course, what I've recommended in previous comments. It took a conversation in Slack to clarify the actual use-case. Ultimately, I do not believe the test on this PR is valid for the real problem that needs to be solved (i.e. the missing TCP `RST`).
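For reference, `reconnect-attempts` is configured per `cluster-connection` in `broker.xml`. A minimal sketch of what I'm recommending (the cluster and connector names here are just placeholders, not taken from the reporter's deployment):

```xml
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>netty-connector</connector-ref>
      <!-- 0 = never reconnect; the cluster connection is torn down
           as soon as the remote broker goes away, so a restarted pod
           with a new node ID is treated as a brand-new cluster member
           rather than being stitched onto stale internal state -->
      <reconnect-attempts>0</reconnect-attempts>
      <static-connectors>
         <connector-ref>other-broker</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
```

With ephemeral (non-persistent) brokers there is no point retrying, since the broker that comes back will always have a different node ID anyway.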
Furthermore, I'm not sure attempting to fix this problem at the level of the broker is appropriate. It seems to me that solving it at the network or infrastructure level would be more effective and wouldn't burden _every other user_ with a protocol change for this edge case. At this point I'm closing this PR as I don't see a future for it here.
