jbertram commented on PR #4899:
URL: https://github.com/apache/activemq-artemis/pull/4899#issuecomment-2307838297

   I spoke at length with Ivan on Slack yesterday about this issue, and I 
wanted to summarize my thoughts here for posterity's sake.
   
   The _real_ issue that this PR is attempting to address is related to a quirk 
(for lack of a better word) in how TCP can sometimes work in a containerized 
environment. In short, it's possible for a TCP connection to be closed on one 
side without the other side receiving the appropriate `RST`. This is described 
in more detail 
[here](https://blog.box.com/container-networking-mystery-missing-rsts).
   
   The use-case here involves a cluster of embedded brokers, each running in a 
separate K8s pod without persistence. If K8s restarts one of those pods with 
the same IP address, the "same" broker rejoins the cluster with a new identity 
(since the node ID is not persisted and so cannot be reused). When this TCP 
"quirk" happens, the other clustered brokers don't clean up their cluster 
connection. Their TCP connection to the old broker apparently just becomes a 
connection to the new broker. This, of course, breaks things in weird ways with 
the internal state (e.g. `MessageFlowRecord`) of the cluster connection.
   
   That said, the test on this PR **is not testing this use-case**. Instead it 
is testing a _similar but technically different_ use-case where one node in a 
cluster is restarted with a new identity (i.e. node ID) on the same IP address. 
If the other nodes have `reconnect-attempts` > `0` on their 
`cluster-connection`, they will reconnect to this node and things will break 
internally due to mismatched state. The solution to this problem is simply to 
use `0` for `reconnect-attempts`. No code changes are required. This is, of 
course, what I've recommended in previous comments. It took a conversation in 
Slack to clarify the actual use-case.
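   
   For illustration, a `cluster-connection` with reconnection disabled might 
look roughly like the following in `broker.xml`. This is only a sketch: the 
connection name, connector name, and discovery group name are placeholders I 
made up, not values from the setup discussed here, and an embedded broker may 
well apply the equivalent setting programmatically instead.
   
   ```xml
   <cluster-connections>
      <!-- "my-cluster", "netty-connector", and "my-discovery-group" are hypothetical names -->
      <cluster-connection name="my-cluster">
         <connector-ref>netty-connector</connector-ref>
         <!-- 0 = do not try to reconnect to a cluster node after its connection fails -->
         <reconnect-attempts>0</reconnect-attempts>
         <discovery-group-ref discovery-group-name="my-discovery-group"/>
      </cluster-connection>
   </cluster-connections>
   ```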
   
   Ultimately, I do not believe the test on this PR is valid for the real 
problem that needs to be solved (i.e. the missing TCP `RST`). Furthermore, I'm 
not sure attempting to fix this problem at the level of the broker is 
appropriate. It seems to me that solving it at the network or infrastructure 
level would be more effective and wouldn't burden _every other user_ with a 
protocol change for this edge case.
   
   At this point I'm closing this PR as I don't see a future for it here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]