[ https://issues.apache.org/jira/browse/IGNITE-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611911#comment-14611911 ]
Denis Magda commented on IGNITE-882:
------------------------------------

The issue is found. We have a half-joined node problem here.

1) Node_A wants to join a cluster and sends a join request;
2) Node_B processes this join request and responds to Node_A;
3) Node_A receives the response from Node_B, and after that the "node add finished" message is propagated to the cluster;
4) Node_A receives "node add finished", but SocketReader does not finish processing the message (because of thread scheduling), so Node_A's thread that is responsible for joining sends one more join request when netTimeout expires;
5) In "parallel" with 4), it's decided that Node_A has left the ring (because there was no response from it during some timeout) and Node_A is removed from the ring;
6) Node_B/C/whatever receives the second join request from Node_A, and this time luck is on Node_A's side: it's added to the topology, but with the same ID as before.

Thinking over a fix.

> Node can join twice with the same ID
> ------------------------------------
>
>                 Key: IGNITE-882
>                 URL: https://issues.apache.org/jira/browse/IGNITE-882
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>            Reporter: Semen Boikov
>            Assignee: Denis Magda
>            Priority: Critical
>             Fix For: sprint-7
>
>
> Observed in the test
> 'GridCacheColocatedFailoverSelfTest.testOptimisticRepeatableReadTxConstantTopologyChange':
> Node joined:
> {noformat}
> [15:53:24,163][INFO
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
> Added new node to topology: TcpDiscoveryNode
> [id=10cf7906-50af-4f46-9c31-baf419539001, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:47525], discPort=47525, order=400, intOrder=202,
> loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Node failed:
> {noformat}
> [15:53:24,171][WARN
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
> Node FAILED: TcpDiscoveryNode [id=10cf7906-50af-4f46-9c31-baf419539001,
> addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47525], discPort=47525, order=400,
> intOrder=202, loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Then I see this message from the thread starting the new node:
> {noformat}
> [15:53:29,047][WARN ][topology-change-thread-1][TcpDiscoverySpi] Node has not
> been connected to topology and will repeat join process. Check remote nodes
> logs for possible error messages. Note that large topology may require
> significant time to start. Increase 'TcpDiscoverySpi.networkTimeout'
> configuration property if getting this message on the starting nodes
> [networkTimeout=5000]
> {noformat}
> Node joined again with the same ID:
> {noformat}
> [15:53:29,212][INFO
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
> Added new node to topology: TcpDiscoveryNode
> [id=10cf7906-50af-4f46-9c31-baf419539001, addrs=[127.0.0.1],
> sockAddrs=[/127.0.0.1:47525], discPort=47525, order=404, intOrder=205,
> loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Then the test hangs (in the log I see that a future mapped on the node
> '10cf7906-50af-4f46-9c31-baf419539001' did not finish).
> The same issue is observed in tests extending
> GridCacheAbstractNodeRestartSelfTest.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
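
For illustration only, the half-joined sequence from the comment can be replayed in a small single-threaded toy model. This is a sketch under stated assumptions, not Ignite code: `ToyRing`, `ToyJoiner`, and `simulate_half_join` are hypothetical names, the real interleaving depends on thread scheduling inside the discovery SPI, and the order counter here does not reproduce the actual `order` values from the logs. It only shows how one node ID can produce an ADDED, FAILED, ADDED event sequence.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRing:
    """Toy view of the ring as seen by the other nodes (hypothetical model)."""
    members: dict = field(default_factory=dict)  # node_id -> topology order
    events: list = field(default_factory=list)   # (event, node_id, order)
    next_order: int = 1

    def handle_join_request(self, node_id):
        # Steps 2-3 and 6: accept the join and record that the node was added.
        if node_id in self.members:
            return None  # node is still in the ring, nothing to add
        order = self.next_order
        self.next_order += 1
        self.members[node_id] = order
        self.events.append(("ADDED", node_id, order))
        return order

    def fail_node(self, node_id):
        # Step 5: no response within the timeout, so the node is removed.
        order = self.members.pop(node_id, None)
        if order is not None:
            self.events.append(("FAILED", node_id, order))


@dataclass
class ToyJoiner:
    """Toy joining node: messages can be received before they are processed."""
    node_id: str
    joined: bool = False
    pending: list = field(default_factory=list)

    def receive(self, msg):
        self.pending.append(msg)  # delivered to the reader, not yet processed

    def process_pending(self):
        for msg in self.pending:
            if msg == "node add finished":
                self.joined = True
        self.pending.clear()

    def on_network_timeout(self, ring):
        # Step 4: the joining thread still sees joined == False and retries.
        if not self.joined:
            return ring.handle_join_request(self.node_id)
        return None


def simulate_half_join():
    ring = ToyRing()
    node_a = ToyJoiner("node-a")
    ring.handle_join_request(node_a.node_id)  # steps 1-3: first join accepted
    node_a.receive("node add finished")       # step 4: received, not processed
    ring.fail_node(node_a.node_id)            # step 5: ring drops Node_A
    node_a.on_network_timeout(ring)           # steps 4 and 6: retry is accepted
    node_a.process_pending()                  # the reader catches up too late
    return ring.events


if __name__ == "__main__":
    for event in simulate_half_join():
        print(event)
```

In this model the same ID is added twice with different order values, which matches the shape of the log excerpts in the issue description (added, failed, added again). The sketch does not model or suggest a fix.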