[ 
https://issues.apache.org/jira/browse/IGNITE-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611911#comment-14611911
 ] 

Denis Magda commented on IGNITE-882:
------------------------------------

The root cause has been found.

We have a half-joined node problem here.

1) Node_A wants to join the cluster and sends a join request;
2) Node_B processes the join request and responds to Node_A;
3) Node_A receives the response from Node_B, after which a "node add finished" 
message is propagated to the cluster;
4) Node_A receives "node add finished", but the message is not fully processed 
by SocketReader (because of thread scheduling), so Node_A's thread that is 
responsible for joining sends one more join request when netTimeout expires;
5) In parallel with 4), Node_A is considered to have left the ring (because 
there was no response from it within some timeout) and is removed from the 
ring;
6) Node_B/C/whatever receives the second join request from Node_A, and this 
time the attempt succeeds: Node_A is added to the topology, but with the same 
ID as before.
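
The sequence above can be replayed deterministically with a toy model. This is 
only a sketch under assumptions: Ring, JoiningNode, the "NODE_ADD_FINISHED" 
string and the order counter are hypothetical stand-ins invented for 
illustration, not TcpDiscoverySpi classes, and the real race depends on thread 
scheduling rather than on a fixed order of calls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

/** Toy replay of the half-joined node race; none of these types are Ignite classes. */
final class HalfJoinedNodeSketch {
    /** Hypothetical stand-in for the ring as seen by Node_B and the other members. */
    static final class Ring {
        private final Map<UUID, Integer> members = new LinkedHashMap<>();
        final List<String> events = new ArrayList<>();
        private int nextOrder = 1;

        /** Accepts a join request unless a node with this ID is currently in the ring. */
        boolean handleJoinRequest(UUID id) {
            if (members.containsKey(id))
                return false;
            int order = nextOrder++;
            members.put(id, order);
            events.add("ADDED " + id + " order=" + order);
            return true;
        }

        /** Removes a node that did not respond in time. */
        void failNode(UUID id) {
            if (members.remove(id) != null)
                events.add("FAILED " + id);
        }
    }

    /** Hypothetical stand-in for Node_A: messages wait in a queue until the reader runs. */
    static final class JoiningNode {
        final UUID id;
        final Queue<String> unprocessed = new ArrayDeque<>();
        boolean joined;

        JoiningNode(UUID id) {
            this.id = id;
        }

        /** Models the reader finally being scheduled and draining the queue. */
        void processInbox() {
            while (!unprocessed.isEmpty())
                if ("NODE_ADD_FINISHED".equals(unprocessed.poll()))
                    joined = true;
        }
    }

    /** Replays steps 1-6 in a fixed order and returns the events seen by the ring. */
    static List<String> replay(UUID id) {
        Ring ring = new Ring();
        JoiningNode nodeA = new JoiningNode(id);

        // Steps 1-3: the join request is accepted and "node add finished" reaches Node_A.
        ring.handleJoinRequest(nodeA.id);
        nodeA.unprocessed.add("NODE_ADD_FINISHED");

        // Step 4: the timeout expires before the reader has run, so the joining
        // thread still sees joined == false and decides to retry.
        boolean retry = !nodeA.joined;

        // Step 5: meanwhile the ring gives up on Node_A and removes it.
        ring.failNode(nodeA.id);

        // Step 6: the second join request arrives; the ID is no longer in the ring,
        // so the same ID is added again with a new order.
        if (retry)
            ring.handleJoinRequest(nodeA.id);

        // The reader only runs now, too late to prevent the retry.
        nodeA.processInbox();

        return ring.events;
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("10cf7906-50af-4f46-9c31-baf419539001");
        replay(id).forEach(System.out::println);
    }
}
```

In this model the ring records ADDED, FAILED and then ADDED again for the same 
ID, which matches the order of the log messages in the issue description.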

Thinking over a fix.

> Node can join twice with the same ID
> ------------------------------------
>
>                 Key: IGNITE-882
>                 URL: https://issues.apache.org/jira/browse/IGNITE-882
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>            Reporter: Semen Boikov
>            Assignee: Denis Magda
>            Priority: Critical
>             Fix For: sprint-7
>
>
> Observed in the test 
> 'GridCacheColocatedFailoverSelfTest.testOptimisticRepeatableReadTxConstantTopologyChange':
> Node joined:
> {noformat}
> [15:53:24,163][INFO 
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=10cf7906-50af-4f46-9c31-baf419539001, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47525], discPort=47525, order=400, intOrder=202, 
> loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Node failed:
> {noformat}
> [15:53:24,171][WARN 
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=10cf7906-50af-4f46-9c31-baf419539001, 
> addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47525], discPort=47525, order=400, 
> intOrder=202, loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Then this message appears from the thread starting the new node:
> {noformat}
> [15:53:29,047][WARN ][topology-change-thread-1][TcpDiscoverySpi] Node has not 
> been connected to topology and will repeat join process. Check remote nodes 
> logs for possible error messages. Note that large topology may require 
> significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' 
> configuration property if getting this message on the starting nodes 
> [networkTimeout=5000]
> {noformat}
> Node joined again with the same ID:
> {noformat}
> [15:53:29,212][INFO 
> ][disco-event-worker-#121%dht.GridCacheColocatedFailoverSelfTest0%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=10cf7906-50af-4f46-9c31-baf419539001, addrs=[127.0.0.1], 
> sockAddrs=[/127.0.0.1:47525], discPort=47525, order=404, intOrder=205, 
> loc=false, ver=1.0.3#19700101-sha1:00000000, isClient=false]
> {noformat}
> Then the test hangs (in the log I see that a future mapped on the node 
> '10cf7906-50af-4f46-9c31-baf419539001' did not finish).
> The same issue is observed in tests extending 
> GridCacheAbstractNodeRestartSelfTest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
