[ 
https://issues.apache.org/jira/browse/IGNITE-4473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897095#comment-15897095
 ] 

Dmitry Karachentsev commented on IGNITE-4473:
---------------------------------------------

1. During exchange, if an IOException is caught and the local node is a client, 
IgniteCouldReconnectCheckedException is thrown.
2. It is handled in the IgniteKernal.start() method and signals that the client 
should be reconnected to the cluster.
3. For that purpose, a rejoin() method was added to GridDiscoveryManager and 
ClientImpl. It makes the client initiate a disconnect from the cluster, which 
forces all node-leave routines to run, and then try to join again.
4. When the start routine catches IgniteCouldReconnectCheckedException, it calls 
rejoin() and waits on the reconnect future. If any other exception is thrown, 
the node is stopped.
5. This blocks the user thread during node start; the thread is released once 
the rejoin succeeds.
6. Added an onReconnectFailed() method to GridKernalGateway that completes the 
reconnect future with an exception. This exception is processed in the 
IgniteKernal rejoin loop.
7. ClientImpl.SocketWriter.forceLeave() blocks until the node-left message has 
been sent (or sending has failed) and then closes the connection to the cluster.
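
To illustrate the control flow in items 1-6, here is a minimal, self-contained 
sketch. It is not the actual patch: CouldReconnectException, Discovery, 
StartStep, startWithRejoin() and the maxRejoins limit are hypothetical 
stand-ins introduced for illustration, and the way a failed reconnect future 
is handled (retry on a reconnectable failure, stop otherwise) is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class RejoinLoopSketch {
    /** Hypothetical stand-in for IgniteCouldReconnectCheckedException. */
    static class CouldReconnectException extends Exception {
        CouldReconnectException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical stand-in for the component exposing rejoin(). */
    interface Discovery {
        /** Leaves the cluster, joins again, and returns the reconnect future. */
        CompletableFuture<Void> rejoin();
    }

    /** Hypothetical stand-in for the part of node start that runs the exchange. */
    interface StartStep {
        void run() throws Exception;
    }

    /**
     * Runs the start step and, if it reports a reconnectable failure, keeps
     * rejoining until the reconnect future completes normally.
     *
     * @return number of rejoin attempts that were made (0 if start succeeded).
     */
    static int startWithRejoin(StartStep exchange, Discovery disco, int maxRejoins)
        throws Exception {
        try {
            exchange.run();

            return 0; // Started normally, no rejoin needed.
        }
        catch (CouldReconnectException e) {
            // Reconnectable failure: fall through to the rejoin loop.
            // Any other exception propagates and the node is stopped.
        }

        int rejoins = 0;

        while (true) {
            // Assumption: the sketch bounds the loop so it always terminates.
            if (rejoins == maxRejoins)
                throw new IllegalStateException("Rejoin limit reached: " + maxRejoins);

            rejoins++;

            try {
                // Blocks the calling (user) thread until rejoin completes.
                disco.rejoin().get();

                return rejoins;
            }
            catch (ExecutionException e) {
                // The future was completed with an exception
                // (the role onReconnectFailed() plays in the comment above).
                if (!(e.getCause() instanceof CouldReconnectException))
                    throw e; // Not reconnectable: give up.
            }
        }
    }
}
```

In this sketch the caller of startWithRejoin() stays blocked for the whole 
loop, which mirrors item 5: the user thread is released only after a rejoin 
succeeds or a non-reconnectable error is raised.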

Left to do:
Add a test and code for the case where the client was disconnected from the 
cluster but the connection to the coordinator was not fully restored. The client 
node should keep trying to rejoin until the coordinator becomes available.

> Client should re-try connection attempt in case of concurrent network failure
> -----------------------------------------------------------------------------
>
>                 Key: IGNITE-4473
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4473
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Vladimir Ozerov
>            Assignee: Dmitry Karachentsev
>             Fix For: 2.0
>
>
> *Problem*
> Consider the following scenario:
> 1) The client starts, but there are no servers, so it hangs somewhere inside 
> the start routine.
> 2) A server appears, and discovery finishes successfully.
> 3) The nodes start talking to each other through the communication SPI to 
> finish the start process (e.g. to complete exchange).
> 4) A network glitch then occurs and the server becomes unreachable.
> *Expected behavior*
> Client disconnects and hangs waiting for reconnect.
> *Actual behavior*
> Client throws an exception and never tries to reconnect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
