[ 
https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13016:
--------------------------------------
    Description: 
We should fix three drawbacks in the backward check of a failed node:

1) The hardcoded 100 ms connect timeout should be replaced with a 
configurable parameter such as failureDetectionTimeout:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
   ...
   sock.connect(addr, 100);
   ...
}
{code}
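
A minimal sketch of item 1, assuming the probe timeout is simply taken from failureDetectionTimeout. The class ConnectionProbe and its methods are hypothetical names used for illustration and are not part of ServerImpl:

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical helper for illustration; not an existing Ignite class. */
final class ConnectionProbe {
    /** Connect timeout in milliseconds, derived from the configured value. */
    private final int connTimeoutMs;

    /**
     * @param failureDetectionTimeoutMs Configured timeout; assumed here to bound the probe.
     */
    ConnectionProbe(long failureDetectionTimeoutMs) {
        if (failureDetectionTimeoutMs <= 0)
            throw new IllegalArgumentException("Timeout must be positive: " + failureDetectionTimeoutMs);

        // Socket.connect() takes an int timeout, so clamp the long value.
        connTimeoutMs = (int)Math.min(failureDetectionTimeoutMs, Integer.MAX_VALUE);
    }

    /** @return Timeout that will be passed to Socket.connect(). */
    int connectTimeoutMs() {
        return connTimeoutMs;
    }

    /** @return {@code true} if a TCP connection was established within the timeout. */
    boolean canConnect(SocketAddress addr) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, connTimeoutMs);

            return true;
        }
        catch (IOException e) {
            return false;
        }
    }
}
```

Whether the full failureDetectionTimeout or only a fraction of it should be spent on this probe is left open here.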


2) The maximum interval for checking the previous node should be reconsidered. 
It should rely on a configurable parameter such as failureDetectionTimeout:
{code:java}
   TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
   ...
   // We got message from previous in less than double connection check interval.
   boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Why '2 * CON_CHECK_INTERVAL', not a failureDetectionTimeout?

   if (ok) {
      // Check case when previous node suddenly died. This will speed up
      // node failing.
      ...
   }

   res.previousNodeAlive(ok);
{code}
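
A minimal sketch of item 2, assuming the window is exactly failureDetectionTimeout instead of 2 * CON_CHECK_INTERVAL. PreviousNodeCheck and receivedRecently are hypothetical names, and all values are in milliseconds:

```java
/** Hypothetical helper for illustration; not an existing Ignite class. */
final class PreviousNodeCheck {
    private PreviousNodeCheck() {
        // Static helper only.
    }

    /**
     * @param rcvdTimeMs Time the last message from the previous node was received.
     * @param nowMs Current time.
     * @param failureDetectionTimeoutMs Configured timeout used as the window (an assumption).
     * @return {@code true} if the last message arrived within the window.
     */
    static boolean receivedRecently(long rcvdTimeMs, long nowMs, long failureDetectionTimeoutMs) {
        // Subtracting first avoids the overflow that 'rcvdTime + timeout' could hit for a very large timeout.
        return nowMs - rcvdTimeMs <= failureDetectionTimeoutMs;
    }
}
```

For example, with a 1000 ms timeout a message received 1000 ms ago still counts as recent, while one received 1001 ms ago does not.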


3) Any negative result of the connection check should be treated as a node 
failure. Currently, we only look for a refused connection. Any other exception, 
including a timeout, is treated as a live connection:

{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
   ...
   catch (ConnectException e) {
      return true;
   }
   catch (IOException e) {
      return false; // Why doesn't a timeout mean a lost connection?
   }

   return false;
}
{code}
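
A minimal sketch of item 3, assuming every IOException from the connect attempt (refused, timed out, no route) counts as a failed node. BackwardCheck, Connector and previousNodeFailed are hypothetical names. The connect step is passed in as a parameter only so that the classification can be exercised without a real network:

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical helper for illustration; not an existing Ignite class. */
final class BackwardCheck {
    /** Abstraction over the connect attempt, introduced here only for testability. */
    @FunctionalInterface
    interface Connector {
        void connect(SocketAddress addr, int timeoutMs) throws IOException;
    }

    /** Connector that opens and immediately closes a plain TCP socket. */
    static final Connector SOCKET = (addr, timeoutMs) -> {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
        }
    };

    private BackwardCheck() {
        // Static helper only.
    }

    /**
     * @return {@code true} if the connect attempt failed for any I/O reason,
     *      {@code false} only if the connection was established.
     */
    static boolean previousNodeFailed(Connector connector, SocketAddress addr, int timeoutMs) {
        try {
            connector.connect(addr, timeoutMs);

            return false;
        }
        catch (IOException e) {
            // ConnectException, SocketTimeoutException and other I/O errors all land here.
            return true;
        }
    }
}
```

A real check would call BackwardCheck.previousNodeFailed(BackwardCheck.SOCKET, addr, timeoutMs). Treating every I/O error as a failure trades a higher risk of false positives on a slow network for faster detection.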


  was:
We should fix 3 drawbacks in the backward checking of failed node:

1) We should replace hardcoded timeout 100ms with a parameter like 
failureDetectionTimeout:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
   ...
    sock.connect(addr, 100);
   ...
}
{code}


2) Maximal interval to check previous node should be reconsidered. It should 
rely on configurable param:
{code:java}
   TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
   ...
   // We got message from previous in less than double connection check interval.
   boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; //Why '2 * CON_CHECK_INTERVAL', not a failureDetectionTimeout.

   if (ok) {
      // Check case when previous node suddenly died. This will speed up
      // node failing.
      ...
    }

    res.previousNodeAlive(ok);
{code}


3) Any negative result of the connection checking should be considered as node 
failed. Currently, we look only at refused connection. Any other exceptions, 
including a timeout, are treated as living connection: 

{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
   ...
   catch (ConnectException e) {
      return true;
   }
   catch (IOException e) {
      return false;//Why a timeout doesn't mean lost connection?
   }

   return false;
}
{code}



> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>
> We should fix three drawbacks in the backward check of a failed node:
> 1) The hardcoded 100 ms connect timeout should be replaced with a 
> configurable parameter such as failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>     sock.connect(addr, 100);
>    ...
> }
> {code}
> 2) The maximum interval for checking the previous node should be 
> reconsidered. It should rely on a configurable parameter such as 
> failureDetectionTimeout:
> {code:java}
>    TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
>    ...
>    // We got message from previous in less than double connection check interval.
>    boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Why '2 * CON_CHECK_INTERVAL', not a failureDetectionTimeout?
>    if (ok) {
>       // Check case when previous node suddenly died. This will speed up
>       // node failing.
>       ...
>     }
>     res.previousNodeAlive(ok);
> {code}
> 3) Any negative result of the connection check should be treated as a node 
> failure. Currently, we only look for a refused connection. Any other 
> exception, including a timeout, is treated as a live connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>    ...
>    catch (ConnectException e) {
>       return true;
>    }
>    catch (IOException e) {
>       return false; // Why doesn't a timeout mean a lost connection?
>    }
>    return false;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
