[ https://issues.apache.org/jira/browse/IGNITE-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Steshin updated IGNITE-13014:
--------------------------------------
    Description: 
Currently, node availability is checked twice. This prolongs node failure 
detection and gives no additional benefit. The routine is also messy and 
relies on hardcoded values.

Suppose node 2 stops responding. Node 1 becomes unable to ping node 2 and 
asks node 3 to establish a permanent connection in place of node 2. Although 
node 2 has already been pinged within the configured timeouts, node 3 tries 
to connect to node 2 as well.


Disadvantages:

1) Node failure detection can take as long as 
ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout 
+ 300ms. See ‘WostCase.txt’.

2) The decision to check the availability of the previous node is unexpected 
and not configurable. It is based on ‘2 * ServerImpl.CON_CHECK_INTERVAL’:

// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; 

If ‘ok == true’, node 3 checks node 2.

3) Double node checking introduces several hardcoded, non-configurable delays. 
Node 3 checks node 2 with a hardcoded timeout of 100ms. See 
ServerImpl.isConnectionRefused():

sock.connect(addr, 100);

4) Node 1 marks node 2 as alive again after a hardcoded delay of 200ms. See 
ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive():
{code:java}
try {
   Thread.sleep(200);
}
catch (InterruptedException e) {
   Thread.currentThread().interrupt();
}
{code}

5) The availability check of the previous node treats any exception other than 
ConnectException (connection refused) as an existing connection, even a 
timeout. See ServerImpl.isConnectionRefused():

{code:java}
try (Socket sock = new Socket()) {
    sock.connect(addr, 100);
}
catch (ConnectException e) {
    return true;
}
catch (IOException e) {
    return false; // Considered OK.
}
{code}
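
To illustrate items 3 and 5, here is a minimal sketch of a probe that takes 
its timeout as a parameter and reports a timeout as its own outcome instead of 
folding it into "OK". NodeProbe, Result and probe() are hypothetical names 
made up for this example. They are not part of ServerImpl, and this is not 
necessarily how the actual fix should look.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical helper (not Ignite code): classifies one connect attempt. */
public class NodeProbe {
    /** Possible outcomes of a single probe. */
    public enum Result { REACHABLE, REFUSED, TIMED_OUT, FAILED }

    /**
     * Tries to open a TCP connection to addr within timeoutMs milliseconds.
     * The timeout is supplied by the caller instead of being hardcoded.
     */
    public static Result probe(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return Result.REACHABLE;
        }
        catch (ConnectException e) {
            // Typically "connection refused": nothing listens on that port.
            return Result.REFUSED;
        }
        catch (SocketTimeoutException e) {
            // No answer within timeoutMs. Reported separately, not as OK.
            return Result.TIMED_OUT;
        }
        catch (IOException e) {
            // Any other I/O problem, e.g. an unreachable network.
            return Result.FAILED;
        }
    }
}
```

A caller could pass a value derived from the configured failure detection 
timeout and decide for itself whether TIMED_OUT should count as a failed node.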


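As a rough worked example of the bound in item 1, the sketch below plugs in 
sample numbers. The values 500ms for CON_CHECK_INTERVAL and 10000ms for 
failureDetectionTimeout are assumptions, so check them against your version 
and configuration. Reading the 300ms as the 100ms connect timeout from item 3 
plus the 200ms sleep from item 4 is also an assumption. WorstCaseDetection 
and worstCaseMs() are hypothetical names.

```java
/** Hypothetical calculator for the worst-case bound quoted in item 1. */
public class WorstCaseDetection {
    /** CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300 ms. */
    public static long worstCaseMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        // Assumed breakdown of the 300 ms: connect timeout + sleep.
        long hardcodedDelaysMs = 100 + 200;

        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + hardcodedDelaysMs;
    }

    public static void main(String[] args) {
        // Assumed sample values: 500 ms check interval, 10000 ms timeout.
        System.out.println(worstCaseMs(500, 10_000) + " ms"); // prints "20800 ms"
    }
}
```

With these sample values, detection may take about twice the configured 
failure detection timeout.
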
  was:
Currently, node availability is checked twice. This prolongs node failure 
detection and gives no additional benefit. The routine is also messy and 
relies on hardcoded values.

Suppose node 2 stops responding. Node 1 becomes unable to ping node 2 and 
asks node 3 to establish a permanent connection in place of node 2. Although 
node 2 has already been pinged within the configured timeouts, node 3 tries 
to connect to node 2 as well.


Disadvantages:

1) Node failure detection can take as long as 
ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout 
+ 300ms. See ‘WostCase.txt’.

2) The decision to check the availability of the previous node is unexpected 
and not configurable. It is based on ‘2 * ServerImpl.CON_CHECK_INTERVAL’:

// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; 

If ‘ok == true’, node 3 checks node 2.

3) Double node checking introduces several hardcoded, non-configurable delays. 
Node 3 checks node 2 with a hardcoded timeout of 100ms. See 
ServerImpl.isConnectionRefused():

sock.connect(addr, 100);

The availability check of the previous node treats any exception other than 
ConnectException (connection refused) as an existing connection, even a 
timeout. See ServerImpl.isConnectionRefused().


> Remove double checking of node availability. Fix hardcoded values.
> ------------------------------------------------------------------
>
>                 Key: IGNITE-13014
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13014
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: WostCase.txt
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
