[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Steshin updated IGNITE-13016:
--------------------------------------
    Description:
We should fix 3 drawbacks in the backward checking of a failed node:

1) We should replace the hardcoded 100 ms timeout with a parameter like failureDetectionTimeout:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    ...
    sock.connect(addr, 100);
    ...
}
{code}

2) The maximum interval for checking the previous node should be reconsidered. It should rely on a configurable parameter like failureDetectionTimeout:
{code:java}
TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
...
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Why '2 * CON_CHECK_INTERVAL' and not failureDetectionTimeout?

if (ok) {
    // Check case when previous node suddenly died. This will speed up
    // node failing.
    ...
}

res.previousNodeAlive(ok);
{code}

3) Any negative result of the connection check should be treated as a node failure. Currently, we only look at a refused connection. Any other exception, including a timeout, is treated as a live connection:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    ...
    catch (ConnectException e) {
        return true;
    }
    catch (IOException e) {
        return false; // Why doesn't a timeout mean a lost connection?
    }

    return false;
}
{code}

  was:
We should fix 3 drawbacks in the backward checking of a failed node:

1) We should replace the hardcoded 100 ms timeout with a parameter like failureDetectionTimeout:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    ...
    sock.connect(addr, 100);
    ...
}
{code}

2) The maximum interval for checking the previous node should be reconsidered. It should rely on a configurable parameter:
{code:java}
TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
...
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Why '2 * CON_CHECK_INTERVAL' and not failureDetectionTimeout.

if (ok) {
    // Check case when previous node suddenly died. This will speed up
    // node failing.
    ...
}

res.previousNodeAlive(ok);
{code}

3) Any negative result of the connection check should be treated as a node failure. Currently, we only look at a refused connection. Any other exception, including a timeout, is treated as a live connection:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    ...
    catch (ConnectException e) {
        return true;
    }
    catch (IOException e) {
        return false; // Why doesn't a timeout mean a lost connection?
    }

    return false;
}
{code}


> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>
> We should fix 3 drawbacks in the backward checking of a failed node:
>
> 1) We should replace the hardcoded 100 ms timeout with a parameter like
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     sock.connect(addr, 100);
>     ...
> }
> {code}
>
> 2) The maximum interval for checking the previous node should be
> reconsidered. It should rely on a configurable parameter like
> failureDetectionTimeout:
> {code:java}
> TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
> ...
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Why '2 * CON_CHECK_INTERVAL' and not failureDetectionTimeout?
>
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     ...
> }
>
> res.previousNodeAlive(ok);
> {code}
>
> 3) Any negative result of the connection check should be treated as a node
> failure. Currently, we only look at a refused connection. Any other
> exception, including a timeout, is treated as a live connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     catch (ConnectException e) {
>         return true;
>     }
>     catch (IOException e) {
>         return false; // Why doesn't a timeout mean a lost connection?
>     }
>
>     return false;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
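
For illustration only, the three points in the description could be combined roughly as sketched below. This is not the actual Ignite change: the class name BackwardCheckSketch, the method names isUnreachable and recentlySeen, and the way the timeout is passed in are all hypothetical, and the real ServerImpl code has more context than shown in the issue. The sketch assumes the timeout is a positive number of milliseconds derived from something like failureDetectionTimeout.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketAddress;

// Hypothetical sketch; names and structure are assumptions, not Ignite code.
class BackwardCheckSketch {
    /**
     * Points 1 and 3: the connect timeout is a parameter instead of a
     * hardcoded 100 ms, and ANY I/O failure (refused connection, timeout,
     * unreachable host) counts as "node unreachable".
     *
     * @param timeoutMs Positive connect timeout in milliseconds
     *                  (0 would mean "wait forever" for Socket.connect).
     */
    static boolean isUnreachable(SocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false; // Connected, so the node still accepts connections.
        }
        catch (IOException e) {
            // ConnectException and SocketTimeoutException are both IOExceptions.
            return true;
        }
    }

    /**
     * Point 2: the "recently heard from the previous node" window is a
     * configurable timeout rather than 2 * CON_CHECK_INTERVAL.
     */
    static boolean recentlySeen(long rcvdTime, long now, long failureDetectionTimeout) {
        return rcvdTime + failureDetectionTimeout >= now;
    }

    public static void main(String[] args) {
        // Example address and timeout; both values are arbitrary.
        SocketAddress addr = new InetSocketAddress("127.0.0.1", 47500);

        System.out.println("unreachable: " + isUnreachable(addr, 500));
        System.out.println("recently seen: " + recentlySeen(1_000L, 1_500L, 600L));
    }
}
```

Treating every IOException as a failure is the stricter reading of point 3; whether some exception types deserve a retry before the node is declared failed is left open here.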