[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13016:
--------------------------------------
    Description:
We should fix several drawbacks in the backward checking of a failed node. They prolong node failure detection up to:
ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.

See:
* '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates slow responses from a failed node and measures failure detection delays.
* '_FailureDetectionResearch.txt_' - results of the test.
* '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
* '_WostCaseStepByStep.txt_' - a description of how the worst case happens.

*Suggestions:*

1) We should replace the hardcoded 100ms timeout with a parameter like failureDetectionTimeout:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    ...
    sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
    ...
}
{code}

2) Any negative result of the connection check should be treated as a node failure. Currently, only a refused connection counts. Any other exception, including a timeout, is treated as a live connection:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    ...
    catch (ConnectException e) {
        return true;
    }
    catch (IOException e) {
        return false; // Make any error mean lost connection.
    }

    return false;
}
{code}

3) The maximal interval for checking the previous node should rely on the actual failure detection timeout:
{code:java}
TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
...
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.

if (ok) {
    // Check case when previous node suddenly died. This will speed up
    // node failing.
    ...
}

res.previousNodeAlive(ok);
{code}

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, NodeFailureResearch.patch, WostCaseStepByStep.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
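
For illustration only, suggestions 1) and 2) could be sketched roughly as below. This is not the actual ServerImpl code: the class name ConnectionCheckSketch, the method names isConnectionLost and worstCaseDetectionMs, and the timeoutMs parameter are hypothetical, and the values passed in main are example inputs rather than confirmed Ignite defaults. The sketch takes the connect timeout from the caller instead of hardcoding 100ms, and treats any IOException (refused connection, timeout, or other error) as a lost connection. The second method just evaluates the worst-case bound quoted in the description.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketAddress;

// Hypothetical sketch; names and structure are assumptions, not Ignite code.
class ConnectionCheckSketch {
    /**
     * Tries to open a TCP connection to addr within timeoutMs milliseconds.
     * timeoutMs stands in for a value derived from failureDetectionTimeout
     * and must be positive (0 would mean "wait forever" for Socket.connect).
     *
     * @return true if the connection should be considered lost.
     */
    static boolean isConnectionLost(SocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false; // Connected: the peer still accepts connections.
        }
        catch (IOException e) {
            // ConnectException, SocketTimeoutException and any other I/O error
            // are all treated as a lost connection.
            return true;
        }
    }

    /** Worst-case delay from the description: CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300ms. */
    static long worstCaseDetectionMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + 300;
    }

    public static void main(String[] args) throws IOException {
        // A local listening socket plays the role of a live previous node.
        try (ServerSocket srv = new ServerSocket(0)) {
            SocketAddress addr = new InetSocketAddress("127.0.0.1", srv.getLocalPort());

            System.out.println("Listening node considered lost: " + isConnectionLost(addr, 1_000));
        }

        // Example inputs only (assumed values, in milliseconds).
        System.out.println("Worst-case detection delay, ms: " + worstCaseDetectionMs(500, 10_000));
    }
}
```

Whether every I/O error should really count as a node failure is the design question raised in suggestion 2); the sketch only shows what that choice looks like in code.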