[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13016:
--------------------------------------
    Attachment:     (was: WostCaseStepByStep.txt)

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.patch, FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, WostCaseStepByStep.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node. They
> prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates slow answers from a failed node and measures failure detection delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - a description of how the worst case happens.
>
> *Suggestions:*
> 1) We should replace the hardcoded 100ms timeout with a parameter such as failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>     ...
> }
> {code}
> 2) Any negative result of the connection check should be treated as a node failure. Currently, only a refused connection counts as a failure. Any other exception, including a timeout, is treated as a live connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     catch (ConnectException e) {
>         return true;
>     }
>     catch (IOException e) {
>         return false; // Make any error mean lost connection.
>     }
>     return false;
> }
> {code}
> 3) The maximal interval for checking the previous node should rely on the actual failure detection timeout:
> {code:java}
> TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
> ...
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     ...
> }
> res.previousNodeAlive(ok);
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
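
The three suggestions in the description can be illustrated with a small standalone sketch. This is not the Ignite code or the actual patch: the class `BackwardCheckSketch` and all of its method names are hypothetical, the `timeoutMs` parameter stands in for a value derived from failureDetectionTimeout, and the window used in `previousNodeRecentlySeen` (exactly one failure detection timeout) is only one possible reading of suggestion 3. The values 500 ms and 10 000 ms mentioned in the comments are illustrative, not taken from the issue.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical illustration of the suggestions; not part of Ignite. */
final class BackwardCheckSketch {
    /** The fixed 300 ms term from the worst-case formula in the description. */
    static final long FIXED_OVERHEAD_MS = 300;

    private BackwardCheckSketch() {
        // Static helpers only.
    }

    /**
     * Suggestions 1 and 2: the connect timeout is a parameter instead of a
     * hardcoded 100 ms, and any I/O failure (refused connection, timeout,
     * unresolved host) is reported as a lost connection.
     *
     * @param addr Address of the node being checked.
     * @param timeoutMs Connect timeout in milliseconds; must be positive here,
     *      because Socket.connect() treats 0 as an infinite timeout.
     * @return {@code true} if the connection could not be established.
     */
    static boolean isConnectionLost(SocketAddress addr, int timeoutMs) {
        if (timeoutMs <= 0)
            throw new IllegalArgumentException("timeoutMs must be positive: " + timeoutMs);

        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false;
        }
        catch (IOException e) {
            // ConnectException, SocketTimeoutException and UnknownHostException
            // are all IOExceptions, so every failure lands here.
            return true;
        }
    }

    /**
     * Suggestion 3 (assumed interpretation): the previous node counts as
     * recently seen if its last message arrived within one failure detection
     * timeout, rather than within CON_CHECK_INTERVAL * 2.
     */
    static boolean previousNodeRecentlySeen(long rcvdTime, long now, long failureDetectionTimeoutMs) {
        return rcvdTime + failureDetectionTimeoutMs >= now;
    }

    /**
     * Worst-case detection delay from the description:
     * CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300 ms.
     * With illustrative values of 500 ms and 10 000 ms this gives 20 800 ms.
     */
    static long worstCaseDelayMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + FIXED_OVERHEAD_MS;
    }
}
```

Catching `IOException` as a whole keeps the check conservative: a node that accepts no connection within the timeout is treated the same way as one that actively refuses it, which is the behaviour suggestion 2 asks for.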