[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118541#comment-17118541 ]
Ignite TC Bot commented on IGNITE-13016:
----------------------------------------

{panel:title=Branch: [pull/7838/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=5339425&buildTypeId=IgniteTests24Java8_RunAll]

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.txt,
> FailureDetectionResearch_fixed.txt, NodeFailureResearch.patch,
> WostCaseStepByStep.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node.
> They prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 *
> IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch',
> which emulates long answers from a failed node and measures failure detection
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestions:*
> 1) We should replace the hardcoded 100ms timeout with a parameter like
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     sock.connect(addr, 100); // Make it rely on failureDetectionTimeout.
>     ...
> }
> {code}
> 2) Any negative result of the connection check should be treated as a
> node failure. Currently, we look only at refused connections.
> Any other exceptions, including a timeout, are treated as a live connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     catch (ConnectException e) {
>         return true;
>     }
>     catch (IOException e) {
>         return false; // Make any error mean lost connection.
>     }
>     return false;
> }
> {code}
> 3) The maximal interval for checking the previous node should rely on the
> actual failure detection timeout:
> {code:java}
> TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
> ...
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     ...
> }
> res.previousNodeAlive(ok);
> {code}
> 4) Remove the hardcoded sleep of 200ms when marking the previous node alive:
> {code:java}
> ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive(){
>     ...
>     try {
>         Thread.sleep(200);
>     }
>     catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>     }
>     ...
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
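
For readers of the quoted description, suggestions 1 to 3 and the worst-case bound can be illustrated with a small standalone sketch. This is not the Ignite patch: the class BackwardCheckSketch and its methods isConnectionLost, heardFromRecently and worstCaseDelayMs are hypothetical names introduced here, and the sketch assumes the caller passes in a timeout derived from failureDetectionTimeout.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical illustration only; none of these names exist in Ignite. */
final class BackwardCheckSketch {
    /**
     * Suggestions 1 and 2: the connect timeout is a parameter instead of a
     * hardcoded 100 ms, and any I/O error (refused connection, timeout,
     * unresolved host, ...) is reported as a lost connection.
     *
     * @param timeoutMs connect timeout in milliseconds, must be positive
     *                  (0 would mean "wait forever" for Socket.connect).
     */
    static boolean isConnectionLost(SocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false; // Connected: the node still accepts connections.
        }
        catch (IOException e) {
            return true; // Any negative result means the connection is lost.
        }
    }

    /**
     * Suggestion 3: the previous node counts as recently heard from only if
     * its last message arrived within the given window. The window is assumed
     * to be derived from the failure detection timeout rather than from
     * CON_CHECK_INTERVAL * 2.
     */
    static boolean heardFromRecently(long rcvdTime, long now, long windowMs) {
        return rcvdTime + windowMs >= now;
    }

    /**
     * Worst-case detection delay quoted in the description:
     * CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300 ms.
     */
    static long worstCaseDelayMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + 300;
    }
}
```

With example inputs of 500 ms for the connection check interval and 10000 ms for the failure detection timeout (illustrative values, not taken from the issue), worstCaseDelayMs(500, 10_000) evaluates to 20800 ms, roughly twice the configured timeout. That gap is the delay the four suggestions aim to remove.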