[ https://issues.apache.org/jira/browse/IGNITE-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13016:
--------------------------------------
    Attachment: FailureDetectionResearch.patch

> Fix backward checking of failed node.
> -------------------------------------
>
>                 Key: IGNITE-13016
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13016
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: FailureDetectionResearch.patch, WorstCase_log.log, WostCase.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should fix several drawbacks in the backward checking of a failed node.
> They prolong node failure detection up to
> ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See 'WostCase.txt'.
>
> Suggestions:
>
> 1) We should replace the hardcoded 100ms timeout with a parameter like
> failureDetectionTimeout:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     sock.connect(addr, 100); // Make it not a constant.
>     ...
> }
> {code}
>
> 2) Any negative result of the connection check should be treated as a node
> failure. Currently, we look only at a refused connection. Any other
> exception, including a timeout, is treated as a live connection:
> {code:java}
> private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
>     ...
>     catch (ConnectException e) {
>         return true;
>     }
>     catch (IOException e) {
>         return false; // Make any error mean lost connection.
>     }
>     return false;
> }
> {code}
>
> 3) The maximum interval for checking the previous node should be based on the
> actual failure detection timeout:
> {code:java}
> TcpDiscoveryHandshakeResponse res = new TcpDiscoveryHandshakeResponse(...);
> ...
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now; // Here should be a timeout of failure detection.
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     ...
> }
> res.previousNodeAlive(ok);
> {code}
>
> 4) Remove the hardcoded 200ms sleep when marking the previous node alive:
> {code:java}
> ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive(){
>     ...
>     try {
>         Thread.sleep(200);
>     }
>     catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>     }
>     ...
> }
> {code}
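
For illustration only, suggestions 1 and 2 and the worst-case bound quoted in the description could be sketched as below. This is not the attached patch and not the actual ServerImpl code. The class ConnectionCheckSketch, the methods isConnectionLost and worstCaseDetectionMs, and the values 500 and 10000 used in main are hypothetical names and inputs chosen for the example. Reading the 300ms term as the 100ms connect timeout plus the 200ms sleep is an assumption drawn from the numbers in the description.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical sketch; not part of Ignite's ServerImpl. */
public class ConnectionCheckSketch {
    /** Assumed to be the hardcoded 100 ms connect timeout plus the 200 ms sleep. */
    public static final long HARDCODED_DELAYS_MS = 300;

    /** Worst-case detection delay, following the formula quoted in the issue. */
    public static long worstCaseDetectionMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + HARDCODED_DELAYS_MS;
    }

    /**
     * Returns true if no connection to addr can be established within timeoutMs.
     * The timeout is a parameter (suggestion 1), and any IOException, including
     * a refused connection or a timeout, counts as a lost connection (suggestion 2).
     */
    public static boolean isConnectionLost(SocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false;
        }
        catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Example inputs only; substitute the values of the actual deployment.
        System.out.println("Worst case, ms: " + worstCaseDetectionMs(500, 10_000));

        // Bind a listener to an ephemeral loopback port so the check has a live target.
        ServerSocket srv = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        SocketAddress addr = srv.getLocalSocketAddress();

        System.out.println("Lost while listening: " + isConnectionLost(addr, 1000));

        srv.close();

        System.out.println("Lost after close: " + isConnectionLost(addr, 1000));
    }
}
```

Passing the timeout in as a parameter lets the caller derive it from the configured failure detection timeout instead of a constant. Collapsing both catch branches into one means a timed-out connect is reported the same way as a refused one, which is the behavior change proposed in suggestion 2.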