[ https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-13111:
--------------------------------------
    Attachment: FailureDetectionResearch_fixed.txt

> Simplify backward checking of node connection.
> ----------------------------------------------
>
>                 Key: IGNITE-13111
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13111
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>         Attachments: FailureDetectionResearch.patch, FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt, WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates slow responses from a failed node and measures failure detection delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking once we implement IGNITE-13012. Once we have a robust, predictable connection ping, we don't need to check the previous node, because we can see whether it sent a ping to the current node within the failure detection timeout. If not, the previous node can be considered lost.
> Instead of:
> {code:java}
> // Node cannot connect to its next (for local node it's previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
> TcpDiscoveryNode previous = null;
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     // ... checking connection to previous node ...
> }
> {code}
> 2) Then, it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}
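The suggestion in 1) can be illustrated with a minimal sketch. It is not the actual Ignite change: the class `PreviousNodeLiveness` and its methods are hypothetical names invented for this example, and the values passed in `main` (500 ms for the connection check interval, 10 000 ms for the failure detection timeout) are assumed purely for illustration. The idea is only that the local node records when the previous node last pinged it and treats the previous node as lost once that ping is older than the failure detection timeout, with no active backward connection check. The helper `worstCaseDelayMs` simply evaluates the worst-case formula quoted in the issue.

```java
/** Hypothetical helper, not part of Ignite: tracks the last ping from the previous ring node. */
final class PreviousNodeLiveness {
    /** Assumed to mirror IgniteConfiguration.failureDetectionTimeout, in milliseconds. */
    private final long failureDetectionTimeoutMs;

    /** Timestamp (ms) of the last ping received from the previous node. */
    private volatile long lastPingRcvdMs;

    PreviousNodeLiveness(long failureDetectionTimeoutMs, long nowMs) {
        this.failureDetectionTimeoutMs = failureDetectionTimeoutMs;
        this.lastPingRcvdMs = nowMs; // Treat creation time as the first "ping".
    }

    /** Call whenever a ping from the previous node arrives. */
    void onPingReceived(long nowMs) {
        lastPingRcvdMs = nowMs;
    }

    /** The previous node counts as alive if it pinged us within the timeout. */
    boolean isPreviousAlive(long nowMs) {
        return nowMs - lastPingRcvdMs <= failureDetectionTimeoutMs;
    }

    /** Worst-case delay from the issue text: check interval + 2 * timeout + 300 ms. */
    static long worstCaseDelayMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + 300;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumptions, not taken from the issue).
        PreviousNodeLiveness liveness = new PreviousNodeLiveness(10_000L, 0L);
        liveness.onPingReceived(2_000L);

        System.out.println(liveness.isPreviousAlive(11_000L)); // true: 9000 ms since last ping
        System.out.println(liveness.isPreviousAlive(12_001L)); // false: 10001 ms since last ping
        System.out.println(worstCaseDelayMs(500L, 10_000L));   // 20800
    }
}
```

With these assumed values, the current worst case works out to roughly 20.8 seconds, about twice the configured timeout, which is the delay the issue aims to shorten.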