[
https://issues.apache.org/jira/browse/IGNITE-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin closed IGNITE-13111.
-------------------------------------
> Simplify backward checking of node connection.
> ----------------------------------------------
>
> Key: IGNITE-13111
> URL: https://issues.apache.org/jira/browse/IGNITE-13111
> Project: Ignite
> Issue Type: Improvement
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Attachments: FailureDetectionResearch.patch,
> FailureDetectionResearch.txt, FailureDetectionResearch_fixed.txt,
> WostCaseStepByStep.txt
>
>
> We should fix several drawbacks in the backward checking of a failed node. They
> prolong node failure detection up to:
> ServerImpl.CON_CHECK_INTERVAL + 2 *
> IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> * '_NodeFailureResearch.patch_'. It adds the test 'FailureDetectionResearch',
> which emulates slow responses from a failed node and measures failure detection
> delays.
> * '_FailureDetectionResearch.txt_' - results of the test.
> * '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
> * '_WostCaseStepByStep.txt_' - description of how the worst case happens.
> *Suggestion:*
> 1) We can simplify backward connection checking as we implement IGNITE-13012.
> Once we get a robust, predictable connection ping, we don't need to check the
> previous node, because we can see whether it sent a ping to the current node
> within the failure detection timeout. If not, the previous node can be
> considered lost. Instead of:
> {code:java}
> // Node cannot connect to its next (for local node it's previous).
> // Need to check connectivity to it.
> long rcvdTime = lastRingMsgReceivedTime;
> long now = U.currentTimeMillis();
> // We got message from previous in less than double connection check interval.
> boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
> TcpDiscoveryNode previous = null;
> if (ok) {
>     // Check case when previous node suddenly died. This will speed up
>     // node failing.
>     // ... checking connection to previous node ...
> }
> {code}
> we could wait for a ping from the previous node. Scenario:
> * n1 (Node1) fails to connect to n2.
> * n1 asks n3 to establish a connection instead of n2.
> * n3 waits for a ping from n2 for the rest of the failure detection timeout.
> * If n3 receives no ping from n2 in that time, it connects with n1. Otherwise
> it answers n1 that n2 is considered alive.
> 2) Then it seems we can remove:
> {code:java}
> ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
> {code}
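The scenario in suggestion 1 could be sketched roughly as below. This is only an illustration, not Ignite code: the class `PreviousNodeCheck`, the enum `Answer`, and both methods are hypothetical names, and timestamps are passed in as plain millisecond values instead of being read from `lastRingMsgReceivedTime`. It assumes the reading that follows from point 1: if no ping from n2 arrives before the failure detection timeout expires, n2 is considered lost and n3 accepts the connection from n1; if a ping does arrive, n3 answers that n2 is alive.

```java
/**
 * Hypothetical sketch (not part of Ignite): how n3 might answer n1's
 * connection request based on the last ping it received from n2.
 */
final class PreviousNodeCheck {
    /** Possible reactions of n3 to the connection request from n1. */
    enum Answer { ACCEPT_CONNECTION, PREVIOUS_ALIVE, KEEP_WAITING }

    private PreviousNodeCheck() {
        // Static helpers only.
    }

    /**
     * Time in milliseconds that n3 still has to wait for a ping from n2,
     * i.e. the rest of the failure detection timeout. Never negative.
     */
    static long remainingWaitMs(long lastPingMs, long nowMs, long timeoutMs) {
        long elapsed = nowMs - lastPingMs;
        return Math.max(0L, timeoutMs - elapsed);
    }

    /**
     * @param requestMs  time when n1 asked n3 to connect instead of n2
     * @param lastPingMs time of the last ping n3 received from n2
     * @param nowMs      current time
     * @param timeoutMs  failure detection timeout
     */
    static Answer decide(long requestMs, long lastPingMs, long nowMs, long timeoutMs) {
        // A ping that arrived at or after the request shows n2 is still alive.
        if (lastPingMs >= requestMs)
            return Answer.PREVIOUS_ALIVE;

        // No ping within the whole timeout: n2 is considered lost.
        if (remainingWaitMs(lastPingMs, nowMs, timeoutMs) == 0L)
            return Answer.ACCEPT_CONNECTION;

        // Still inside the timeout window: keep waiting for a ping.
        return Answer.KEEP_WAITING;
    }
}
```

With this shape n3 needs no extra socket connection to n2; it only compares timestamps it already tracks, which is the simplification the issue argues for.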