[
https://issues.apache.org/jira/browse/IGNITE-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13014:
--------------------------------------
Description:
Proposal:
Do not check a failed node a second time. Double checking prolongs node
failure detection and gives no additional benefit. The routine is also messy
and relies on hardcoded values.
Currently, node availability is checked twice. Suppose node 2 stops
responding. Node 1 becomes unable to ping node 2 and asks node 3 to establish
a permanent connection in place of node 2. Node 3 may then check node 2 as
well, or it may not.
As a result, failure detection can take up to ServerImpl.CON_CHECK_INTERVAL +
2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
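To make the bound above concrete, here is a minimal sketch that only evaluates
the formula. The class and method names are hypothetical, and the input values
(500 ms for the connection check interval, 10 s for the failure detection
timeout) are illustrative assumptions, not values read from Ignite.

```java
public class WorstCaseDetectionDelay {
    /**
     * Evaluates the upper bound quoted in the description:
     * CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300 ms.
     * All arguments and the result are in milliseconds.
     */
    public static long worstCaseMillis(long conCheckIntervalMs,
                                       long failureDetectionTimeoutMs,
                                       long fixedExtraMs) {
        // The timeout is counted twice because the failed node may be
        // checked by two different nodes, one after the other.
        return conCheckIntervalMs + 2 * failureDetectionTimeoutMs + fixedExtraMs;
    }

    public static void main(String[] args) {
        // Assumed, illustrative inputs; substitute the values of your cluster.
        long conCheckIntervalMs = 500;
        long failureDetectionTimeoutMs = 10_000;
        long fixedExtraMs = 300;

        long worst = worstCaseMillis(conCheckIntervalMs,
                                     failureDetectionTimeoutMs,
                                     fixedExtraMs);
        System.out.println("Worst-case detection delay: " + worst + " ms");
        // prints "Worst-case detection delay: 20800 ms"
    }
}
```

With these assumed inputs the bound is roughly twice the configured timeout,
which is the delay the proposal aims to cut by dropping the second check.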
See:
'NodeFailureResearch.patch' - creates the test 'FailureDetectionResearch', which
emulates slow answers from a failed node and measures failure detection delays.
'NodeFailureResearch.txt' - results of the test.
'WostCaseStepByStep.txt' - description of how the worst case happens.
was:
Proposal:
Do not check a failed node a second time. Double checking prolongs node
failure detection and gives no additional benefit. The routine is also messy
and relies on hardcoded values.
Currently, node availability is checked twice. Suppose node 2 stops
responding. Node 1 becomes unable to ping node 2 and asks node 3 to establish
a permanent connection in place of node 2. Node 3 may then check node 2 as
well, or it may not.
As a result, failure detection can take up to ServerImpl.CON_CHECK_INTERVAL +
2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> Remove double checking of node availability.
> ---------------------------------------------
>
> Key: IGNITE-13014
> URL: https://issues.apache.org/jira/browse/IGNITE-13014
> Project: Ignite
> Issue Type: Improvement
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Attachments: NodeFailureResearch.patch, WostCaseStepByStep.txt
>
>
> Proposal:
> Do not check a failed node a second time. Double checking prolongs node
> failure detection and gives no additional benefit. The routine is also messy
> and relies on hardcoded values.
> Currently, node availability is checked twice. Suppose node 2 stops
> responding. Node 1 becomes unable to ping node 2 and asks node 3 to establish
> a permanent connection in place of node 2. Node 3 may then check node 2 as
> well, or it may not.
> As a result, failure detection can take up to ServerImpl.CON_CHECK_INTERVAL +
> 2 * IgniteConfiguration.failureDetectionTimeout + 300ms.
> See:
> 'NodeFailureResearch.patch' - creates the test 'FailureDetectionResearch', which
> emulates slow answers from a failed node and measures failure detection delays.
> 'NodeFailureResearch.txt' - results of the test.
> 'WostCaseStepByStep.txt' - description of how the worst case happens.