[
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13012:
--------------------------------------
Description:
A connection failure may not be detected within
IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is:
ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout.
In addition, the node ping routine is duplicated.
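To make the worst-case formula concrete, here is a minimal arithmetic sketch. The class and method names are hypothetical and not part of Ignite; the 500 ms interval is the constant quoted in this issue, and the 10 000 ms timeout is only an assumed example value.

```java
// Hypothetical helper, not Ignite code: illustrates the worst-case formula only.
final class WorstCaseDelay {
    /** Worst-case detection delay: connection check interval plus failure detection timeout. */
    static long worstCaseDelayMs(long conCheckIntervalMs, long failureDetectionTimeoutMs) {
        return conCheckIntervalMs + failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long conCheckIntervalMs = 500;           // CON_CHECK_INTERVAL as quoted in the issue
        long failureDetectionTimeoutMs = 10_000; // assumed example value

        // The failure may go unnoticed for longer than the configured timeout.
        System.out.println(worstCaseDelayMs(conCheckIntervalMs, failureDetectionTimeoutMs));
    }
}
```

With these inputs the delay exceeds the configured timeout by the full check interval.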
We should fix the following:
1. Failure detection timeout should take into account the last sent message.
The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection.
2. Make the connection check interval depend on the failure detection timeout
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
3. Remove the additional, accelerated connection check. Once fix 1 is done,
it becomes even less useful.
Although TCP discovery has a connection check period, it may send a ping
before this period expires. This premature node ping relies on the time of any
sent or even any received message.
4. Do not alarm the user with “Node disconnected” when everything is OK. Once
fixes 1 and 3 are done, this message becomes even less useful.
A node may log at INFO level: “Local node seems to be disconnected from
topology …” whereas it is not actually disconnected at all.
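One possible shape for fix 1 is sketched below: a single timestamp that is updated on any sent discovery message, so that a dedicated connection check is due only after the link has been idle for the whole check interval. This is an illustration under assumptions, not the actual ServerImpl change; the class and method names are hypothetical.

```java
// Hypothetical sketch, not Ignite code: one "last sent" timestamp for all messages.
final class LastSentTracker {
    private final long checkIntervalMs;
    private long lastSentMs;

    LastSentTracker(long checkIntervalMs, long nowMs) {
        this.checkIntervalMs = checkIntervalMs;
        this.lastSentMs = nowMs;
    }

    /** Called for every message sent to the next node, not only for dedicated pings. */
    void onAnyMessageSent(long nowMs) {
        lastSentMs = nowMs;
    }

    /** A dedicated connection check is due only if nothing was sent for a full interval. */
    boolean connectionCheckDue(long nowMs) {
        return nowMs - lastSentMs >= checkIntervalMs;
    }
}
```

Timestamps are passed in explicitly here so the logic can be exercised without a real clock.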
was:
Node-to-next-node connection checking has several drawbacks which go together.
These drawbacks hindered understanding and catching problems in IGNITE-13016.
We should fix the following:
1. Failure detection timeout should take into account the last sent message.
The connection check interval should also rely on this time. If we set a
timeout on the current message only, we have no guarantee that a connection
failure is detected within the failure detection timeout.
The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and
TcpDiscoveryConnectionCheckMessage is just an addition for when the message
queue is empty for a long time.
2. Make the connection check interval depend on the failure detection timeout
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
Let's set it to FDT/4 so that enough of the timeout remains after the last sent
message.
3. Remove the additional, accelerated connection check. Once fix 1 is done,
it becomes even less useful.
Although TCP discovery has a connection check period, it may send a ping
before this period expires. This premature node ping relies on the time of any
sent or even any received message. Imagine: if node 2 receives no message from
node 1 within some time, it decides to send an extra ping to node 3 without
waiting for the regular ping. Such behavior causes confusion and gives no
considerable benefit.
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
4. Do not alarm the user with “Node disconnected” when everything is OK. Once
fixes 1 and 3 are done, this message becomes even less useful.
A node may log at INFO level: “Local node seems to be disconnected from
topology …” whereas it is not actually disconnected at all.
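The FDT/4 proposal from item 2 of the previous description can be sketched as follows. The names are hypothetical, and reading "enough timeout time" as "the part of FDT left after one idle interval" is an assumption made for illustration, not something the issue states.

```java
// Hypothetical sketch, not Ignite code: derive the check interval from FDT.
final class ConnCheckInterval {
    /** Proposed interval: a quarter of the failure detection timeout. */
    static long intervalMs(long failureDetectionTimeoutMs) {
        return failureDetectionTimeoutMs / 4;
    }

    /** Assumed interpretation: time left for the check after the link was idle for one interval. */
    static long remainingAfterIdleMs(long failureDetectionTimeoutMs) {
        return failureDetectionTimeoutMs - intervalMs(failureDetectionTimeoutMs);
    }

    public static void main(String[] args) {
        long fdtMs = 10_000; // assumed example value
        System.out.println(intervalMs(fdtMs) + " ms interval, "
            + remainingAfterIdleMs(fdtMs) + " ms remaining");
    }
}
```

Under this reading, the idle interval and the remaining check time together never exceed FDT, in contrast to the fixed 500 ms constant that is added on top of it.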
> Fix failure detection timeout. Simplify node ping routine.
> ----------------------------------------------------------
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
> Issue Type: Improvement
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Time Spent: 2h
> Remaining Estimate: 0h