[jira] [Updated] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

Vladimir Steshin (Jira) Tue, 09 Jun 2020 02:44:18 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Description: 
Connection failure may not be detected within 
IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
The node ping routine is also duplicated.
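
As a worked example of the worst-case delay above, here is a minimal sketch. 
The 500 ms interval is the constant quoted in this issue; the 10 000 ms timeout 
and the class and method names are assumptions chosen only for illustration, 
not Ignite code.

```java
/** Hypothetical helper, not part of Ignite: evaluates the worst-case delay formula above. */
final class WorstCaseDetectionDelay {
    /** Value quoted in this issue for ServerImpl.CON_CHECK_INTERVAL, in milliseconds. */
    static final long CON_CHECK_INTERVAL_MS = 500;

    /** Worst-case delay: the connection check interval plus the configured timeout. */
    static long worstCaseDelayMs(long failureDetectionTimeoutMs) {
        return CON_CHECK_INTERVAL_MS + failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long failureDetectionTimeoutMs = 10_000; // assumed value, for illustration only
        System.out.println(worstCaseDelayMs(failureDetectionTimeoutMs));
    }
}
```

With these numbers, detection may take up to 500 ms longer than the configured 
timeout.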

We should fix the following:

1. The failure detection timeout should take into account the last sent 
message. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection.

2. Make the connection check interval depend on the failure detection timeout 
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

3. Remove the additional, accelerated connection checking. Once fix 1 is done, 
it will become even more useless.
Although TCP discovery has a connection check period, it may send a ping 
before this period elapses. This premature node ping relies on the time of any 
sent or even any received message.

4. Do not worry the user with “Node disconnected” when everything is OK. Once 
fixes 1 and 3 are done, this message will become even more useless.
A node may log at INFO: “Local node seems to be disconnected from topology …” 
when it is not actually disconnected at all.
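
Fix 1 can be sketched as follows: every sent discovery message refreshes a 
single timestamp, and both the explicit connection check and the failure 
decision are measured from it. This is only an illustration under stated 
assumptions. The class, its methods, and the explicit `nowMs` arguments are 
hypothetical and are not the ServerImpl implementation.

```java
/** Hypothetical helper, not part of Ignite: tracks the time of the last sent discovery message. */
final class ConnectionCheckTracker {
    private final long failureDetectionTimeoutMs;
    private final long checkIntervalMs;
    private long lastSentMs;

    ConnectionCheckTracker(long failureDetectionTimeoutMs, long checkIntervalMs, long nowMs) {
        if (failureDetectionTimeoutMs <= 0 || checkIntervalMs <= 0)
            throw new IllegalArgumentException("Timeout and interval must be positive");

        this.failureDetectionTimeoutMs = failureDetectionTimeoutMs;
        this.checkIntervalMs = checkIntervalMs;
        this.lastSentMs = nowMs;
    }

    /** Call after ANY discovery message is sent successfully, not only a connection check. */
    void onMessageSent(long nowMs) {
        lastSentMs = nowMs;
    }

    /** True when nothing was sent for a whole interval, so an explicit check message is due. */
    boolean connectionCheckDue(long nowMs) {
        return nowMs - lastSentMs >= checkIntervalMs;
    }

    /** True when no message was sent successfully within the failure detection timeout. */
    boolean failureDetected(long nowMs) {
        return nowMs - lastSentMs >= failureDetectionTimeoutMs;
    }
}
```

Passing the current time as an argument keeps the sketch deterministic; real 
code would read a monotonic clock instead.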

  was:
Node-to-next-node connection checking has several drawbacks that go together. 
These drawbacks hindered understanding and catching problems in IGNITE-13016.
We should fix the following:

1. The failure detection timeout should take into account the last sent 
message. The connection check interval should also rely on this time. If we 
set the timeout on the current message only, we have no guarantee that a 
connection failure is detected within the failure detection timeout.
The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and 
TcpDiscoveryConnectionCheckMessage is just an addition for when the message 
queue has been empty for a long time.

2. Make the connection check interval depend on the failure detection timeout 
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
Let's set it to FDT/4 so that enough of the timeout remains after the last 
sent message.

3. Remove the additional, accelerated connection checking. Once fix 1 is done, 
it will become even more useless.
Although TCP discovery has a connection check period, it may send a ping 
before this period elapses. This premature node ping relies on the time of any 
sent or even any received message. Imagine: if node 2 receives no message from 
node 1 within some time, it decides to send an extra ping to node 3 without 
waiting for the regular ping. Such behavior causes confusion and gives no 
considerable benefit. 
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}

4. Do not worry the user with “Node disconnected” when everything is OK. Once 
fixes 1 and 3 are done, this message will become even more useless.
A node may log at INFO: “Local node seems to be disconnected from topology …” 
when it is not actually disconnected at all.
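
The FDT/4 proposal in item 2 can be checked with simple arithmetic. The sketch 
below is a hypothetical helper, not Ignite code, and the 10 000 ms timeout is 
an assumed value: if the first explicit check goes out one interval after the 
last sent message, three quarters of the timeout are still left.

```java
/** Hypothetical helper, not part of Ignite: budget arithmetic for an FDT/4 check interval. */
final class CheckIntervalBudget {
    /** Connection check interval derived from the failure detection timeout as FDT/4. */
    static long checkIntervalMs(long fdtMs) {
        if (fdtMs <= 0)
            throw new IllegalArgumentException("fdtMs must be positive");

        return fdtMs / 4;
    }

    /** Part of the timeout left when the first check is sent one interval after the last message. */
    static long remainingAfterFirstCheckMs(long fdtMs) {
        return fdtMs - checkIntervalMs(fdtMs);
    }

    public static void main(String[] args) {
        long fdtMs = 10_000; // assumed value, for illustration only
        System.out.println(checkIntervalMs(fdtMs));
        System.out.println(remainingAfterFirstCheckMs(fdtMs));
    }
}
```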


> Fix failure detection timeout. Simplify node ping routine.
> ----------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is also duplicated.
> We should fix the following:
> 1. The failure detection timeout should take into account the last sent 
> message. The current ping is bound to its own timestamp:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection.
> 2. Make the connection check interval depend on the failure detection 
> timeout (FDT). The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, accelerated connection checking. Once fix 1 is 
> done, it will become even more useless.
> Although TCP discovery has a connection check period, it may send a ping 
> before this period elapses. This premature node ping relies on the time of 
> any sent or even any received message.
> 4. Do not worry the user with “Node disconnected” when everything is OK. 
> Once fixes 1 and 3 are done, this message will become even more useless.
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> when it is not actually disconnected at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
