[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Description: 
A connection failure may not be detected within 
IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
The node ping routine is also duplicated.
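
As a worked illustration of the worst case above, the following minimal sketch simply adds the two values. It assumes the constant 500 quoted in this issue is in milliseconds and uses a failure detection timeout of 10 000 ms purely as an example input; the class and method names are hypothetical and are not part of Ignite.

```java
/** Hypothetical illustration only; not Ignite code. */
class WorstCaseDelay {
    /** Connection check interval quoted in this issue (assumed to be milliseconds). */
    static final long CON_CHECK_INTERVAL = 500;

    /** Worst-case detection delay as described in the issue: interval plus timeout. */
    static long worstCaseDelay(long failureDetectionTimeout) {
        return CON_CHECK_INTERVAL + failureDetectionTimeout;
    }

    public static void main(String[] args) {
        long failureDetectionTimeout = 10_000; // example value, an assumption
        System.out.println(worstCaseDelay(failureDetectionTimeout) + " ms");
    }
}
```

With these assumed inputs the failure may go undetected for 10 500 ms, that is, 500 ms longer than the configured timeout.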

We should fix:

1. The failure detection timeout should take into account the last sent 
message. Currently, the ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd, because any discovery message checks the connection. 

2. Make the connection check interval depend on the failure detection timeout 
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

3. Remove the additional, quickened connection check. Once fix 1 is done, it 
will become even more useless.
Although TCP discovery has a connection check period, it may send a ping 
before this period elapses. This premature node ping relies on the time of any 
sent or even any received message. 

4. Do not worry the user with “Node seems disconnected” when everything is OK. 
Once fixes 1 and 3 are done, this message will become even more useless. 
A node may log at INFO: “Local node seems to be disconnected from topology …” 
when it is not actually disconnected at all.
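
Fixes 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Ignite change: the class ConnectionCheckSketch, its methods, and the divisor of 4 used to derive the interval from the timeout are all hypothetical.

```java
/** Hypothetical helper illustrating fixes 1 and 2; not part of Ignite. */
class ConnectionCheckSketch {
    /** Check interval derived from the failure detection timeout (fix 2). */
    private final long connCheckIntervalMs;

    /** Time of the last sent message of any kind, not only of the last ping (fix 1). */
    private long lastMsgSentMs;

    ConnectionCheckSketch(long failureDetectionTimeoutMs, long nowMs) {
        // The divisor 4 is an arbitrary assumption chosen for the example.
        this.connCheckIntervalMs = Math.max(1, failureDetectionTimeoutMs / 4);
        this.lastMsgSentMs = nowMs;
    }

    /** Call after any discovery message has been sent to the next node. */
    void onMessageSent(long nowMs) {
        lastMsgSentMs = nowMs;
    }

    /** A dedicated ping is needed only if nothing was sent for a whole interval. */
    boolean pingRequired(long nowMs) {
        return nowMs - lastMsgSentMs >= connCheckIntervalMs;
    }

    long connCheckIntervalMs() {
        return connCheckIntervalMs;
    }
}
```

In this sketch any sent message postpones the next ping, so a separate timestamp for connection check messages is no longer needed, and the interval scales with the configured timeout instead of staying fixed.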

  was:
A connection failure may not be detected within 
IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is: 
ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
The node ping routine is also duplicated.

We should fix:

1. The failure detection timeout should take into account the last sent 
message. Currently, the ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd, because any discovery message checks the connection. 

2. Make the connection check interval depend on the failure detection timeout 
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

3. Remove the additional, quickened connection check. Once fix 1 is done, it 
will become even more useless.
Although TCP discovery has a connection check period, it may send a ping 
before this period elapses. This premature node ping relies on the time of any 
sent or even any received message. 

4. Do not worry the user with “Node disconnected” when everything is OK. Once 
fixes 1 and 3 are done, this message will become even more useless. 
A node may log at INFO: “Local node seems to be disconnected from topology …” 
when it is not actually disconnected at all.


> Fix failure detection timeout. Simplify node ping routine.
> ----------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>          Time Spent: 2h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
