[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Description: 
Node-to-next-node connection checking has several drawbacks that are related to 
each other. We should fix the following:

1. Make the connection check interval depend on the failure detection timeout (FDT). 
The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
Let's set it to FDT/2. The other half of the FDT is the timeout for the ping 
message exchange.

2. Make the connection check interval rely on the common time of the last sent 
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and 
TcpDiscoveryConnectionCheckMessage is just a supplement for when the message 
queue is empty for a long time.

3. Remove the additional, quickened connection checking. Once we do fix 1, this 
will become even more useless.
Although TCP discovery has a connection check period, it may send a ping 
before this period expires. This premature node ping relies on the time of any 
sent or even received message. Imagine: if node 2 receives no message from node 
1 within some time, it decides to send an extra ping to node 3 without waiting 
for the regular ping. Such behavior is confusing and gives no benefit. 
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}

4. Do not worry the user with “Node disconnected” when everything is OK. Once we 
do fix 1, this message will become even more useless. Fix 3 also fixes this issue.
If 3 happens, the node logs at INFO: “Local node seems to be disconnected from 
topology …”, whereas it is not actually disconnected at all.
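
Fixes 1 and 2 above can be sketched roughly as follows. This is only an illustration of the intended arithmetic, not the actual ServerImpl code: the class ConnectionCheckSketch and all of its method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
/** Hypothetical sketch of fixes 1 and 2; not Ignite's actual ServerImpl code. */
final class ConnectionCheckSketch {
    /** Failure detection timeout (FDT) in milliseconds, as configured by the user. */
    private final long failureDetectionTimeoutMs;

    /** Time (ms) when any discovery message was last sent to the next node. */
    private long lastMsgSentMs;

    ConnectionCheckSketch(long failureDetectionTimeoutMs, long nowMs) {
        if (failureDetectionTimeoutMs <= 0)
            throw new IllegalArgumentException("FDT must be positive");

        this.failureDetectionTimeoutMs = failureDetectionTimeoutMs;
        this.lastMsgSentMs = nowMs;
    }

    /** Fix 1: the interval is FDT/2; the other half is left for the ping exchange. */
    long connCheckIntervalMs() {
        return failureDetectionTimeoutMs / 2;
    }

    /** Fix 2: every sent discovery message counts as a connection check. */
    void onMessageSent(long nowMs) {
        lastMsgSentMs = nowMs;
    }

    /** A dedicated check message is needed only if nothing was sent for a whole interval. */
    boolean needConnectionCheck(long nowMs) {
        return nowMs - lastMsgSentMs >= connCheckIntervalMs();
    }
}
```

With an FDT of 10000 ms, for example, this sketch yields a check interval of 5000 ms instead of the fixed 500 ms, and any regular discovery message sent in the meantime postpones the dedicated check message.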

  was:
Node-to-next-node connection checking has several drawbacks that are related to 
each other. We should fix the following:

1) Make the connection check interval half of the actual failure detection timeout. 
The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

2) Make the connection check interval rely on the common time of the last sent 
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and 
TcpDiscoveryConnectionCheckMessage is just a supplement for when the message 
queue is empty for a long time.

3) Remove the additional, randomly appearing, quickened connection checking. 
Once we do #1, this will become even more useless.
Although TCP discovery has a connection check period (see #1), it may send a 
ping before this period expires. This premature node ping relies on the time 
of any sent or even received message. Imagine: if node 2 receives no message 
from node 1 within some time, it decides to send an extra ping to node 3 without 
waiting for the regular ping. Such behavior is confusing and gives no benefit. 
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}

4) Do not worry the user with “Node disconnected” when everything is OK. Once we 
do #1, this message will become even more useless. Fixing #3 also fixes this issue.
If #3 happens, the node writes to the log at INFO: “Local node seems to be 
disconnected from topology …”, whereas it is not actually disconnected at all. 
The user can see this unexpected and worrying message if they set 
IgniteConfiguration.failureDetectionTimeout < 500ms.


> Make node connection checking rely on the configuration. Simplify node ping 
> routine.
> ------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
