[ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13012:
--------------------------------------
    Description: 
Node-to-next-node connection checking has several drawbacks that go together. 
We should fix the following:

1) Make the connection check interval half of the actual failure detection 
timeout. The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

2) Make the connection check interval rely on the common time of the last sent 
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and 
TcpDiscoveryConnectionCheckMessage is just an extra message sent when the 
message queue has been empty for a long time.

3) Remove the additional, randomly occurring, accelerated connection checking. 
Once we do #1, it will become even more useless.
Although TCP discovery has a connection check period (see #1), it may send a 
ping before this period expires. This premature node ping relies on the time 
of any sent or even received message. Imagine: if node 2 receives no message 
from node 1 within some time, it decides to send an extra ping to node 3 
without waiting for the regular ping. Such behavior causes confusion and gives 
no benefit. 
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}

4) Do not worry the user with “Node disconnected” when everything is OK. Once 
we do #1, this message will become even more useless. Fixing #3 also fixes 
this issue.
If #3 happens, the node writes to the log at INFO level: “Local node seems to 
be disconnected from topology …” although it is not actually disconnected at 
all. The user can see this unexpected and worrying message if they set 
IgniteConfiguration.failureDetectionTimeout < 500ms.
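
Items #1 and #2 together can be illustrated with a small sketch. This is not 
the code of ServerImpl: the class ConnCheckSketch and its methods are 
hypothetical names invented for this example. It only assumes that the check 
interval is derived as half of the failure detection timeout and that a single 
timestamp is refreshed on every sent discovery message.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical illustration only; this is NOT Ignite's ServerImpl.
 * Shows a check interval derived from the failure detection timeout (#1)
 * and one shared "last sent" timestamp for all message kinds (#2).
 */
final class ConnCheckSketch {
    /** Half of the failure detection timeout, in nanoseconds. */
    private final long checkIntervalNanos;

    /** Time of the last sent message of ANY kind, in nanoseconds. */
    private long lastMsgSentNanos;

    ConnCheckSketch(long failureDetectionTimeoutMillis, long nowNanos) {
        if (failureDetectionTimeoutMillis <= 0)
            throw new IllegalArgumentException("Timeout must be positive");

        // #1: no fixed 500 ms constant, the interval follows the configuration.
        checkIntervalNanos =
            TimeUnit.MILLISECONDS.toNanos(failureDetectionTimeoutMillis) / 2;
        lastMsgSentNanos = nowNanos;
    }

    /** Returns the derived check interval in milliseconds. */
    long checkIntervalMillis() {
        return TimeUnit.NANOSECONDS.toMillis(checkIntervalNanos);
    }

    /** #2: call this after sending any discovery message, not only a ping. */
    void onMessageSent(long nowNanos) {
        lastMsgSentNanos = nowNanos;
    }

    /** True when nothing was sent for a whole interval, so a ping is due. */
    boolean connCheckDue(long nowNanos) {
        return nowNanos - lastMsgSentNanos >= checkIntervalNanos;
    }
}
```

With a failure detection timeout of 10 000 ms this sketch yields a 5 000 ms 
interval, and any ordinary message sent within that interval postpones the 
dedicated connection check message.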

  was:
Node-to-next-node connection checking has several drawbacks that go together. 
We should fix the following:

1) First things first, make the connection check interval predictable and 
dependent on the failureDetectionTimeout or similar params. The current value 
is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}

2) Make the connection check interval rely on the common time of the last sent 
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message does actually check the connection, 
and TcpDiscoveryConnectionCheckMessage is just an extra message sent when the 
message queue has been empty for a long time.

3) Remove the additional, randomly occurring, accelerated connection checking. 
Once we do #1, it will become even more useless.
Although we have a connection check period (see #1), we can also send a ping 
before the period expires. This premature node ping relies on the time of any 
sent or even received message. Imagine: if node 2 receives no message from node 
1 within some time, it decides to send an extra ping to node 3 without waiting 
for the regular ping. This happens quite randomly. Such behavior causes 
confusion and gives no benefit. 

4) Do not worry the user with “Node disconnected” when everything is OK. Once 
we do #1, this message will become even more useless.
If #3 happens, the node writes to the log at INFO level: “Local node seems to 
be disconnected from topology …” although it is not actually disconnected at 
all. The user can see this unexpected and worrying message if they set 
failureDetectionTimeout < 500ms.


> Make node connection checking rely on the configuration. Simplify node ping 
> routine.
> ------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Node-to-next-node connection checking has several drawbacks that go 
> together. We should fix the following:
> 1) Make the connection check interval half of the actual failure detection 
> timeout. The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 2) Make the connection check interval rely on the common time of the last 
> sent message of any kind. The current ping is bound to its own timestamp:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection, and 
> TcpDiscoveryConnectionCheckMessage is just an extra message sent when the 
> message queue has been empty for a long time.
> 3) Remove the additional, randomly occurring, accelerated connection 
> checking. Once we do #1, it will become even more useless.
> Although TCP discovery has a connection check period (see #1), it may send a 
> ping before this period expires. This premature node ping relies on the 
> time of any sent or even received message. Imagine: if node 2 receives no 
> message from node 1 within some time, it decides to send an extra ping to 
> node 3 without waiting for the regular ping. Such behavior causes confusion 
> and gives no benefit. 
> See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
> 4) Do not worry the user with “Node disconnected” when everything is OK. 
> Once we do #1, this message will become even more useless. Fixing #3 also 
> fixes this issue.
> If #3 happens, the node writes to the log at INFO level: “Local node seems 
> to be disconnected from topology …” although it is not actually disconnected 
> at all. The user can see this unexpected and worrying message if they set 
> IgniteConfiguration.failureDetectionTimeout < 500ms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)