[
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13012:
--------------------------------------
Description:
Node-to-next-node connection checking has several related drawbacks.
We should fix the following:
1) Make the connection check interval half of the actual failure detection
timeout. The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
2) Make the connection check interval depend on the time the last message of
any kind was sent. Currently the ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and
TcpDiscoveryConnectionCheckMessage is only an addition for when the message
queue has been empty for a long time.
3) Remove the additional, randomly occurring, accelerated connection check.
Once we do #1, it will become even more useless.
Although TCP discovery has a connection check period (see #1), it may send a
ping before this period expires. This premature ping depends on the time of
any sent or even received message. Imagine: if node 2 receives no message
from node 1 for some time, it decides to send an extra ping to node 3 without
waiting for the regular ping. Such behavior causes confusion and brings no
benefit.
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
4) Do not worry the user with “Node disconnected” when everything is OK. Once
we do #1, this message will become even more useless. Fixing #3 also fixes
this issue.
If #3 happens, the node logs at INFO: “Local node seems to be
disconnected from topology …” although it is not actually disconnected.
The user can see this unexpected and worrying message if
IgniteConfiguration.failureDetectionTimeout is set below 500 ms.
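Items 1 and 2 could be sketched roughly as follows. This is only an
illustration of the proposed rule, not the actual ServerImpl code: the class
ConnCheckSketch and all of its members are hypothetical names, and time is
passed in explicitly as milliseconds to keep the example self-contained.

```java
/** Hypothetical sketch of items 1 and 2; these names do not exist in Ignite. */
final class ConnCheckSketch {
    /** Check interval derived from the configured timeout (item 1). */
    private final long intervalMs;

    /** Time the last message of ANY kind was sent to the next node (item 2). */
    private long lastSentMsgTimeMs;

    ConnCheckSketch(long failureDetectionTimeoutMs, long nowMs) {
        this.intervalMs = connCheckInterval(failureDetectionTimeoutMs);
        this.lastSentMsgTimeMs = nowMs;
    }

    /** Item 1: the interval is half of the failure detection timeout. */
    static long connCheckInterval(long failureDetectionTimeoutMs) {
        return failureDetectionTimeoutMs / 2;
    }

    /** Called for every sent discovery message, not only for dedicated pings. */
    void onMessageSent(long nowMs) {
        lastSentMsgTimeMs = nowMs;
    }

    /** A dedicated check message is needed only after a full silent interval. */
    boolean connCheckNeeded(long nowMs) {
        return nowMs - lastSentMsgTimeMs >= intervalMs;
    }
}
```

With a single shared timestamp, regular discovery traffic postpones the
dedicated check message, so it is sent only when nothing else has been sent
for a whole interval.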
was:
Node-to-next-node connection checking has several related drawbacks.
We should fix the following:
1) First things first, make the connection check interval predictable and
dependent on failureDetectionTimeout or similar parameters. The current value
is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
2) Make the connection check interval depend on the time the last message of
any kind was sent. Currently the ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message does actually check the connection,
and TcpDiscoveryConnectionCheckMessage is only an addition for when the
message queue has been empty for a long time.
3) Remove the additional, randomly occurring, accelerated connection check.
Once we do #1, it will become even more useless.
Although we have a connection check period (see #1), we can also send a ping
before the period expires. This premature ping depends on the time of any
sent or even received message. Imagine: if node 2 receives no message from node
1 for some time, it decides to send an extra ping to node 3 without waiting
for the regular ping. This happens quite randomly. Such behavior causes
confusion and brings no benefit.
4) Do not worry the user with “Node disconnected” when everything is OK. Once
we do #1, this message will become even more useless.
If #3 happens, the node logs at INFO: “Local node seems to be
disconnected from topology …” although it is not actually disconnected.
The user can see this unexpected and worrying message if
failureDetectionTimeout is set below 500 ms.
> Make node connection checking rely on the configuration. Simplify node ping
> routine.
> ------------------------------------------------------------------------------------
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
> Issue Type: Improvement
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Node-to-next-node connection checking has several related drawbacks.
> We should fix the following:
> 1) Make the connection check interval half of the actual failure detection
> timeout. The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 2) Make the connection check interval depend on the time the last message of
> any kind was sent. Currently the ping is bound to its own timestamp:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection, and
> TcpDiscoveryConnectionCheckMessage is only an addition for when the message
> queue has been empty for a long time.
> 3) Remove the additional, randomly occurring, accelerated connection check.
> Once we do #1, it will become even more useless.
> Although TCP discovery has a connection check period (see #1), it may send a
> ping before this period expires. This premature ping depends on the time of
> any sent or even received message. Imagine: if node 2 receives no message
> from node 1 for some time, it decides to send an extra ping to node 3 without
> waiting for the regular ping. Such behavior causes confusion and brings no
> benefit.
> See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
> 4) Do not worry the user with “Node disconnected” when everything is OK. Once
> we do #1, this message will become even more useless. Fixing #3 also fixes
> this issue.
> If #3 happens, the node logs at INFO: “Local node seems to be
> disconnected from topology …” although it is not actually disconnected.
> The user can see this unexpected and worrying message if
> IgniteConfiguration.failureDetectionTimeout is set below 500 ms.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)