[
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13012:
--------------------------------------
Description:
Node-to-next-node connection checking has several related drawbacks.
These drawbacks hindered understanding and catching the problems in
IGNITE-13016.
We should fix the following:
1. Make the connection check interval depend on the failure detection timeout
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
Let's set it to FDT/2. The other half of the FDT is the timeout for the ping
message exchange.
2. Make the connection check interval rely on the common time of the last sent
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and
TcpDiscoveryConnectionCheckMessage is just an addition for when the message
queue has been empty for a long time.
3. Remove the additional, accelerated connection checking. Once fix 1 is done,
it will become even more useless.
Although TCP discovery has a connection check period, it may send a ping
before this period expires. This premature node ping relies on the time of any
sent or even any received message. Imagine: if node 2 receives no message from
node 1 within some time, it decides to send an extra ping to node 3 without
waiting for the regular ping. Such behavior causes confusion and gives no
considerable benefit.
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
4. Do not worry the user with “Node disconnected” when everything is OK. Once
fixes 1 and 3 are done, this message will become even more useless.
A node may log at INFO: “Local node seems to be disconnected from topology …”
whereas it is not actually disconnected at all.
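
Fixes 1 and 2 above can be sketched as follows. This is only a minimal
illustration under assumptions, not Ignite's actual ServerImpl code: the class
ConnCheckSketch and all of its methods are hypothetical names, and timestamps
are passed in explicitly so the logic is easy to follow. The interval is
derived as FDT/2, and a connection check becomes due only when no message of
any kind has been sent for that long.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of fixes 1 and 2; not part of Ignite. */
final class ConnCheckSketch {
    /** Connection check interval, derived from the failure detection timeout as FDT / 2. */
    private final long connCheckIntervalNanos;

    /** Time the last message of any kind was sent to the next node. */
    private long lastMsgSentNanos;

    ConnCheckSketch(long failureDetectionTimeoutMillis, long nowNanos) {
        if (failureDetectionTimeoutMillis <= 0)
            throw new IllegalArgumentException("FDT must be positive");

        // Fix 1: no hard-coded 500 ms; the other half of the FDT is left for the ping exchange.
        connCheckIntervalNanos = TimeUnit.MILLISECONDS.toNanos(failureDetectionTimeoutMillis) / 2;
        lastMsgSentNanos = nowNanos;
    }

    long connCheckIntervalMillis() {
        return TimeUnit.NANOSECONDS.toMillis(connCheckIntervalNanos);
    }

    /** Fix 2: every sent message, not only a dedicated ping, resets the timer. */
    void onMessageSent(long nowNanos) {
        lastMsgSentNanos = nowNanos;
    }

    /** A dedicated connection check is due only after a full idle interval. */
    boolean connectionCheckDue(long nowNanos) {
        return nowNanos - lastMsgSentNanos >= connCheckIntervalNanos;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        ConnCheckSketch check = new ConnCheckSketch(10_000L, now);

        System.out.println("Interval, ms: " + check.connCheckIntervalMillis());
        System.out.println("Check due right after start: " + check.connectionCheckDue(now));
    }
}
```

With a single shared timestamp there is no separate early-ping path, which is
the simplification item 3 asks for.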
was:
Node-to-next-node connection checking has several related drawbacks.
We should fix the following:
1. Make the connection check interval depend on the failure detection timeout
(FDT). The current value is a constant:
{code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
Let's set it to FDT/2. The other half of the FDT is the timeout for the ping
message exchange.
2. Make the connection check interval rely on the common time of the last sent
message of any kind. The current ping is bound to its own timestamp:
{code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
This is odd because any discovery message checks the connection, and
TcpDiscoveryConnectionCheckMessage is just an addition for when the message
queue has been empty for a long time.
3. Remove the additional, accelerated connection checking. Once fix 1 is done,
it will become even more useless.
Although TCP discovery has a connection check period, it may send a ping
before this period expires. This premature node ping relies on the time of any
sent or even received message. Imagine: if node 2 receives no message from node
1 within some time, it decides to send an extra ping to node 3 without waiting
for the regular ping. Such behavior causes confusion and gives no benefit.
See {code:java}ServerImpl.RingMessageWorker.failureThresholdReached{code}
4. Do not worry the user with “Node disconnected” when everything is OK. Once
fix 1 is done, this message will become even more useless. Fix 3 also resolves
this issue. If the extra check from item 3 triggers, the node logs at INFO:
“Local node seems to be disconnected from topology …” whereas it is not
actually disconnected at all.
> Make node connection checking rely on the configuration. Simplify node ping
> routine.
> ------------------------------------------------------------------------------------
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
> Issue Type: Improvement
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
> Time Spent: 10m
> Remaining Estimate: 0h