[
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13012:
--------------------------------------
Labels: iep-45 (was: )
> Make node connection checking rely on the configuration. Simplify node ping
> routine.
> ------------------------------------------------------------------------------------
>
> Key: IGNITE-13012
> URL: https://issues.apache.org/jira/browse/IGNITE-13012
> Project: Ignite
> Issue Type: Improvement
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-45
>
> Current node-to-node connection checking has several drawbacks:
> 1) The minimal connection checking interval is not bound to failure detection
> parameters:
> static int ServerImpl.CON_CHECK_INTERVAL = 500;
> 2) Connection checking is implemented as periodic sending of a dedicated
> message (TcpDiscoveryConnectionCheckMessage). It is bound to its own
> timestamp (ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent), not to
> the common time of the last sent message. This is odd because any discovery
> message actually checks the connection, and
> TcpDiscoveryConnectionCheckMessage is only an addition for when the message
> queue has been empty for a long time.
> 3) The period of node-to-node connection checking can sometimes be shortened
> for a strange reason: when no message is sent or received within
> failureDetectionTimeout. Here, although we have a minimal period of
> connection checking (ServerImpl.CON_CHECK_INTERVAL), we can also send
> TcpDiscoveryConnectionCheckMessage before this period has elapsed. Moreover,
> this premature node ping also relies on the time of the last received message.
> Imagine: if node 2 receives no message from node 1 within some time, it
> decides to send an extra ping to node 3 without waiting for the regular ping
> interval. Such behavior causes confusion and gives no additional guarantees.
> 4) If #3 happens, the node writes to the log at INFO level: “Local node seems
> to be disconnected from topology …” although it is not actually disconnected.
> The user can see this message if failureDetectionTimeout is set below 500 ms.
> An INFO message saying that a node might be disconnected suggests network
> trouble, not that everything is OK.
> Suggestions:
> 1) Base the connection check interval on failureDetectionTimeout or
> similar parameters.
> 2) Make the connection check interval rely on the common time of the last
> sent message, not on a dedicated timestamp.
> 3) Remove the additional, random, accelerated connection checking.
> 4) Do not worry the user with “Node disconnected” when everything is OK.
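
The suggestions above could be sketched roughly as follows. This is a minimal illustration, not Ignite's actual ServerImpl code: the class ConnectionCheckSketch, its methods, and the CHECKS_PER_TIMEOUT ratio are all hypothetical names and values chosen for the example. The interval is derived from failureDetectionTimeout (suggestion 1), the decision uses the time of the last sent message of any kind (suggestion 2), and received messages never trigger an early check (suggestion 3).

```java
/** Hypothetical sketch of the proposed logic; not taken from Ignite sources. */
class ConnectionCheckSketch {
    /** Assumed ratio: how many checks should fit into one failure detection timeout. */
    private static final int CHECKS_PER_TIMEOUT = 4;

    /** Interval derived from the configured timeout instead of a fixed 500 ms. */
    private final long checkIntervalMillis;

    /** Time of the last sent message of ANY type, not only connection checks. */
    private long lastSentMillis;

    ConnectionCheckSketch(long failureDetectionTimeoutMillis, long nowMillis) {
        if (failureDetectionTimeoutMillis <= 0)
            throw new IllegalArgumentException("Timeout must be positive.");

        // Never go below 1 ms so that very small timeouts still yield a valid interval.
        this.checkIntervalMillis =
            Math.max(1, failureDetectionTimeoutMillis / CHECKS_PER_TIMEOUT);
        this.lastSentMillis = nowMillis;
    }

    long checkIntervalMillis() {
        return checkIntervalMillis;
    }

    /** To be called whenever any discovery message is sent to the next node. */
    void onMessageSent(long nowMillis) {
        lastSentMillis = nowMillis;
    }

    /**
     * A dedicated connection check is needed only when nothing has been sent
     * for a full interval. Received messages are deliberately ignored here.
     */
    boolean shouldSendConnectionCheck(long nowMillis) {
        return nowMillis - lastSentMillis >= checkIntervalMillis;
    }
}
```

With a timeout of 400 ms and the assumed ratio of 4, the sketch asks for a dedicated check only after 100 ms of silence on the sending side, and any regular discovery message resets that countdown.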