[ https://issues.apache.org/jira/browse/IGNITE-18605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Evgeny Stanilovsky updated IGNITE-18605:
----------------------------------------
Fix Version/s: 3.2
(was: 3.1)
> Account for inherent unreliability of messaging
> -----------------------------------------------
>
> Key: IGNITE-18605
> URL: https://issues.apache.org/jira/browse/IGNITE-18605
> Project: Ignite
> Issue Type: Improvement
> Components: networking
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3, tech-debt
> Fix For: 3.2
>
>
> We use ScaleCube for discovery. It uses the SWIM protocol, which relies on timeouts.
> This means that a node might fail to send a ping/pong message in time
> (for example, if it encounters a long JVM pause), which results in
> other nodes thinking that it has disappeared.
> First, we should carefully choose the default values for timeouts: if they
> are too low, the probability of dropping a node from a cluster is very high
> (but if they are too high, some tests might take a lot longer).
> Also, we should account for the possibility of a node being dropped from the
> cluster. This means that:
> # When we get a node from a physical topology, we must always check that a node
> was actually returned and handle 'no such node in the topology' gracefully
> # Long message exchanges (for instance, streaming a RAFT snapshot) must be
> robust so as to survive short disappearances of nodes (it would be bad to
> waste a snapshot installation that already took 30 minutes just because of a
> transient failure)
> # We need to avoid hangs when someone waits for a message that never
> gets delivered due to such a transient failure (could this be the cause of
> IGNITE-18506, where sometimes a message never arrives at an internal cursor?)
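
A minimal sketch of the first and third points in the list above, in plain Java. Every name here (TransientFailureSketch, topology, findNode, invoke) is hypothetical and is not Ignite or ScaleCube API; the map merely stands in for the physical topology, and the timeout value is an assumption. The idea is that a lookup returns an Optional so that 'no such node in the topology' must be handled explicitly, and that a wait for a response is always bounded.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

class TransientFailureSketch {
    /** Hypothetical stand-in for the physical topology: consistent ID to address. */
    static final Map<String, String> topology = new ConcurrentHashMap<>();

    /** Returns the node address, or an empty Optional if the node is not in the topology. */
    static Optional<String> findNode(String consistentId) {
        return Optional.ofNullable(topology.get(consistentId));
    }

    /**
     * Guards a pending response against both failure modes:
     * fails fast if the target node is absent, otherwise bounds the wait
     * so that a message lost to a transient failure cannot cause a hang.
     */
    static CompletableFuture<String> invoke(
            String consistentId, CompletableFuture<String> pendingResponse, long timeoutMillis) {
        if (findNode(consistentId).isEmpty()) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("No such node in the topology: " + consistentId));
        }
        // orTimeout completes the future with a TimeoutException if no response arrives in time.
        return pendingResponse.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

A caller would then decide what to do with the failed future (retry, fail the operation, or wait for the node to reappear) rather than blocking indefinitely. Resumable long exchanges, such as snapshot streaming, are a separate concern that this sketch does not attempt to model.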
--
This message was sent by Atlassian Jira
(v8.20.10#820010)