Roman Puchkovskiy created IGNITE-18605:
------------------------------------------
Summary: Account for inherent unreliability of messaging
Key: IGNITE-18605
URL: https://issues.apache.org/jira/browse/IGNITE-18605
Project: Ignite
Issue Type: Improvement
Components: networking
Reporter: Roman Puchkovskiy
Fix For: 3.0.0-beta2
We use ScaleCube for discovery. It uses the SWIM protocol, which relies on
timeouts. This means that a node might fail to send a ping/pong message in
time (for example, if it encounters a long JVM pause), which results in other
nodes concluding that it has disappeared.
First, we should carefully choose the default values for timeouts: if they are
too low, the probability of dropping a node from a cluster is very high (but if
they are too high, some tests might take a lot longer).
Also, we should account for the possibility that a node gets dropped from the
cluster. This means that:
# When we get a node from the physical topology, we must always check that a
node was actually returned and handle 'no such node in the topology' gracefully
# Long message exchanges (for instance, streaming a RAFT snapshot) must be
robust so as to survive short disappearances of nodes (it would be bad to waste
a snapshot installation that already took 30 minutes just because of a
transient failure)
# We need to avoid hangs when someone waits for a message that never gets
delivered due to such a transient failure (could this be the cause of
IGNITE-18506, where sometimes a message never arrives at an internal cursor?)
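
The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Ignite or ScaleCube code: the class TransientFailureSketch, the topology map, and the methods findNode, invoke and streamWithResume are all hypothetical names. It only shows the general shape: an Optional lookup instead of a bare reference, a bounded wait on the reply future, and a transfer that resumes from the last acknowledged chunk instead of starting over.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.IntPredicate;

/** Hypothetical sketch; none of these names are real Ignite or ScaleCube APIs. */
class TransientFailureSketch {
    /** Stand-in for the physical topology: consistent id -> address. */
    static final Map<String, String> topology = new ConcurrentHashMap<>();

    /** Point 1: the lookup result is explicit about the node possibly being gone. */
    static Optional<String> findNode(String consistentId) {
        return Optional.ofNullable(topology.get(consistentId));
    }

    /**
     * Points 1 and 3: fail fast if the node is missing, and bound the wait
     * for a reply so that a lost message cannot cause a hang.
     * The 'reply' future stands for a response that may never arrive.
     */
    static CompletableFuture<String> invoke(
            String consistentId, CompletableFuture<String> reply, long timeoutMillis) {
        if (findNode(consistentId).isEmpty()) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("No such node in the topology: " + consistentId));
        }
        // Completes the future with a TimeoutException if no reply arrives in time.
        return reply.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Point 2: send chunks in order and, after a failed send, retry the same
     * chunk rather than restarting the whole transfer.
     * Returns the number of chunks acknowledged before giving up.
     */
    static int streamWithResume(int totalChunks, IntPredicate sendChunk, int maxRetriesPerChunk) {
        int acked = 0;
        int attempts = 0;
        while (acked < totalChunks) {
            if (sendChunk.test(acked)) {
                acked++;
                attempts = 0; // progress was made, so reset the retry budget
            } else if (++attempts > maxRetriesPerChunk) {
                break; // the failure no longer looks transient
            }
        }
        return acked;
    }

    public static void main(String[] args) {
        topology.put("node-1", "host-1:3344");
        invoke("node-2", new CompletableFuture<>(), 100)
                .exceptionally(e -> "handled: " + e.getMessage())
                .thenAccept(System.out::println);
        System.out.println("acked chunks: " + streamWithResume(4, i -> true, 2));
    }
}
```

In a real implementation the retry budget and the reply timeout would have to be chosen together with the discovery timeouts discussed above, so that a short disappearance of a node is shorter than what the retries can absorb.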
--
This message was sent by Atlassian Jira
(v8.20.10#820010)