Roman Puchkovskiy created IGNITE-18605:
------------------------------------------

             Summary: Account for inherent unreliability of messaging
                 Key: IGNITE-18605
                 URL: https://issues.apache.org/jira/browse/IGNITE-18605
             Project: Ignite
          Issue Type: Improvement
          Components: networking
            Reporter: Roman Puchkovskiy
             Fix For: 3.0.0-beta2


We use ScaleCube for discovery. It uses the SWIM protocol, which relies on 
timeouts. This means that a node might fail to send a ping/pong message in 
time (for example, if it encounters a long JVM pause), which results in other 
nodes concluding that it has disappeared.

First, we should carefully choose the default values for timeouts: if they are 
too low, the probability of dropping a node from a cluster is very high (but if 
they are too high, some tests might take a lot longer).

Also, we should account for the possibility of a node being dropped from the 
cluster. This means that:
 # When we look up a node in the physical topology, we must always check that 
a node was actually returned and handle the 'no such node in the topology' 
case gracefully
 # Long message exchanges (for instance, streaming a RAFT snapshot) must be 
robust so as to survive short disappearances of nodes (it would be bad to waste 
a snapshot installation that already took 30 minutes just because of a 
transient failure)
 # We need to avoid hangs when someone waits for a message that never gets 
delivered due to such a transient failure (could this be the cause of 
IGNITE-18506, where sometimes a message never arrives at an internal cursor?)
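
To make the first and third points concrete, here is a minimal Java sketch. It 
is an illustration only: the Topology class, the invoke() method and the 
transport function are hypothetical stand-ins, not Ignite or ScaleCube APIs. 
It assumes that a topology lookup can return "absent" and that a response 
future can be given a deadline (CompletableFuture.orTimeout, Java 9+).

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical sketch; none of these types are Ignite or ScaleCube APIs. */
public class UnreliableMessagingSketch {
    /** Stand-in for a physical topology: node name -> address. */
    static final class Topology {
        private final Map<String, String> nodes = new ConcurrentHashMap<>();

        void add(String name, String address) {
            nodes.put(name, address);
        }

        void remove(String name) {
            nodes.remove(name);
        }

        /** Returns an empty Optional (never null) when the node is not present. */
        Optional<String> addressOf(String name) {
            return Optional.ofNullable(nodes.get(name));
        }
    }

    /**
     * Sends a request through the given (hypothetical) transport function.
     * A missing node fails the returned future with a descriptive error,
     * and a response that never arrives fails it with a TimeoutException.
     */
    static CompletableFuture<String> invoke(
            Topology topology,
            String nodeName,
            Function<String, CompletableFuture<String>> transport,
            long timeoutMillis) {
        Optional<String> address = topology.addressOf(nodeName);
        if (!address.isPresent()) {
            // The node may have been dropped between discovery and this call.
            return CompletableFuture.failedFuture(
                    new IllegalStateException("No such node in the topology: " + nodeName));
        }
        // Bound the wait so that a lost message fails the caller instead of hanging it.
        return transport.apply(address.get()).orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        Topology topology = new Topology();
        topology.add("node-1", "10.0.0.1:3344");

        // A transport whose response never arrives, simulating a dropped message.
        Function<String, CompletableFuture<String>> lossy = addr -> new CompletableFuture<>();

        try {
            invoke(topology, "node-1", lossy, 100).join();
        } catch (CompletionException e) {
            System.out.println("Timed out: " + (e.getCause() instanceof TimeoutException));
        }

        // Simulate the node being dropped from the cluster.
        topology.remove("node-1");
        try {
            invoke(topology, "node-1", lossy, 100).join();
        } catch (CompletionException e) {
            System.out.println(e.getCause().getMessage());
        }
    }
}
```

The sketch does not address the second point: making a long exchange such as 
snapshot streaming survive a short disappearance would additionally require 
some form of resume or retry logic, which is out of scope here.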



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
