[jira] [Updated] (IGNITE-18605) Account for inherent unreliability of messaging

Evgeny Stanilovsky (Jira) Sun, 19 Oct 2025 23:26:07 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-18605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Evgeny Stanilovsky updated IGNITE-18605:
----------------------------------------
    Fix Version/s: 3.2
                       (was: 3.1)

> Account for inherent unreliability of messaging
> -----------------------------------------------
>
>                 Key: IGNITE-18605
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18605
>             Project: Ignite
>          Issue Type: Improvement
>          Components: networking
>            Reporter: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3, tech-debt
>             Fix For: 3.2
>
>
> We use ScaleCube for discovery. It uses the SWIM protocol, which relies on 
> timeouts. This means that a node might, for some reason, fail to send a 
> ping/pong message in time (for example, if it encounters a long JVM pause), 
> which results in other nodes thinking that it has disappeared.
> First, we should carefully choose the default values for timeouts: if they 
> are too low, the probability of dropping a node from a cluster is very high 
> (but if they are too high, some tests might take a lot longer).
> Also, we should account for the possibility of a node being dropped from the 
> cluster. This means that:
>  # When we get a node from a physical topology, we must always check that it 
> was returned and handle 'no such node in the topology' gracefully
>  # Long message exchanges (for instance, streaming a RAFT snapshot) must be 
> robust so as to survive short disappearances of nodes (it would be bad to 
> waste a snapshot installation that already took 30 minutes just because of a 
> transient failure)
>  # We need to avoid hangs in cases where someone waits for a message that 
> never gets delivered due to such a transient failure (could this be the cause 
> of IGNITE-18506, where sometimes a message never arrives at an internal cursor?)
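
Points 1 and 3 above can be illustrated with a minimal Java sketch. It is not taken from the Ignite or ScaleCube code bases: the class TransientFailureSketch, the topology map, and the methods findNode, withTimeout and invoke are all hypothetical names chosen for illustration. It assumes Java 11+ and only shows the general shape: a topology lookup that forces the caller to handle a missing node, and a bounded wait so that a lost message surfaces as a failure instead of a hang.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names are Ignite or ScaleCube APIs. */
final class TransientFailureSketch {
    /** Hypothetical stand-in for the physical topology: consistent ID -> address. */
    static final Map<String, String> topology = new ConcurrentHashMap<>();

    /** Point 1: returning Optional makes the "node is gone" case explicit for callers. */
    static Optional<String> findNode(String consistentId) {
        return Optional.ofNullable(topology.get(consistentId));
    }

    /** Point 3: bound the wait so a lost response completes the future exceptionally. */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> response, long timeoutMillis) {
        return response.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Fails fast if the target left the topology; otherwise waits with a deadline. */
    static CompletableFuture<String> invoke(String consistentId, long timeoutMillis) {
        Optional<String> node = findNode(consistentId);
        if (node.isEmpty()) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("No such node in the topology: " + consistentId));
        }
        // A never-completed future models a response that was lost in transit.
        CompletableFuture<String> lostResponse = new CompletableFuture<>();
        return withTimeout(lostResponse, timeoutMillis);
    }
}
```

With this shape, a caller of invoke always gets a future that completes, either with an IllegalStateException when the node is absent or with a java.util.concurrent.TimeoutException when the response never arrives. Choosing the timeout value is the same trade-off described for the discovery timeouts. Point 2 (resumable long exchanges) is not covered by this sketch.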



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
