[ 
https://issues.apache.org/jira/browse/IGNITE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Chugunov updated IGNITE-26986:
-------------------------------------
    Labels: IEP-140  (was: )

> Multi-datacenter awareness for connection recovery mechanism
> ------------------------------------------------------------
>
>                 Key: IGNITE-26986
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26986
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Sergey Chugunov
>            Priority: Major
>              Labels: IEP-140
>             Fix For: 2.18
>
>
> The connection recovery mechanism developed in IGNITE-7163 improves topology
> resilience against brief network instability. However, it can bring the
> whole cluster down if a cross-DC network partition occurs in a
> multi-datacenter environment.
> This happens because connection recovery forces a node to segment from the
> topology when it cannot restore a connection to the next node within a
> specified timeout. If a node sits at the edge of its datacenter and several
> of its next nodes are in the remote DC, then every attempt by the edge node
> to find a live next node fails because of the partition. If the connection
> recovery timeout isn't long enough, the edge node considers itself
> segmented and stops.
> The node preceding the newly failed one then becomes the edge node, and the
> process repeats.
> In this case the connection recovery mechanism forces the whole cluster to
> shut down instead of improving stability.
> The mechanism should therefore be aware of multi-datacenter environments
> and adjust its behavior accordingly.
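
The cascade described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ignite code: the function name simulate_partition, the (node, dc) ring representation, and the max_attempts budget (standing in for the connection recovery timeout) are all hypothetical. Nodes in different DCs are assumed unreachable, and a node segments when none of the next nodes it manages to try is reachable.

```python
def simulate_partition(ring, max_attempts):
    """Toy model of the cascade (hypothetical, not Ignite's implementation).

    ring         -- list of (node_id, dc) tuples in ring order
    max_attempts -- how many next nodes a node can try before its
                    recovery timeout expires (stand-in for the timeout)
    Returns the ids of nodes still alive once the topology is stable.
    """
    alive = list(ring)
    while True:
        n = len(alive)
        segmented = []
        for i, (node, dc) in enumerate(alive):
            # Next nodes in ring order, truncated by the attempt budget.
            candidates = [alive[(i + k) % n] for k in range(1, n)][:max_attempts]
            # During a cross-DC partition only same-DC nodes are reachable.
            if candidates and all(cand_dc != dc for _, cand_dc in candidates):
                segmented.append((node, dc))
        if not segmented:
            return [node for node, _ in alive]
        # All nodes that gave up in this round stop at the same time.
        alive = [entry for entry in alive if entry not in segmented]


ring = [("A1", "A"), ("A2", "A"), ("A3", "A"),
        ("B1", "B"), ("B2", "B"), ("B3", "B")]

print(simulate_partition(ring, 2))  # [] : every node segments
print(simulate_partition(ring, 4))  # all six nodes survive
```

In this model a budget of 2 takes down both datacenters edge node by edge node, while a budget of 4 lets each edge node skip the three remote nodes and reconnect inside its own DC.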



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
