[
https://issues.apache.org/jira/browse/IGNITE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Chugunov updated IGNITE-26986:
-------------------------------------
Labels: IEP-140 (was: )
> Multi-datacenter awareness for connection recovery mechanism
> ------------------------------------------------------------
>
> Key: IGNITE-26986
> URL: https://issues.apache.org/jira/browse/IGNITE-26986
> Project: Ignite
> Issue Type: Improvement
> Reporter: Sergey Chugunov
> Priority: Major
> Labels: IEP-140
> Fix For: 2.18
>
>
> The connection recovery mechanism developed in IGNITE-7163 improves topology
> resilience against brief network instability. However, it can bring the
> whole cluster down if a cross-DC network partition occurs in a
> multi-datacenter environment.
> This happens because connection recovery forces a node to segment from the
> topology when it cannot restore a connection to the next node within the
> specified timeout. If a node sits at the edge of its datacenter and several
> of its next nodes are in the remote DC, all attempts of the edge node to
> find a live next node fail because of the partition. If the connection
> recovery timeout is not long enough, the edge node considers itself
> segmented and stops.
> The previous node of the newly failed one then becomes the edge node, and
> the process repeats.
> In this case the connection recovery mechanism forces the whole cluster to
> shut down instead of improving stability.
> Therefore it should be aware of multi-datacenter environments and adjust its
> behavior accordingly.
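
The cascade described above can be illustrated with a toy model. This is only
a sketch under stated assumptions, not Ignite code: the class and method names
are hypothetical, the partition is modeled as "only same-DC nodes are
reachable", and the connection recovery timeout is approximated by
maxAttempts, the number of next nodes a node manages to try before giving up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the ring under a cross-DC partition (hypothetical, not Ignite code).
class RingRecoverySketch {
    /**
     * @param dcOfNode    datacenter id of each node, listed in ring order
     * @param maxAttempts assumed stand-in for the recovery timeout: how many
     *                    next nodes can be tried before the node gives up
     * @return number of nodes still running once no more nodes stop
     */
    static int survivors(int[] dcOfNode, int maxAttempts) {
        int n = dcOfNode.length;
        boolean[] stopped = new boolean[n];
        boolean changed = true;

        while (changed) {
            changed = false;
            List<Integer> toStop = new ArrayList<>();

            for (int node = 0; node < n; node++) {
                if (stopped[node])
                    continue;

                boolean connected = false;
                int attempts = 0;

                // Walk the ring forward, skipping nodes that already stopped.
                for (int step = 1; step < n && attempts < maxAttempts; step++) {
                    int next = (node + step) % n;
                    if (stopped[next])
                        continue;

                    attempts++;

                    // Partition model: only nodes in the same DC are reachable.
                    if (dcOfNode[next] == dcOfNode[node]) {
                        connected = true;
                        break;
                    }
                }

                // No reachable next node within the budget: the node
                // considers itself segmented and stops.
                if (!connected)
                    toStop.add(node);
            }

            // Apply all stops at the end of the round.
            for (int node : toStop) {
                stopped[node] = true;
                changed = true;
            }
        }

        int alive = 0;
        for (boolean s : stopped)
            if (!s)
                alive++;
        return alive;
    }

    public static void main(String[] args) {
        // Nodes 0-3 are in DC 0, nodes 4-7 in DC 1; nodes 3 and 7 are the edge nodes.
        int[] ring = {0, 0, 0, 0, 1, 1, 1, 1};

        System.out.println(survivors(ring, 2)); // prints 0: every node stops, one edge at a time
        System.out.println(survivors(ring, 5)); // prints 8: each edge node gets past the remote DC
    }
}
```

In this model a budget of 2 attempts lets the failure walk backwards through
each datacenter until nothing is left, while a budget of 5 lets an edge node
skip all four remote nodes and reconnect inside its own DC.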