Sergey Chugunov created IGNITE-26986:
----------------------------------------
Summary: Multi-datacenter awareness for connection recovery
mechanism
Key: IGNITE-26986
URL: https://issues.apache.org/jira/browse/IGNITE-26986
Project: Ignite
Issue Type: Improvement
Reporter: Sergey Chugunov
Fix For: 2.18
The connection recovery mechanism developed in IGNITE-7163 improves topology
resilience against brief network instability. However, it can bring the whole
cluster down if a cross-DC network partition occurs in a multi-datacenter
environment.
This happens because connection recovery forces a node to segment from the
topology when it cannot restore a connection to its next node within a
specified timeout. If a node sits at the edge of its datacenter and several of
its next nodes are in the remote DC, every attempt by the edge node to find a
live next node fails because of the partition. If the connection recovery
timeout isn't big enough, the edge node considers itself segmented and stops.
The previous node of the newly failed one then becomes the edge node, and the
process repeats.
In this case the connection recovery mechanism forces the whole cluster to
shut down instead of improving stability.
Therefore it should be aware of multi-datacenter environments and adjust its
behavior accordingly.
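
The cascade described above can be illustrated with a toy simulation. This is
only a sketch under simplifying assumptions and does not use any Ignite API:
the class and method names are hypothetical, the "probe budget" (how many live
next nodes a node can try before giving up) is a stand-in for the connection
recovery timeout, already stopped nodes are skipped at no cost, and all nodes
are evaluated in synchronous rounds.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a ring topology split by a cross-DC partition (not Ignite code). */
public class RingPartitionSketch {
    /**
     * @param dcOfNode    datacenter id of each node, in ring order
     * @param probeBudget hypothetical stand-in for the recovery timeout: how many
     *                    live next nodes a node may probe before it gives up
     * @return number of nodes still running once the cascade settles
     */
    static int survivors(int[] dcOfNode, int probeBudget) {
        int n = dcOfNode.length;
        boolean[] stopped = new boolean[n];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Evaluate all nodes against the same snapshot, then stop the losers.
            List<Integer> toStop = new ArrayList<>();
            for (int node = 0; node < n; node++) {
                if (!stopped[node] && !findsNext(node, dcOfNode, stopped, probeBudget)) {
                    toStop.add(node);
                }
            }
            for (int node : toStop) {
                stopped[node] = true;
                changed = true;
            }
        }
        int alive = 0;
        for (boolean s : stopped) {
            if (!s) alive++;
        }
        return alive;
    }

    /** True if the node reaches a live next node in its own DC within the budget. */
    static boolean findsNext(int node, int[] dcOfNode, boolean[] stopped, int probeBudget) {
        int n = dcOfNode.length;
        int probes = 0;
        for (int step = 1; step < n && probes < probeBudget; step++) {
            int candidate = (node + step) % n;
            if (stopped[candidate]) continue;      // assumed: already out of the ring
            probes++;
            // The partition makes every node in the other DC unreachable.
            if (dcOfNode[candidate] == dcOfNode[node]) return true;
        }
        return false;                              // node considers itself segmented
    }

    public static void main(String[] args) {
        int[] ring = {0, 0, 0, 0, 1, 1, 1, 1};     // two DCs, four nodes each
        System.out.println("budget 2: " + survivors(ring, 2)); // prints "budget 2: 0"
        System.out.println("budget 5: " + survivors(ring, 5)); // prints "budget 5: 8"
    }
}
```

In this model a budget of 2 lets the failure walk backwards through each
datacenter until no node is left, while a budget of 5 is enough for each edge
node to probe past the four remote nodes and reconnect inside its own DC.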