Sergey Chugunov created IGNITE-26986:
----------------------------------------

             Summary: Multi-datacenter awareness for connection recovery
mechanism
                 Key: IGNITE-26986
                 URL: https://issues.apache.org/jira/browse/IGNITE-26986
             Project: Ignite
          Issue Type: Improvement
            Reporter: Sergey Chugunov
             Fix For: 2.18


The connection recovery mechanism developed in IGNITE-7163 improves topology
resilience against brief network instability. However, it can bring down the
whole cluster if a cross-DC network partition occurs in a multi-datacenter
environment.

This happens because connection recovery forces a node to segment from the
topology when it cannot restore a connection to its next node within a
specified timeout. If a node sits at the edge of its datacenter and several of
its next nodes are in the remote DC, every attempt by the edge node to find a
live next node fails because of the partition. If the connection recovery
timeout is not long enough, the edge node considers itself segmented and stops.

The node preceding the newly failed one then becomes the edge node, and the
process repeats.

In this case the connection recovery mechanism forces the whole cluster to
shut down instead of improving stability.
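
The cascade described above can be illustrated with a toy model. This is only
a sketch and not Ignite's discovery code: the class RingCascadeSketch, its
methods, and the maxProbes parameter are hypothetical names introduced here.
It assumes the recovery timeout can be approximated by the number of next
nodes a node manages to try before giving up, and that after the partition a
node can reach only live nodes in its own DC.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the segmentation cascade; not Ignite's actual discovery code. */
public class RingCascadeSketch {
    /**
     * Returns true if the node at index 'from' reaches a live node of its own DC
     * within 'maxProbes' attempts, walking the ring in order.
     * 'maxProbes' is a hypothetical stand-in for the connection recovery timeout.
     */
    public static boolean findsAliveNext(String[] ring, boolean[] stopped, int from, int maxProbes) {
        int n = ring.length;
        for (int step = 1; step <= maxProbes && step < n; step++) {
            int next = (from + step) % n;
            // After the partition, only live nodes in the same DC are reachable.
            if (!stopped[next] && ring[next].equals(ring[from])) {
                return true;
            }
        }
        return false;
    }

    /**
     * Simulates the nodes of 'localDc' after a cross-DC partition.
     * 'ring' holds the DC label of each node in ring order.
     * Returns the indices of the nodes that stopped, in the order they stopped.
     */
    public static List<Integer> simulate(String[] ring, String localDc, int maxProbes) {
        boolean[] stopped = new boolean[ring.length];
        List<Integer> stopOrder = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < ring.length; i++) {
                if (stopped[i] || !ring[i].equals(localDc)) {
                    continue;
                }
                if (!findsAliveNext(ring, stopped, i, maxProbes)) {
                    // The node considers itself segmented and stops.
                    stopped[i] = true;
                    stopOrder.add(i);
                    progress = true;
                }
            }
        }
        return stopOrder;
    }

    public static void main(String[] args) {
        // Four nodes in DC "A" followed by four nodes in DC "B".
        String[] ring = {"A", "A", "A", "A", "B", "B", "B", "B"};
        // Short budget: the edge node (index 3) stops first, then 2, 1, 0.
        System.out.println(simulate(ring, "A", 3)); // prints [3, 2, 1, 0]
        // Larger budget: the edge node wraps around to node 0 and nobody stops.
        System.out.println(simulate(ring, "A", 5)); // prints []
    }
}
```

In this model the outcome depends only on whether the probe budget is large
enough to skip every node of the remote DC, which matches the dependency on
the recovery timeout described above.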

Therefore, the mechanism should be aware of multi-datacenter environments and
adjust its behavior accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
