[ 
https://issues.apache.org/jira/browse/IGNITE-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-22377:
-----------------------------------
    Release Note: A node excluded from the logical topology will be stopped by 
the failure handler if it re-enters the cluster after segmentation

> Choose node to fail on a refused handshake
> ------------------------------------------
>
>                 Key: IGNITE-22377
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22377
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.2
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently, if during a handshake a node gets refused because it's stale from 
> the point of view of the node to which it connects, the refused node notifies 
> its FailureHandler to force node restart.
> If a network partition happens, this might cause problems when the partition 
> heals: nodes from different segments will start sniping each other. In the 
> worst case, a single segmented node might make the whole cluster (except 
> itself) restart.
> It is suggested that the refusing node sends the following information about 
> the physical topology (PT) as it sees it to the refused node:
>  # Number of nodes in the PT
>  # Min ID of nodes in the PT
> The refused node will only restart if the number of nodes in the PT, as it 
> sees it, is less than the number of nodes in the PT of the refusing node; if 
> the sizes are equal, comparing the Min IDs of nodes in the two PTs allows a 
> deterministic decision.
> This idea needs to be thought through and improved (or rejected).
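
For illustration only, the originally proposed rule could be sketched in Java as below. The class and method names are hypothetical and are not taken from the Ignite code base. The description does not say which side loses when the PT sizes are equal, so the tie-break direction here (the side with the larger Min ID restarts) is an assumption.

```java
/** Hypothetical sketch of the proposed PT-based rule; not Ignite code. */
final class RefusedHandshakeRule {
    /** Summary of the physical topology (PT) as one node sees it. */
    static final class PtSummary {
        final int nodeCount;
        final String minNodeId;

        PtSummary(int nodeCount, String minNodeId) {
            this.nodeCount = nodeCount;
            this.minNodeId = minNodeId;
        }
    }

    /**
     * Decides whether the refused node should restart.
     *
     * @param local PT summary of the refused node.
     * @param remote PT summary sent by the refusing node.
     */
    static boolean refusedNodeShouldRestart(PtSummary local, PtSummary remote) {
        if (local.nodeCount != remote.nodeCount) {
            // The node that sees the smaller PT is the one that restarts.
            return local.nodeCount < remote.nodeCount;
        }

        // Equal sizes: break the tie by Min ID.
        // Assumption: the side with the larger Min ID restarts.
        return local.minNodeId.compareTo(remote.minNodeId) > 0;
    }
}
```

With two nodes that each see only themselves, both PT sizes are 1 and the outcome depends purely on the ID tie-break.
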
> h3. Update
> The idea is rejected. The main reason is that the proposed behavior is 
> completely unpredictable when the cluster consists of two nodes. It makes too 
> many "normal" tests fail for various reasons.
> The approach is replaced with validating the version of the logical topology. 
> This version cannot be increased without a working CMG, which means that only 
> a healthy part of the cluster can do that. So, if a node notices that it is 
> rejected by another node with a higher logical topology version, it should 
> stop itself. If the versions are equal, nothing happens and the nodes have to 
> be stopped manually.
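
A minimal Java sketch of the version-based rule described in the update might look like this. The names `TopologyVersionRule`, `Action` and `onHandshakeRefused` are hypothetical and do not come from the Ignite code base. The description does not cover the case where the refusing node has a lower version, so treating it as "do nothing" is an assumption.

```java
/** Hypothetical sketch of the logical-topology-version rule; not Ignite code. */
final class TopologyVersionRule {
    /** What the refused node does after a refused handshake. */
    enum Action { STOP_SELF, NONE }

    /**
     * @param localVersion logical topology version known to the refused node.
     * @param remoteVersion logical topology version reported by the refusing node.
     */
    static Action onHandshakeRefused(long localVersion, long remoteVersion) {
        if (remoteVersion > localVersion) {
            // Only a part of the cluster with a working CMG can increase the
            // version, so the refusing node belongs to the healthy part.
            return Action.STOP_SELF;
        }

        // Equal versions: nothing happens, nodes must be stopped manually.
        // Lower remote version: not specified in the description; assumed no-op.
        return Action.NONE;
    }
}
```

For example, `onHandshakeRefused(3, 5)` yields `STOP_SELF`, while `onHandshakeRefused(5, 5)` yields `NONE`.
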



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
