[ 
https://issues.apache.org/jira/browse/IGNITE-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718029#comment-17718029
 ] 

Roman Puchkovskiy commented on IGNITE-18712:
--------------------------------------------

Thanks!

> Do not allow a node excluded from Physical Topology to enter the topology 
> again
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-18712
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18712
>             Project: Ignite
>          Issue Type: New Feature
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The following scenario is possible:
>  # Node X is a part of PT
>  # Its network cable gets unplugged, but node X stays alive
>  # After the proper timeouts, other nodes remove node X from PT, so their 
> {{MessagingServices}} drop messages not yet delivered to node X
>  # The network cable gets plugged in again, so node X attempts to enter the 
> PT with the same old ID (aka Launch ID)
> If we allow it to enter PT again, we might lose some messages to node X from 
> other nodes, but node X will never know about it. Some state in its memory 
> might still remain from a process thinking that the messages will be 
> delivered later, so some invariants might break.
> To prevent such a situation, the node must be refused entry, namely, a 
> connection must be terminated on a handshake attempt. This has to be done 
> both in {{RecoveryServerHandshakeManager}} and 
> {{RecoveryClientHandshakeManager}}.
> When a node's connection attempt is refused, the refusing node must first 
> send an explanatory message (like 'your ID is stale') and then close the 
> physical connection.
> The refused node must take measures to refresh its identity (like initiating 
> a critical failure using a Failure Handler to reboot).
> It seems that we do not need a consensus of the whole cluster (on the 
> decision that a node has left and should never be allowed to join again) as 
> messaging communications are point-to-point. SWIM 'half consensus' should be 
> enough.
> A subtle thing is how we persist the fact that some node ID is stale. For 
> starters, we could make this information volatile (only keep it in memory), 
> but later we could record this information using CMG.
> Please do not confuse this issue with IGNITE-18685, which was caused by a 
> rejected attempt to fix the same problem.
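
The refusal flow described above (keep stale IDs in memory, send an explanatory message, then close the connection) can be sketched as follows. This is only an illustration under stated assumptions: `StaleIdHandshakeSketch`, `StaleIdRegistry`, `Connection`, `RecordingConnection` and `onHandshake` are hypothetical names invented for the example, not Ignite classes, and the real `RecoveryServerHandshakeManager` and `RecoveryClientHandshakeManager` may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these types are real Ignite classes. */
final class StaleIdHandshakeSketch {
    /** Volatile (in-memory only) record of launch IDs removed from the Physical Topology. */
    static final class StaleIdRegistry {
        private final Set<String> staleIds = ConcurrentHashMap.newKeySet();

        /** Called when the local node decides that a peer has left the topology. */
        void markStale(String launchId) {
            staleIds.add(launchId);
        }

        boolean isStale(String launchId) {
            return staleIds.contains(launchId);
        }
    }

    /** Minimal stand-in for a physical connection (assumed interface). */
    interface Connection {
        void send(String message);

        void close();
    }

    /**
     * Handshake check: returns true if the handshake may proceed.
     * For a stale ID, first sends the explanation, then closes the connection.
     */
    static boolean onHandshake(StaleIdRegistry registry, String remoteLaunchId, Connection connection) {
        if (registry.isStale(remoteLaunchId)) {
            connection.send("REJECTED: your ID is stale");
            connection.close();
            return false;
        }
        return true;
    }

    /** Test double that records the order of calls made on the connection. */
    static final class RecordingConnection implements Connection {
        final List<String> events = new ArrayList<>();

        @Override
        public void send(String message) {
            events.add("send:" + message);
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    public static void main(String[] args) {
        StaleIdRegistry registry = new StaleIdRegistry();
        RecordingConnection connection = new RecordingConnection();

        // Node X was removed from PT after the timeouts, so its launch ID is now stale.
        registry.markStale("node-x-launch-1");

        boolean accepted = onHandshake(registry, "node-x-launch-1", connection);
        System.out.println("accepted=" + accepted + ", events=" + connection.events);
    }
}
```

The same check would have to run on both the accepting and the initiating side of a handshake, since either side may be the one that considers the peer's ID stale. On receiving the rejection message, the refused node would then take measures to refresh its identity, as described above.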



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
