[jira] [Updated] (IGNITE-18712) Do not allow a node excluded from Physical Topology to enter the topology again

Roman Puchkovskiy (Jira) Mon, 06 Feb 2023 00:38:06 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roman Puchkovskiy updated IGNITE-18712:
---------------------------------------
    Description: 
The following scenario is possible:
 # Node X is a part of PT
 # Its network cable gets unplugged, but node X stays alive
 # After the relevant timeouts expire, the other nodes remove node X from PT, so 
their {{MessagingServices}} drop any messages not yet delivered to node X
 # The network cable gets plugged back in, so node X attempts to enter the PT 
with the same old ID (aka Launch ID)

If we allow it to enter PT again, some messages sent to node X by other nodes 
may have been lost without node X ever learning about it. Node X might still 
hold in-memory state left by a process that expects those messages to be 
delivered later, so some invariants might break.

To prevent such a situation, the node must be refused entry; that is, the 
connection must be terminated on a handshake attempt. This has to be done in 
both {{RecoveryServerHandshakeManager}} and {{RecoveryClientHandshakeManager}}.

When a connection attempt is refused, the refusing node must first send an 
explanatory message (like 'your ID is stale') and then close the physical 
connection.
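
A minimal Java sketch of the refusal order described above (explain first, 
then close). The {{Connection}} interface, {{StaleIdHandshakeGuard}} and 
{{RecordingConnection}} are hypothetical names used only for illustration; they 
are not part of the Ignite code base, and the real check would live in the two 
handshake managers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a physical connection; not an Ignite API. */
interface Connection {
    void send(String message);

    void close();
}

/** Hypothetical handshake-side check; all names here are assumptions. */
final class StaleIdHandshakeGuard {
    static final String STALE_ID_MESSAGE = "your ID is stale";

    /** Launch IDs of nodes that this node has removed from PT. */
    private final Set<String> staleLaunchIds = ConcurrentHashMap.newKeySet();

    void markStale(String launchId) {
        staleLaunchIds.add(launchId);
    }

    /** Returns true if the handshake may proceed, false if it was refused. */
    boolean onHandshake(String remoteLaunchId, Connection connection) {
        if (staleLaunchIds.contains(remoteLaunchId)) {
            // Explain the refusal first, then drop the physical connection.
            connection.send(STALE_ID_MESSAGE);
            connection.close();
            return false;
        }
        return true;
    }
}

/** Test double that records the order of calls made on the connection. */
final class RecordingConnection implements Connection {
    final List<String> events = new ArrayList<>();

    @Override
    public void send(String message) {
        events.add("send:" + message);
    }

    @Override
    public void close() {
        events.add("close");
    }
}
```

In a real network stack the close would have to wait until the explanatory 
message has actually been flushed; the sketch only fixes the order of the two 
calls.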

The refused node must take measures to refresh its identity (like initiating a 
critical failure using a Failure Handler).
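
One way the refused side could react, sketched in Java under stated 
assumptions: {{CriticalFailureHandler}} and {{StaleIdRejectionListener}} are 
hypothetical names, not the actual Failure Handler API. The sketch reports the 
rejection only once, because several peers may refuse the same node.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a Failure Handler; not the real Ignite interface. */
interface CriticalFailureHandler {
    void onCriticalFailure(String reason);
}

/** Hypothetical listener invoked when a peer refuses our handshake. */
final class StaleIdRejectionListener {
    private final CriticalFailureHandler handler;

    /** Guards against reporting the same failure once per refusing peer. */
    private final AtomicBoolean reported = new AtomicBoolean();

    StaleIdRejectionListener(CriticalFailureHandler handler) {
        this.handler = handler;
    }

    void onHandshakeRejected(String localLaunchId, String peerMessage) {
        // Only the first rejection triggers the handler.
        if (reported.compareAndSet(false, true)) {
            handler.onCriticalFailure(
                    "Launch ID " + localLaunchId + " rejected by peer: " + peerMessage);
        }
    }
}
```

What the handler does next (for example, restarting the node so that it gets a 
new Launch ID) is left open here, as it is in the description.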

It seems that we do not need a consensus of the whole cluster as messaging 
communications are point-to-point. SWIM 'half consensus' should be enough.

A subtle point is how to persist the fact that a node ID is stale. For 
starters, we could make this information volatile (only keep it in memory), but 
later we could record it using CMG.
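
The volatile option could look roughly like the Java sketch below. 
{{StaleIds}} and {{VolatileStaleIds}} are hypothetical names introduced for 
illustration. Hiding the storage behind an interface is an assumption about 
how a CMG-backed implementation might be swapped in later, not something the 
issue prescribes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical abstraction over "which Launch IDs are known to be stale". */
interface StaleIds {
    void markAsStale(String launchId);

    boolean isStale(String launchId);
}

/** In-memory implementation: the knowledge is lost when this node restarts. */
final class VolatileStaleIds implements StaleIds {
    private final Set<String> ids = ConcurrentHashMap.newKeySet();

    @Override
    public void markAsStale(String launchId) {
        ids.add(launchId);
    }

    @Override
    public boolean isStale(String launchId) {
        return ids.contains(launchId);
    }
}
```

The trade-off of the volatile variant is visible directly: a fresh 
{{VolatileStaleIds}} instance, as after a restart of the refusing node, no 
longer remembers which IDs were stale.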

Please do not confuse this issue with IGNITE-18685, which was caused by a 
rejected attempt to fix the same problem.

  was:
The following scenario is possible:
 # Node X is a part of PT
 # Its network cable gets unplugged, but node X stays alive
 # After the relevant timeouts expire, the other nodes remove node X from PT, so 
their {{MessagingServices}} drop any messages not yet delivered to node X
 # The network cable gets plugged back in, so node X attempts to enter the PT 
with the same old ID (aka Launch ID)

If we allow it to enter PT again, some messages sent to node X by other nodes 
may have been lost without node X ever learning about it. Node X might still 
hold in-memory state left by a process that expects those messages to be 
delivered later, so some invariants might break.

To prevent such a situation, the node must be refused entry; that is, the 
connection must be terminated on a handshake attempt. This has to be done in 
both {{RecoveryServerHandshakeManager}} and {{RecoveryClientHandshakeManager}}.

When a connection attempt is refused, the refusing node must first send an 
explanatory message (like 'your ID is stale') and then close the physical 
connection.

The refused node must take measures to refresh its identity (like initiating a 
critical failure using a Failure Handler).

A subtle point is how to persist the fact that a node ID is stale. For 
starters, we could make this information volatile (only keep it in memory), but 
later we could record it using CMG.

Please do not confuse this issue with IGNITE-18685, which was caused by a 
rejected attempt to fix the same problem.


> Do not allow a node excluded from Physical Topology to enter the topology 
> again
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-18712
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18712
>             Project: Ignite
>          Issue Type: New Feature
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
