[
https://issues.apache.org/jira/browse/IGNITE-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-18712:
---------------------------------------
Description:
The following scenario is possible:
# Node X is a part of the Physical Topology (PT)
# Its network cable gets unplugged, but node X stays alive
# After the relevant timeouts expire, other nodes remove node X from the PT, so
their {{MessagingServices}} drop any messages not yet delivered to node X
# The network cable gets plugged in again, so node X attempts to enter the PT
with the same old ID (aka Launch ID)
If we allow it to enter the PT again, some messages sent to node X by other
nodes may have been lost, and node X will never know about it. Its memory may
still hold state belonging to a process that expects those messages to be
delivered later, so some invariants might break.
To prevent such a situation, the node must be refused entry, namely, a
connection must be terminated on a handshake attempt. This has to be done both
in {{RecoveryServerHandshakeManager}} and
{{RecoveryClientHandshakeManager}}.
When a node's connection attempt is refused, the refusing node must first send
an explanatory message (like 'your ID is stale') and then close the physical
connection.
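The refusal order described above (explain first, then close) can be sketched
roughly as follows. This is only an illustration under assumptions: the names
PeerChannel, StaleIdGate and RecordingChannel are hypothetical and are not part
of the Ignite code base, the launch ID is represented as a plain String, and
the real handshake managers are considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a physical connection; not an Ignite API. */
interface PeerChannel {
    void send(String message);

    void close();
}

/** Hypothetical handshake-side check against a set of stale launch IDs. */
class StaleIdGate {
    private final Set<String> staleLaunchIds;

    StaleIdGate(Set<String> staleLaunchIds) {
        this.staleLaunchIds = staleLaunchIds;
    }

    /** Returns true if the handshake may proceed, false if the peer was refused. */
    boolean admit(String launchId, PeerChannel channel) {
        if (!staleLaunchIds.contains(launchId)) {
            return true;
        }

        // Explain the refusal first, then close the physical connection.
        channel.send("Your ID is stale: " + launchId);
        channel.close();

        return false;
    }
}

/** Test double that records the order of operations on the channel. */
class RecordingChannel implements PeerChannel {
    final List<String> events = new ArrayList<>();

    @Override
    public void send(String message) {
        events.add("send:" + message);
    }

    @Override
    public void close() {
        events.add("close");
    }
}
```

The same check would have to run on both the server and the client side of the
handshake, since either side may be the one that still remembers the stale ID.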
The refused node must take measures to refresh its identity (like initiating a
critical failure using a Failure Handler).
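One possible shape of the reaction on the refused node is sketched below. The
CriticalFailureHandler interface and StaleIdRejectionListener class are
hypothetical names used for illustration only; they are not the actual Failure
Handler API. The sketch assumes that several peers may refuse the node at about
the same time, so the escalation is guarded to happen at most once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical callback standing in for a failure handler; not the Ignite interface. */
interface CriticalFailureHandler {
    void onCriticalFailure(String reason);
}

/** Hypothetical listener invoked when a peer refuses our handshake. */
class StaleIdRejectionListener {
    private final CriticalFailureHandler handler;

    private final AtomicBoolean failed = new AtomicBoolean();

    StaleIdRejectionListener(CriticalFailureHandler handler) {
        this.handler = handler;
    }

    /** Called when a peer reports that our launch ID is stale. */
    void onRejected(String peerName) {
        // Several peers may refuse us concurrently; escalate only once.
        if (failed.compareAndSet(false, true)) {
            handler.onCriticalFailure("Refused by " + peerName + ": launch ID is stale");
        }
    }
}
```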
It seems that we do not need a consensus of the whole cluster as messaging
communications are point-to-point. SWIM 'half consensus' should be enough.
A subtle thing is how we persist the fact that some node ID is stale. For
starters, we could make this information volatile (only keep it in memory), but
later we could record this information using CMG.
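The volatile variant could look roughly like the sketch below. VolatileStaleIds
is a hypothetical name, and the launch ID is again assumed to be a String. Each
node keeps its own set, in line with the point-to-point reasoning above. The
set lives only in memory, so it is lost on restart, and nothing in this sketch
ever evicts entries.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory (volatile) registry of stale launch IDs. */
class VolatileStaleIds {
    // Concurrent set: topology events and handshakes may run on different threads.
    private final Set<String> staleIds = ConcurrentHashMap.newKeySet();

    /** Called when the local node sees a peer removed from the physical topology. */
    void markStale(String launchId) {
        staleIds.add(launchId);
    }

    /** Called during a handshake to decide whether the peer must be refused. */
    boolean isStale(String launchId) {
        return staleIds.contains(launchId);
    }
}
```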
Please do not confuse this issue with IGNITE-18685, which was caused by a
rejected attempt to fix the same problem.
was:
The following scenario is possible:
# Node X is a part of the Physical Topology (PT)
# Its network cable gets unplugged, but node X stays alive
# After the relevant timeouts expire, other nodes remove node X from the PT, so
their {{MessagingServices}} drop any messages not yet delivered to node X
# The network cable gets plugged in again, so node X attempts to enter the PT
with the same old ID (aka Launch ID)
If we allow it to enter the PT again, some messages sent to node X by other
nodes may have been lost, and node X will never know about it. Its memory may
still hold state belonging to a process that expects those messages to be
delivered later, so some invariants might break.
To prevent such a situation, the node must be refused entry, namely, a
connection must be terminated on a handshake attempt. This has to be done both
in {{RecoveryServerHandshakeManager}} and
{{RecoveryClientHandshakeManager}}.
When a node's connection attempt is refused, the refusing node must first send
an explanatory message (like 'your ID is stale') and then close the physical
connection.
The refused node must take measures to refresh its identity (like initiating a
critical failure using a Failure Handler).
A subtle thing is how we persist the fact that some node ID is stale. For
starters, we could make this information volatile (only keep it in memory), but
later we could record this information using CMG.
Please do not confuse this issue with IGNITE-18685, which was caused by a
rejected attempt to fix the same problem.
> Do not allow a node excluded from Physical Topology to enter the topology
> again
> -------------------------------------------------------------------------------
>
> Key: IGNITE-18712
> URL: https://issues.apache.org/jira/browse/IGNITE-18712
> Project: Ignite
> Issue Type: New Feature
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>