Roman Puchkovskiy created IGNITE-18712:
------------------------------------------
Summary: Do not allow a node excluded from Physical Topology to
enter the topology again
Key: IGNITE-18712
URL: https://issues.apache.org/jira/browse/IGNITE-18712
Project: Ignite
Issue Type: New Feature
Reporter: Roman Puchkovskiy
Assignee: Roman Puchkovskiy
Fix For: 3.0.0-beta2
The following scenario is possible:
# Node X is a part of PT
# Its network cable gets unplugged, but node X stays alive
# After the relevant timeouts expire, the other nodes remove node X from PT, so
their {{MessagingServices}} drop the messages not yet delivered to node X
# The network cable gets plugged in again, so node X attempts to enter the PT
with the same old ID (aka Launch ID)
If we allow it to enter PT again, some messages from other nodes to node X
might be lost, and node X will never know about it. Some state in its memory
might remain from a process that assumes the messages will be delivered later,
so some invariants might break.
To prevent such a situation, the node must be refused entry, namely, a
connection must be terminated on a handshake attempt. This has to be done both
in {{RecoveryServerHandshakeManager}} and
{{RecoveryClientHandshakeManager}}.
When a node's connection attempt is refused, the refusing node must first send
an explanatory message (like 'your ID is stale') and then close the physical
connection.
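The refusal order described above (explain first, then close) could be sketched
roughly as follows. This is only an illustration: {{HandshakeRejectionSketch}},
{{Connection}}, {{onHandshake}} and {{RecordingConnection}} are hypothetical
names, not the actual API of the handshake managers, and representing the launch
ID as a {{String}} is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the real handshake manager code. */
final class HandshakeRejectionSketch {
    /** Hypothetical minimal view of a physical connection. */
    interface Connection {
        void send(String message);

        void close();
    }

    /**
     * Handles an incoming handshake from a node with the given launch ID.
     * Returns true if the handshake may proceed, false if it was refused.
     */
    static boolean onHandshake(String remoteLaunchId, Set<String> staleIds, Connection connection) {
        if (staleIds.contains(remoteLaunchId)) {
            // First tell the remote node why it is being refused...
            connection.send("Rejected: your ID is stale: " + remoteLaunchId);
            // ...and only then close the physical connection.
            connection.close();
            return false;
        }
        return true;
    }

    /** Test double that records the order of operations on a connection. */
    static final class RecordingConnection implements Connection {
        final List<String> events = new ArrayList<>();

        @Override
        public void send(String message) {
            events.add("send:" + message);
        }

        @Override
        public void close() {
            events.add("close");
        }
    }
}
```

The same check would have to run on both the server and the client side of the
handshake, since either side may be the one holding the stale-ID knowledge.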
The refused node must take measures to refresh its identity (like initiating a
critical failure using a Failure Handler).
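On the refused side, the reaction might look like the sketch below. All names
here ({{RefusedNodeSketch}}, {{RejectionReason}}, {{onRejected}}) are
hypothetical, and the failure handler is modelled as a plain
{{Consumer<Throwable>}} rather than the real Failure Handler interface.

```java
import java.util.function.Consumer;

/** Hypothetical sketch of how a refused node could react. */
final class RefusedNodeSketch {
    /** Hypothetical reasons a handshake may be rejected. */
    enum RejectionReason { STALE_LAUNCH_ID, OTHER }

    /**
     * Reacts to a rejected handshake. Returns true if a critical failure
     * was raised through the supplied failure handler.
     */
    static boolean onRejected(RejectionReason reason, Consumer<Throwable> failureHandler) {
        if (reason == RejectionReason.STALE_LAUNCH_ID) {
            // The node cannot rejoin with its current identity, so escalate;
            // a restart would give the node a fresh launch ID.
            failureHandler.accept(new IllegalStateException("Launch ID is stale"));
            return true;
        }
        // Other rejection reasons are out of scope for this sketch.
        return false;
    }
}
```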
A subtle thing is how we persist the fact that some node ID is stale. For
starters, we could make this information volatile (only keep it in memory), but
later we could record this information using CMG.
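A minimal sketch of the volatile variant, assuming launch IDs are represented
as strings; {{StaleIdRegistry}} and its methods are hypothetical names used
only for illustration. Because the set lives in memory only, the knowledge is
lost when the refusing node restarts, which is the gap that recording it via
CMG would close.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory (volatile) registry of stale launch IDs. */
final class StaleIdRegistry {
    // Thread-safe set: topology events and handshakes may run on different threads.
    private final Set<String> staleIds = ConcurrentHashMap.newKeySet();

    /** To be called when a node is removed from the physical topology. */
    void markStale(String launchId) {
        staleIds.add(launchId);
    }

    /** To be checked during a handshake, before the connection is accepted. */
    boolean isStale(String launchId) {
        return staleIds.contains(launchId);
    }
}
```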
Please do not confuse this issue with IGNITE-18685, which was caused by a
rejected attempt to fix the same problem.