[
https://issues.apache.org/jira/browse/IGNITE-28751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Petrov updated IGNITE-28751:
------------------------------------
Labels: IEP-132 ise (was: )
> Refactor TCP Discovery SPI joining node validation
> --------------------------------------------------
>
> Key: IGNITE-28751
> URL: https://issues.apache.org/jira/browse/IGNITE-28751
> Project: Ignite
> Issue Type: Task
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: IEP-132, ise
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Motivation.
> The following problems arise during the implementation of the RU mechanism:
> 1. We need to validate joining nodes and ensure that the cluster never
> contains nodes running more than two distinct product versions.
> 2. We need to ensure that all nodes in the cluster run the same version
> when the RU process is about to complete.
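The two requirements above can be sketched as a pair of checks. This is a minimal illustration only: the class RuVersionCheck, its method names, and the representation of versions as plain strings are assumptions made for the example and are not part of the Ignite API.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of the Ignite API. */
final class RuVersionCheck {
    /** Maximum number of distinct product versions tolerated while the RU is in progress. */
    static final int MAX_VERSIONS_DURING_RU = 2;

    private RuVersionCheck() {
        // Static helper, no instances.
    }

    /** Returns true if admitting a node with joiningVer keeps the cluster within the limit. */
    static boolean canJoin(Collection<String> clusterVers, String joiningVer) {
        Set<String> distinct = new TreeSet<>(clusterVers);
        distinct.add(joiningVer);

        return distinct.size() <= MAX_VERSIONS_DURING_RU;
    }

    /** Returns true if the cluster runs exactly one version, so the RU may complete. */
    static boolean canFinish(Collection<String> clusterVers) {
        return new TreeSet<>(clusterVers).size() == 1;
    }
}
```

Both checks are only as accurate as the version collection passed to them, which is why the visibility problems described below matter.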
> Currently, join validation logic can be implemented in the
> GridComponent#validateNode() method, which is called for all joining nodes on
> the coordinator.
> However, the node join process in the TCP Discovery SPI consists of three
> phases:
> 1. node validation (see RingMessageWorker#processJoinRequestMessage)
> 2. node join process (data exchange between the joining node and the cluster)
> (see RingMessageWorker#processNodeAddedMessage /
> processNodeAddFinishedMessage)
> 3. node join completion (the node is added to the discovery cache and becomes
> visible to all Ignite components) (see DiscoverySpiListener#onDiscovery)
> *First problem*
> While the joining node is in phase 2, the RU processor cannot see it via
> GridDiscoveryManager#remoteNodes or similar methods and therefore cannot
> properly check the current topology.
> *It is proposed* to introduce a new method in GridDiscoveryManager that
> returns remote nodes by directly querying the underlying Discovery SPI. In
> fact, all the necessary mechanisms already exist; they are simply not used
> (see DiscoverySpi#getRemoteNodes).
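A rough sketch of the difference between the two views follows. It assumes, as the description suggests, that the SPI-level view already contains nodes that have not yet completed the join. SpiView, DiscoManagerSketch, and spiRemoteNodes are hypothetical stand-ins written for this example, and nodes are reduced to string IDs to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical stand-in for the part of the Discovery SPI used here. */
interface SpiView {
    /** Remote nodes as seen by the SPI itself. */
    Collection<String> getRemoteNodes();
}

/** Hypothetical, simplified slice of a discovery manager. */
final class DiscoManagerSketch {
    private final SpiView spi;

    /** Nodes that have completed phase 3 and are visible to components. */
    private final List<String> discoCache = new ArrayList<>();

    DiscoManagerSketch(SpiView spi) {
        this.spi = spi;
    }

    /** Called when a node completes the join (phase 3). */
    void onNodeJoined(String nodeId) {
        discoCache.add(nodeId);
    }

    /** Existing style of lookup: only nodes already in the discovery cache. */
    Collection<String> remoteNodes() {
        return List.copyOf(discoCache);
    }

    /** Proposed style of lookup: delegate to the SPI, bypassing the cache. */
    Collection<String> spiRemoteNodes() {
        return List.copyOf(spi.getRemoteNodes());
    }
}
```

With such a method, a validator could count the versions of nodes in phase 2 without waiting for them to reach the discovery cache.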
> *Second problem*
> To properly validate joining nodes, the RU processor must track nodes that
> have passed the RU processor's validation but are still in phase 1 (that is,
> the node is still being validated by another Ignite component).
> Currently, a joining node can be forced to leave the cluster after being
> validated by the RU processor. Ignite components are not notified of this
> (see RingMessageWorker#nodeCheckError). As a result, the RU processor:
> 1. validates the new node
> 2. caches it as a node about to join (so that it is taken into account when
> validating subsequent joining nodes)
> 3. cannot determine whether the joining node is still in the process of
> joining or has just been kicked out of the cluster
> *It is proposed* to raise the existing IgniteNodeValidationFailedEvent in all
> cases where a joining node is forced to leave the cluster due to a validation
> error.
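The bookkeeping this proposal enables can be sketched as follows. PendingJoinTracker and its method names are hypothetical and exist only for this example; the wiring of onValidationFailed to a listener for the validation-failed event is assumed and omitted.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker of nodes that passed RU validation but have not joined yet. */
final class PendingJoinTracker {
    /** Node ID mapped to the product version reported by the joining node. */
    private final Map<UUID, String> pending = new ConcurrentHashMap<>();

    /** The RU processor accepted the node; remember it for later validations. */
    void onValidated(UUID nodeId, String ver) {
        pending.put(nodeId, ver);
    }

    /** The node completed the join and is now visible through regular topology. */
    void onJoined(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Assumed to be called when a validation-failed event arrives for the node. */
    void onValidationFailed(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Distinct versions of nodes that are still in the process of joining. */
    Set<String> pendingVersions() {
        return Set.copyOf(pending.values());
    }
}
```

Without the failure notification, an entry for a rejected node would stay in the map indefinitely and distort the validation of subsequent joining nodes.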