[ 
https://issues.apache.org/jira/browse/IGNITE-28751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Petrov updated IGNITE-28751:
------------------------------------
    Fix Version/s: 2.19

> Refactor TCP Discovery SPI joining node validation
> --------------------------------------------------
>
>                 Key: IGNITE-28751
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28751
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Mikhail Petrov
>            Assignee: Mikhail Petrov
>            Priority: Major
>              Labels: IEP-132, ise
>             Fix For: 2.19
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Motivation.
> The following problems arise during the implementation of the rolling 
> upgrade (RU) mechanism:
> 1. We need to validate joining nodes and ensure that the nodes in the 
> cluster never run more than two different product versions.
> 2. We need to ensure that all nodes in the cluster run a single version 
> when the RU process is about to complete.
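The two version rules above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class RollingUpgradeVersionCheck, its method names, and the use of plain strings for product versions are hypothetical and are not part of the Ignite API. Returning null for an accepted join is a convention chosen for this sketch.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of the Ignite API. */
final class RollingUpgradeVersionCheck {
    /** Maximum number of distinct product versions allowed during an upgrade. */
    static final int MAX_VERSIONS_DURING_UPGRADE = 2;

    private RollingUpgradeVersionCheck() {
        // Static helper, no instances.
    }

    /**
     * Rule 1: reject a join that would bring the cluster to more than two versions.
     *
     * @param topVers Product versions of the nodes already known to the cluster.
     * @param joiningVer Product version of the joining node.
     * @return Error message if the join must be rejected, {@code null} otherwise.
     */
    static String validateJoin(Collection<String> topVers, String joiningVer) {
        Set<String> distinct = new TreeSet<>(topVers);

        distinct.add(joiningVer);

        if (distinct.size() > MAX_VERSIONS_DURING_UPGRADE) {
            return "Joining node version " + joiningVer + " would bring the cluster to "
                + distinct.size() + " distinct versions: " + distinct;
        }

        return null;
    }

    /** Rule 2: the upgrade may complete only when every node runs the same version. */
    static boolean canFinishUpgrade(Collection<String> topVers) {
        return new TreeSet<>(topVers).size() == 1;
    }
}
```

For both checks to be reliable, topVers would have to include nodes that are still in the process of joining, which is the gap the rest of this description addresses.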
> Currently, join validation logic can be implemented in the 
> GridComponent#validateNode() method, which is called for all joining nodes on 
> the coordinator.
> However, the node join process in the TCP Discovery SPI consists of three 
> phases:
> 1. node validation (see RingMessageWorker#processJoinRequestMessage)
> 2. node join process (data exchange between the joining node and the cluster) 
> (see RingMessageWorker#processNodeAddedMessage / 
> processNodeAddFinishedMessage)
> 3. node join completion (the node is added to the discovery cache and becomes 
> visible to all Ignite components) (see DiscoverySpiListener#onDiscovery)
> *First problem* 
> While the joining node is in phase 2, the RU processor cannot see it via 
> GridDiscoveryManager#remoteNodes or similar methods and therefore cannot 
> properly check the current topology.
> *It is proposed* to introduce a new method in GridDiscoveryManager that 
> returns remote nodes by querying the underlying Discovery SPI directly. The 
> necessary mechanism already exists but is not used (see 
> DiscoverySpi#getRemoteNodes).
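The difference between the two views can be modelled with a small self-contained sketch. Everything here is a stand-in: SpiTopologyView, DiscoveryManagerModel, and remoteNodesFromSpi are hypothetical names, nodes are represented by plain strings, and the real GridDiscoveryManager and DiscoverySpi are not used.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Stand-in for the SPI-level view: every node in the ring, including nodes in phase 2. */
interface SpiTopologyView {
    Collection<String> getRemoteNodes();
}

/** Hypothetical model of a discovery manager; not the real GridDiscoveryManager. */
final class DiscoveryManagerModel {
    private final SpiTopologyView spi;

    /** Nodes that completed the join (phase 3) and are visible to components. */
    private final List<String> discoCache = new ArrayList<>();

    DiscoveryManagerModel(SpiTopologyView spi) {
        this.spi = spi;
    }

    /** Called when a node join completes and the node enters the discovery cache. */
    void onJoinCompleted(String nodeId) {
        discoCache.add(nodeId);
    }

    /** Existing-style accessor: sees only nodes that have finished joining. */
    Collection<String> remoteNodes() {
        return new ArrayList<>(discoCache);
    }

    /** Proposed-style accessor (hypothetical name): asks the SPI directly. */
    Collection<String> remoteNodesFromSpi() {
        return new ArrayList<>(spi.getRemoteNodes());
    }
}
```

In this model a node that the SPI already tracks but whose join has not completed appears only in remoteNodesFromSpi(), which is the view a version check would need.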
> *Second problem*
> To properly validate joining nodes, the RU processor must track nodes that 
> have passed the RU processor's validation but are still in phase 1 (that 
> is, the node is still being validated by other Ignite components).
> Currently, a joining node can be forced to leave the cluster after being 
> validated by the RU processor. Ignite components are not notified of this 
> (see RingMessageWorker#nodeCheckError). As a result, the RU processor:
> 1. validates the new node
> 2. caches it as a node about to join (so that it is taken into account when 
> validating subsequent joining nodes)
> 3. cannot determine whether the joining node is still in the process of 
> joining or has just been kicked out of the cluster
> *It is proposed* to raise the existing IgniteNodeValidationFailedEvent in 
> all cases where a joining node is forced to leave the cluster due to a 
> validation error.
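A minimal sketch of the bookkeeping this would enable is shown below. PendingJoinTracker and its method names are hypothetical and versions are plain strings. The sketch assumes that onValidationFailed is invoked from a listener for the validation-failed event and onJoinCompleted from the normal node-joined notification; that wiring is omitted.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for illustration only; not part of Ignite. */
final class PendingJoinTracker {
    /** Nodes that passed RU validation but have not yet joined: node ID to product version. */
    private final Map<UUID, String> pending = new ConcurrentHashMap<>();

    /** The RU processor accepted the node; count it when validating later joins. */
    void onValidated(UUID nodeId, String ver) {
        pending.put(nodeId, ver);
    }

    /** The join completed; the node is now visible through the regular topology. */
    void onJoinCompleted(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Another component rejected the node; stop counting it. */
    void onValidationFailed(UUID nodeId) {
        pending.remove(nodeId);
    }

    /** Distinct versions of nodes that are still about to join. */
    Set<String> pendingVersions() {
        return new HashSet<>(pending.values());
    }
}
```

Without a notification on rejection, an entry for a kicked-out node would stay in the map indefinitely and could cause later joins to be rejected incorrectly.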



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
