denis-chudov commented on code in PR #1644:
URL: https://github.com/apache/ignite-3/pull/1644#discussion_r1100620535
##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design we made some significant decisions and architecture updates:
+- Transaction protocol introduce the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))
+- The protocol of replication itself will be extended to the pluggable abstraction, instead of RAFT-only one. (TODO: add the document link, when it will be ready to share).

Review Comment:
shouldn't we add a link to the ticket along with the TODO?

##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design we made some significant decisions and architecture updates:
+- Transaction protocol introduce the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))

Review Comment:
```suggestion
- Transaction protocol introduced the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))
```

##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design we made some significant decisions and architecture updates:
+- Transaction protocol introduce the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))
+- The protocol of replication itself will be extended to the pluggable abstraction, instead of RAFT-only one. (TODO: add the document link, when it will be ready to share).
+- New distribution zones layer were introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). So, the assignment property is not the part of table configuration anymore.

Review Comment:
```suggestion
- New distribution zones layer was introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). So, the assignment property is not the part of table configuration anymore.
```
##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design we made some significant decisions and architecture updates:
+- Transaction protocol introduce the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))
+- The protocol of replication itself will be extended to the pluggable abstraction, instead of RAFT-only one. (TODO: add the document link, when it will be ready to share).
+- New distribution zones layer were introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). So, the assignment property is not the part of table configuration anymore.
+
+These changes incline us to the thoughts, that we need to revise the current rebalance flow, because it doesn't suite to the new architecture anymore in general and doesn't use the power of new abstractions on the other side.
+
+## Rebalance triggers
+The simplest way to start the journey to the new design: look at the real cases and try to draw the whole picture.
+
+We still has the number of triggers, which trigger a rebalance:
+- Change the number of replicas for any distribution zone.
+- Change the number of partitions for any distribution zone.
+- Change the distribution zone data nodes composition.
+
+Let's take the first one to draw the whole rebalance picture.
+## Change the number of replicas
+
+
+### 1. Update of pending/planned zone assignments
+Update of `zoneId.assignments.*` keys can be expressed by the following pseudo-code:
+```
+var newAssignments = calculateZoneAssignments()
+
+metastoreInvoke: // atomic metastore call through multi-invoke api
+    if empty(zoneId.assignments.change.revision) || zoneId.assignments.change.revision < configurationUpdate.revision:
+        if empty(zoneId.assignments.pending) && zoneId.assignments.stable != newAssignments:
+            zoneId.assignments.pending = newAssignments
+            zoneId.assignments.change.revision = configurationUpdate.revision
+        else:
+            if zoneId.assignments.pending != newAssignments
+                zoneId.assignments.planned = newAssignments
+                zoneId.assignments.change.revision = configurationUpdate.revision
+            else
+                remove(zoneId.assignments.planned)
+    else:
+        skip
+```
+### 2. Wait for the all needed replicas to start rebalance routine
+It looks like we can reuse the mechanism of AwaitReplicaRequest:
+- PrimaryReplica send an AwaitReplicaRequest to all new replicas.
+- When all answers received, rebalance can be started
+
+### 3. Replication group rebalance
+Let's zoom to the details of PrimaryReplica and replication group communication for the RAFT case:
+
+
+* Any replication member can has in-flight RO transactions. But at the same time, if it is not a member of new topology, it will not receive updates of safe time, so these RO transactions will be failed by timeout. It is not an issue for correctness, but we want to optimise this in future to avoid redundant transaction fails and retries [TODO].

Review Comment:
Should this be a part of the transaction protocol improvements, or of rebalance? If the former, a reference to the tx protocol is probably needed; if the latter, shouldn't we add a link to a ticket?
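Editor's note: the pseudo-code quoted in section 1 above is the heart of this step, so a restated sketch may help readers of this thread. This is a minimal Java rendering of the same branching, assuming a hypothetical `Decision` holder and plain string key names rather than the actual meta-storage API; in the real flow the whole decision runs as one atomic metastore multi-invoke, not a local function.

```java
import java.util.Map;
import java.util.Set;

/**
 * Illustrative restatement of the "update of pending/planned zone assignments"
 * pseudo-code from section 1 of rebalance.md. Key names and the Decision type
 * are placeholders only.
 */
final class ZoneAssignmentsUpdateSketch {
    static final String PENDING = "zoneId.assignments.pending";
    static final String PLANNED = "zoneId.assignments.planned";
    static final String CHANGE_REVISION = "zoneId.assignments.change.revision";

    /** Keys to put and keys to remove in one atomic metastore invoke. */
    record Decision(Map<String, Object> puts, Set<String> removes) {
        static final Decision SKIP = new Decision(Map.of(), Set.of());
    }

    static Decision decide(
            Set<String> newAssignments,       // result of calculateZoneAssignments()
            Set<String> stable,               // current zoneId.assignments.stable
            Set<String> pending,              // current zoneId.assignments.pending (empty set if absent)
            Long changeRevision,              // current zoneId.assignments.change.revision (null if absent)
            long configurationUpdateRevision  // revision of the triggering configuration update
    ) {
        // Stale trigger: an equal or newer configuration revision was already applied.
        if (changeRevision != null && changeRevision >= configurationUpdateRevision) {
            return Decision.SKIP;
        }

        if (pending.isEmpty() && !newAssignments.equals(stable)) {
            // No rebalance in flight: schedule one directly via the pending key.
            return new Decision(
                    Map.of(PENDING, newAssignments, CHANGE_REVISION, configurationUpdateRevision),
                    Set.of());
        } else if (!newAssignments.equals(pending)) {
            // A rebalance is in flight: queue the new target as planned.
            return new Decision(
                    Map.of(PLANNED, newAssignments, CHANGE_REVISION, configurationUpdateRevision),
                    Set.of());
        } else {
            // The in-flight rebalance already targets the new assignments: drop any planned key.
            return new Decision(Map.of(), Set.of(PLANNED));
        }
    }
}
```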
##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design we made some significant decisions and architecture updates:
+- Transaction protocol introduce the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))
+- The protocol of replication itself will be extended to the pluggable abstraction, instead of RAFT-only one. (TODO: add the document link, when it will be ready to share).
+- New distribution zones layer were introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). So, the assignment property is not the part of table configuration anymore.
+
+These changes incline us to the thoughts, that we need to revise the current rebalance flow, because it doesn't suite to the new architecture anymore in general and doesn't use the power of new abstractions on the other side.
+
+## Rebalance triggers
+The simplest way to start the journey to the new design: look at the real cases and try to draw the whole picture.
+
+We still has the number of triggers, which trigger a rebalance:
+- Change the number of replicas for any distribution zone.
+- Change the number of partitions for any distribution zone.
+- Change the distribution zone data nodes composition.
+
+Let's take the first one to draw the whole rebalance picture.
+## Change the number of replicas
+
+
+### 1. Update of pending/planned zone assignments
+Update of `zoneId.assignments.*` keys can be expressed by the following pseudo-code:
+```
+var newAssignments = calculateZoneAssignments()
+
+metastoreInvoke: // atomic metastore call through multi-invoke api
+    if empty(zoneId.assignments.change.revision) || zoneId.assignments.change.revision < configurationUpdate.revision:
+        if empty(zoneId.assignments.pending) && zoneId.assignments.stable != newAssignments:
+            zoneId.assignments.pending = newAssignments
+            zoneId.assignments.change.revision = configurationUpdate.revision
+        else:
+            if zoneId.assignments.pending != newAssignments
+                zoneId.assignments.planned = newAssignments
+                zoneId.assignments.change.revision = configurationUpdate.revision
+            else
+                remove(zoneId.assignments.planned)
+    else:
+        skip
+```
+### 2. Wait for the all needed replicas to start rebalance routine
+It looks like we can reuse the mechanism of AwaitReplicaRequest:
+- PrimaryReplica send an AwaitReplicaRequest to all new replicas.
+- When all answers received, rebalance can be started
+
+### 3. Replication group rebalance
+Let's zoom to the details of PrimaryReplica and replication group communication for the RAFT case:
+
+
+* Any replication member can has in-flight RO transactions. But at the same time, if it is not a member of new topology, it will not receive updates of safe time, so these RO transactions will be failed by timeout. It is not an issue for correctness, but we want to optimise this in future to avoid redundant transaction fails and retries [TODO].
+
+#### 3.1 Notification about the new leader and rebalance events
+Current rebalance algorithm based on the metastore invokes and local rebalance listeners.
+
+But for the new one we have an idea, which doesn't need the metastore at all:
+- On rebalanceDone/rebalanceError/leaderElected events the local event listener send a message to PrimaryReplica with the description of event
+- If PrimaryReplica is not available - we should retry send, until leader didn't find himself outdated (in this case, new leader will send leaderElected event to PrimaryReplica and receives the rebalance request again.
+
+### 4. Stop the redundant replicas and update replicas clients
+Here we can accidentally stop PrimaryReplica, so we need to use the algorithm of a graceful PrimaryReplica transfer, if needed.
+
+Also, we need to update the replication protocol clients (RaftGroupService, for example) on each Replica.
+
+### Failover logic
+The main idea of failover process: every rebalance request PlacementDriver->PrimaryReplica or PrimaryReplica->ReplicationGroup must be idempotent. So, redundant request in the worst case should be just answered by positive answer (just like rebalance is already done).
+
+After that we can prepare the following logic:
+- On every new PD leader elected - it must check the direct value (not the locally cached one) of `zoneId.assignment.pending` keys and send RebalanceRequest to needed PrimaryReplicas and then listen updates from the last revision.
+- On every PrimaryReplica reelection by PD it must send the new RebalanceRequest to PrimaryReplica, if pending key is not empty.
+- On every leader reelection (for the leader oriented protocols) inside the replication group - leader send leaderElected event to PrimaryReplica, which force PrimaryReplica to send RebalanceRequest to RAFT leader again.

Review Comment:
It seems that RAFT shouldn't be mentioned here, as the first part of the sentence is about leader-oriented protocols in general.
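Editor's note: the failover invariant quoted above — every RebalanceRequest must be idempotent — could look roughly like the sketch below on the PrimaryReplica side. All type and method names here (`RebalanceRequest`, `RebalanceResponse`, `ReplicationGroupClient`) are hypothetical placeholders, not identifiers from the codebase; the sketch only shows that a duplicate request for an already-applied or already-running target configuration is acknowledged instead of re-executed.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Sketch of idempotent RebalanceRequest handling on a primary replica. */
final class PrimaryReplicaRebalanceSketch {
    /** Hypothetical message: the desired peer set for the replication group. */
    record RebalanceRequest(Set<String> targetPeers) { }

    /** Hypothetical acknowledgement sent back to the PlacementDriver. */
    record RebalanceResponse(boolean done) { }

    /** Hypothetical client to the underlying replication group. */
    interface ReplicationGroupClient {
        Set<String> currentPeers();

        CompletableFuture<Void> changePeersAsync(Set<String> targetPeers);
    }

    private final ReplicationGroupClient group;

    /** Target of the rebalance currently in progress, or null if none. */
    private volatile Set<String> inProgressTarget;

    PrimaryReplicaRebalanceSketch(ReplicationGroupClient group) {
        this.group = group;
    }

    synchronized CompletableFuture<RebalanceResponse> handle(RebalanceRequest request) {
        Set<String> target = request.targetPeers();

        // Already applied: answer positively, as if the rebalance just finished.
        if (group.currentPeers().equals(target)) {
            return CompletableFuture.completedFuture(new RebalanceResponse(true));
        }

        // Already in progress for the same target: do not start a second peer change.
        if (target.equals(inProgressTarget)) {
            return CompletableFuture.completedFuture(new RebalanceResponse(true));
        }

        inProgressTarget = target;

        return group.changePeersAsync(target)
                .handle((ignored, err) -> {
                    inProgressTarget = null;
                    return new RebalanceResponse(err == null);
                });
    }
}
```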
##########
modules/raft/tech-notes/src/flow.puml:
##########
@@ -0,0 +1,29 @@
+@startuml flow
+title General Rebalance Flow (diagram 1)
+
+skinparam maxMessageSize 400
+skinparam defaultFontSize 12
+
+User -> DistributionConfiguration : Change a number of replicas
+
+DistributionConfiguration --> DistributionZoneManager : Receives an update through the distribution configuration listener.
+
+DistributionZoneManager -> Metastore : Calculate new assignments based on the current datanodes and put the result of calculations to **zoneId.assignments.pending**/**planned** key [see 1]
+note left
+this put must be covered by the similar logic
+which we have in the current rebalance
+to prevent the metastore call from multiple nodes
+due to the fact, because each node has DistributionZoneManager
+and listens configuration updates
+end note
+
+Metastore --> DistributionZoneManager : Receives an update of **zoneId.assignments.pending** key and starts replica server if needed
+Metastore --> PlacementDriver : Receives an update of **zoneId.assignments.pending** key.
+PlacementDriver -> PartitionPrimaryReplica : Send a RebalanceRequest to PrimaryReplica for the rebalance of its group
+PartitionPrimaryReplica -> PartitionPrimaryReplica: Await for all replicas start [see 2]
+PartitionPrimaryReplica -> PartitionPrimaryReplica : Process replication group update [see 3 and separate diagram 2]
+PartitionPrimaryReplica -> PlacementDriver : Notify about rebalance done. PlacementDriver updates itself's cache for rebalanced group with the addresses of the new members.
+PlacementDriver -> Metastore : Put the **zoneId.assignments.stable** key
+Metastore --> DistributionZoneManager : Receives the update about **zoneId.assignments.stable** update and stop the unneeded replication group members on the current node, if needed [see 4]

Review Comment:
`[see 4]` - but the referenced paragraph doesn't say anything about stopping unneeded group members.
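Editor's note: regarding `[see 4]`, the missing piece is the local reaction to a `zoneId.assignments.stable` update. A hedged sketch of what that reaction might look like on each node follows; `ReplicaLifecycleManager` and all method names are hypothetical stand-ins for the real replica management code.

```java
import java.util.Set;

/**
 * Sketch of the reaction to a zoneId.assignments.stable update on a single node
 * (the last arrow of the flow diagram): a node stops its local replica of the
 * group when it is no longer part of the stable assignments, and otherwise
 * refreshes its replication protocol client.
 */
final class StableAssignmentsListenerSketch {
    /** Hypothetical facade over local replica start/stop and client updates. */
    interface ReplicaLifecycleManager {
        boolean hasLocalReplica(String groupId);

        void stopLocalReplica(String groupId);

        void updateReplicationClient(String groupId, Set<String> stablePeers);
    }

    private final String localNodeName;
    private final ReplicaLifecycleManager replicas;

    StableAssignmentsListenerSketch(String localNodeName, ReplicaLifecycleManager replicas) {
        this.localNodeName = localNodeName;
        this.replicas = replicas;
    }

    /** Invoked from the metastore watch on zoneId.assignments.stable. */
    void onStableAssignmentsChanged(String groupId, Set<String> stablePeers) {
        if (!replicas.hasLocalReplica(groupId)) {
            return; // This node hosts no replica of the group.
        }

        if (!stablePeers.contains(localNodeName)) {
            // This node is no longer part of the group: stop the redundant local member.
            // (If it currently hosts the PrimaryReplica, a graceful transfer must happen first,
            // as noted in section 4 of rebalance.md.)
            replicas.stopLocalReplica(groupId);
        } else {
            // Still a member: refresh the replication protocol client (e.g. RaftGroupService peers).
            replicas.updateReplicationClient(groupId, stablePeers);
        }
    }
}
```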
##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design we made some significant decisions and architecture updates:
+- Transaction protocol introduce the new cluster/group wide roles like Tracker (Placement Driver), LeaseHolder (Primary replica) and etc. (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol))
+- The protocol of replication itself will be extended to the pluggable abstraction, instead of RAFT-only one. (TODO: add the document link, when it will be ready to share).
+- New distribution zones layer were introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). So, the assignment property is not the part of table configuration anymore.
+
+These changes incline us to the thoughts, that we need to revise the current rebalance flow, because it doesn't suite to the new architecture anymore in general and doesn't use the power of new abstractions on the other side.
+
+## Rebalance triggers
+The simplest way to start the journey to the new design: look at the real cases and try to draw the whole picture.
+
+We still has the number of triggers, which trigger a rebalance:
+- Change the number of replicas for any distribution zone.
+- Change the number of partitions for any distribution zone.
+- Change the distribution zone data nodes composition.
+
+Let's take the first one to draw the whole rebalance picture.
+## Change the number of replicas

Review Comment:
The document structure implies that for now we are considering only the first trigger - is that so?

##########
modules/raft/tech-notes/src/primaryReplica.puml:
##########
@@ -0,0 +1,34 @@
+@startuml primaryReplica
+title PrimaryReplica and Replication Group communication (diagram 1)
+
+skinparam maxMessageSize 400
+skinparam defaultFontSize 12
+
+participant PlacementDriver
+participant PrimaryReplica
+
+participant Replica1 [
+Replica1
+Leader for term 1
+]
+
+participant Replica2 [
+Replica2
+Leader for term 2
+]
+
+PrimaryReplica -> Replica1 : Send a changePeersAsync request (node is the leader at the moment)
+Replica1 -> Replica1 : Leader was stepped down and new leader election will start.
+Replica2 -> Replica2 : Current node elected as a leader.
+Replica2 -> PrimaryReplica : Send a message about the new leader elected. [see 3.1 details]
+PrimaryReplica -> PrimaryReplica : Check local state for ongoing rebalance for the replica group.

Review Comment:
Actually, Replica1 should transfer leadership to the primary replica, to make the leader colocated with the primary, instead of just stepping down. But as far as I can see, this diagram is correct in terms of rebalancing and maybe describes the flow better. It's up to you whether to change anything here.
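Editor's note: the "new leader elected" arrow in this diagram points at section 3.1 of rebalance.md, where the notification is retried until the PrimaryReplica answers or the sender discovers it is no longer the leader. A minimal sketch of that retry loop, assuming a hypothetical `NotificationSender` interface and an arbitrary 500 ms retry delay:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Sketch of the group-member side of section 3.1 / diagram 2: the newly elected
 * leader keeps notifying the PrimaryReplica about the leaderElected event until
 * the send succeeds or the node finds itself outdated.
 */
final class LeaderElectedNotifierSketch {
    /** Hypothetical transport towards the PrimaryReplica. */
    interface NotificationSender {
        /** Completes normally if the PrimaryReplica acknowledged the event. */
        CompletableFuture<Void> sendLeaderElected(String groupId, long term);
    }

    private final NotificationSender sender;
    private final ScheduledExecutorService retryExecutor = Executors.newSingleThreadScheduledExecutor();

    LeaderElectedNotifierSketch(NotificationSender sender) {
        this.sender = sender;
    }

    void notifyPrimaryReplica(String groupId, long term, BooleanSupplier stillLeader) {
        sender.sendLeaderElected(groupId, term).whenComplete((ignored, err) -> {
            if (err == null) {
                return; // PrimaryReplica received the event and will re-send the rebalance request if needed.
            }

            if (!stillLeader.getAsBoolean()) {
                return; // This node found itself outdated: the new leader will notify the PrimaryReplica instead.
            }

            // PrimaryReplica is unavailable: retry after a delay.
            retryExecutor.schedule(
                    () -> notifyPrimaryReplica(groupId, term, stillLeader),
                    500, TimeUnit.MILLISECONDS);
        });
    }
}
```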
