kgusakov commented on code in PR #1644:
URL: https://github.com/apache/ignite-3/pull/1644#discussion_r1102842981


##########
modules/raft/tech-notes/rebalance.md:
##########
@@ -0,0 +1,71 @@
+## Introduction
+Since the last rebalance design, we have made several significant decisions and architecture updates:
+- The transaction protocol introduces new cluster/group-wide roles such as Tracker (Placement Driver) and LeaseHolder (Primary replica) (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol)).
+- The replication protocol itself will be extended to a pluggable abstraction instead of a RAFT-only one. (TODO: add the document link when it is ready to share.)
+- A new distribution zones layer was introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). As a result, the assignment property is no longer part of the table configuration.
+
+These changes mean that we need to revise the current rebalance flow: it no longer fits the new architecture in general, and it does not use the power of the new abstractions.
+
+## Rebalance triggers
+The simplest way to start the journey to the new design is to look at the real cases and try to draw the whole picture.
+
+We still have a number of triggers that start a rebalance:
+- Change the number of replicas for any distribution zone.
+- Change the number of partitions for any distribution zone.
+- Change the distribution zone data nodes composition.
+
+Let's take the first one to draw the whole rebalance picture.
+## Change the number of replicas
+![](images/flow.svg)
+
+### 1. Update of pending/planned zone assignments
+The update of the `zoneId.assignments.*` keys can be expressed by the following pseudo-code:
+```
+var newAssignments = calculateZoneAssignments()
+
+metastoreInvoke: // atomic metastore call through multi-invoke api
+    if empty(zoneId.assignments.change.revision) || zoneId.assignments.change.revision < configurationUpdate.revision:
+        if empty(zoneId.assignments.pending) && zoneId.assignments.stable != newAssignments:
+            zoneId.assignments.pending = newAssignments
+            zoneId.assignments.change.revision = configurationUpdate.revision
+        else:
+            if zoneId.assignments.pending != newAssignments:
+                zoneId.assignments.planned = newAssignments
+                zoneId.assignments.change.revision = configurationUpdate.revision
+            else:
+                remove(zoneId.assignments.planned)
+    else:
+        skip
+```
+### 2. Wait for all needed replicas before starting the rebalance routine
+It looks like we can reuse the mechanism of AwaitReplicaRequest:
+- PrimaryReplica sends an AwaitReplicaRequest to all new replicas.
+- When all answers are received, the rebalance can be started.
+
+### 3. Replication group rebalance
+Let's zoom in on the details of the communication between PrimaryReplica and the replication group for the RAFT case:
+![](images/primaryReplica.svg)
+
+* Any member of the replication group can have in-flight RO transactions. However, if it is not a member of the new topology, it will not receive safe time updates, so these RO transactions will fail by timeout. This is not a correctness issue, but we want to optimise it in the future to avoid redundant transaction failures and retries [TODO].

Review Comment:
   Change the sentence and add a link to the transaction protocol.
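
For illustration only, the update rule from step 1 of the quoted document can be modelled in Java as below. This is a minimal in-memory sketch under stated assumptions, not Ignite code: the `ZoneAssignments` class, its fields, and `onConfigurationUpdate` are hypothetical names, plain node-name sets stand in for real assignments, and `synchronized` is only a stand-in for the atomicity that the metastore multi-invoke is expected to provide.

```java
import java.util.Objects;
import java.util.Set;

/** Hypothetical in-memory model of the zoneId.assignments.* keys (not Ignite API). */
final class ZoneAssignments {
    Set<String> stable;   // zoneId.assignments.stable
    Set<String> pending;  // zoneId.assignments.pending, null when the key is empty
    Set<String> planned;  // zoneId.assignments.planned, null when the key is empty
    Long changeRevision;  // zoneId.assignments.change.revision, null when the key is empty

    ZoneAssignments(Set<String> stable) {
        this.stable = stable;
    }

    /**
     * Mirrors the pseudo-code branch by branch.
     * Returns false when the update is skipped as stale, true otherwise.
     */
    synchronized boolean onConfigurationUpdate(Set<String> newAssignments, long updateRevision) {
        // Outer guard: only updates with a newer configuration revision are processed.
        if (changeRevision != null && changeRevision >= updateRevision) {
            return false; // skip
        }
        if (pending == null && !Objects.equals(stable, newAssignments)) {
            // No rebalance in progress and the target differs from stable: schedule it.
            pending = newAssignments;
            changeRevision = updateRevision;
        } else if (!Objects.equals(pending, newAssignments)) {
            // Queue the new target as planned. As in the pseudo-code, this branch is
            // also taken when pending is empty and stable already equals the target.
            planned = newAssignments;
            changeRevision = updateRevision;
        } else {
            // The pending rebalance already leads to the requested target.
            planned = null;
        }
        return true;
    }

    public static void main(String[] args) {
        ZoneAssignments zone = new ZoneAssignments(Set.of("node1", "node2"));
        zone.onConfigurationUpdate(Set.of("node1", "node2", "node3"), 5L);
        System.out.println("pending=" + zone.pending + ", revision=" + zone.changeRevision);
    }
}
```

A second call with the same or an older revision returns `false` and leaves all keys untouched, which corresponds to the `skip` branch of the pseudo-code.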



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

