This is an automated email from the ASF dual-hosted git repository.
sk0x50 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
The following commit(s) were added to refs/heads/main by this push:
new 15ae87d535 IGNITE-18634 Document for redesign of rebalance process.
Fixes #1644
15ae87d535 is described below
commit 15ae87d535240aa5b7438673677eefb67c4d363a
Author: Kirill Gusakov <[email protected]>
AuthorDate: Thu Feb 16 20:57:07 2023 +0200
IGNITE-18634 Document for redesign of rebalance process. Fixes #1644
Signed-off-by: Slava Koptilin <[email protected]>
---
.../distribution-zones/tech-notes/images/flow.svg | 1 +
.../tech-notes/images/primaryReplica.svg | 1 +
modules/distribution-zones/tech-notes/rebalance.md | 72 ++++++++++++++++++++++
.../distribution-zones/tech-notes/src/flow.puml | 29 +++++++++
.../tech-notes/src/primaryReplica.puml | 34 ++++++++++
5 files changed, 137 insertions(+)
diff --git a/modules/distribution-zones/tech-notes/images/flow.svg
b/modules/distribution-zones/tech-notes/images/flow.svg
new file mode 100644
index 0000000000..06bb3d5a18
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/images/flow.svg
@@ -0,0 +1 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?><svg
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"
contentStyleType="text/css" height="648px" preserveAspectRatio="none"
style="width:2216px;height:648px;background:#FFFFFF;" version="1.1" viewBox="0
0 2216 648" width="2216px" zoomAndPan="magnify"><defs/><g><text fill="#000000"
font-family="sans-serif" font-size="14" font-weight="bold"
lengthAdjust="spacing" textLength="258" x="978.5" y="28.5352">Genera [...]
\ No newline at end of file
diff --git a/modules/distribution-zones/tech-notes/images/primaryReplica.svg
b/modules/distribution-zones/tech-notes/images/primaryReplica.svg
new file mode 100644
index 0000000000..d3c1ab27b0
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/images/primaryReplica.svg
@@ -0,0 +1 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?><svg
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"
contentStyleType="text/css" height="480px" preserveAspectRatio="none"
style="width:1400px;height:480px;background:#FFFFFF;" version="1.1" viewBox="0
0 1400 480" width="1400px" zoomAndPan="magnify"><defs/><g><text fill="#000000"
font-family="sans-serif" font-size="14" font-weight="bold"
lengthAdjust="spacing" textLength="477" x="461" y="28.5352">PrimaryR [...]
\ No newline at end of file
diff --git a/modules/distribution-zones/tech-notes/rebalance.md
b/modules/distribution-zones/tech-notes/rebalance.md
new file mode 100644
index 0000000000..3f33f8869f
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/rebalance.md
@@ -0,0 +1,72 @@
+## Introduction
+Since the previous rebalance design, we have made some significant decisions and architecture updates:
+- The transaction protocol introduced new cluster-/group-wide roles such as Tracker (Placement Driver) and LeaseHolder (Primary Replica) (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol)).
+- The replication protocol itself will be extended into a pluggable abstraction instead of the RAFT-only one. (TODO: IGNITE-18775 add the document link when it is ready to share.)
+- A new distribution zones layer was introduced (see [Distribution Zones](https://cwiki.apache.org/confluence/display/IGNITE/IEP-97%3A+Distribution+Zones)). As a result, the assignments property is no longer part of the table configuration.
+
+These changes suggest that we need to revise the current rebalance flow: on the one hand it no longer suits the new architecture, and on the other hand it does not use the power of the new abstractions.
+
+## Rebalance triggers
+The simplest way to start designing the new flow is to look at the real cases and try to draw the whole picture.
+
+We still have a number of triggers that cause a rebalance:
+- A change of the number of replicas for any distribution zone.
+- A change of the number of partitions for any distribution zone.
+- A change of the distribution zone data nodes composition.
+
+Let's take the first one to draw the whole rebalance picture.
+## Change the number of replicas
+
+
+### 1. Update of pending/planned zone assignments
+An update of the `zoneId.assignments.*` keys can be expressed by the following pseudo-code:
+```
+var newAssignments = calculateZoneAssignments()
+
+metastoreInvoke: // atomic metastore call through the multi-invoke API
+    if empty(zoneId.assignments.change.revision) || zoneId.assignments.change.revision < configurationUpdate.revision:
+        if empty(zoneId.assignments.pending) && zoneId.assignments.stable != newAssignments:
+            zoneId.assignments.pending = newAssignments
+            zoneId.assignments.change.revision = configurationUpdate.revision
+        else:
+            if zoneId.assignments.pending != newAssignments:
+                zoneId.assignments.planned = newAssignments
+                zoneId.assignments.change.revision = configurationUpdate.revision
+            else:
+                remove(zoneId.assignments.planned)
+    else:
+        skip
+```
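The compare-and-act logic above can be modeled in plain Java. This is only an in-memory sketch, not the real Ignite meta-storage API: the `AssignmentsUpdater` class, its key names, and `onReplicasUpdate` are hypothetical, and a `synchronized` method stands in for the atomicity of the metastore multi-invoke.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// In-memory model of the atomic multi-invoke (NOT the real Ignite meta-storage API).
class AssignmentsUpdater {
    final Map<String, Object> keys = new HashMap<>(); // models the zoneId.assignments.* keys

    /** Returns false when the update is skipped as stale, true otherwise. */
    synchronized boolean onReplicasUpdate(long updateRevision, String newAssignments) {
        Long changeRevision = (Long) keys.get("change.revision");
        if (changeRevision != null && changeRevision >= updateRevision) {
            return false; // the same or a newer configuration update already won: skip
        }
        if (keys.get("pending") == null && !Objects.equals(keys.get("stable"), newAssignments)) {
            keys.put("pending", newAssignments); // no rebalance in progress: schedule one
            keys.put("change.revision", updateRevision);
        } else if (!Objects.equals(keys.get("pending"), newAssignments)) {
            keys.put("planned", newAssignments); // rebalance in progress: plan the next one
            keys.put("change.revision", updateRevision);
        } else {
            keys.remove("planned"); // the in-progress rebalance already targets these assignments
        }
        return true;
    }
}
```

Note how a stale configuration revision is rejected up front, which is what makes it safe for every node's DistributionZoneManager to attempt this call concurrently.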
+### 2. Wait for all needed replicas to start the rebalance routine
+It looks like we can reuse the mechanism of AwaitReplicaRequest:
+- The PrimaryReplica sends an AwaitReplicaRequest to all new replicas.
+- When all answers are received, the rebalance can be started.
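As a rough Java sketch (class and method names here are illustrative, not the actual replica service API), this "wait for all" step is a plain fan-in over per-replica response futures:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch only: the PrimaryReplica fans out AwaitReplicaRequest and may start the
// rebalance only when every new replica has answered.
class RebalancePrecondition {
    static CompletableFuture<Void> allReplicasStarted(List<CompletableFuture<Void>> awaitResponses) {
        // Completes only when ALL AwaitReplicaRequest responses have arrived.
        return CompletableFuture.allOf(awaitResponses.toArray(new CompletableFuture[0]));
    }
}
```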
+
+### 3. Replication group rebalance
+Let's zoom in on the details of the PrimaryReplica and replication group communication for the RAFT case:
+
+
+* Any replication group member can have in-flight RO transactions. But at the same time, if it is not a member of the new topology, it will not receive safe time updates, so these RO transactions will fail by timeout or even earlier (see [Transaction Protocol](https://cwiki.apache.org/confluence/display/IGNITE/IEP-91%3A+Transaction+protocol)). So, it is not an issue for correctness.
+
+#### 3.1 Notification about the new leader and rebalance events
+The current rebalance algorithm is based on metastore invokes and local rebalance listeners.
+
+But for the new one we have an idea which doesn't need the metastore at all:
+- On rebalanceDone/rebalanceError/leaderElected events, the local event listener sends a message to the PrimaryReplica with a description of the event.
+- If the PrimaryReplica is not available, we should retry sending until the leader finds itself outdated (in this case, the new leader will send a leaderElected event to the PrimaryReplica and receive the rebalance request again).
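A minimal sketch of that retry rule follows; `EventSender` and all the names here are invented for illustration and are not the real Ignite messaging API:

```java
import java.util.function.BooleanSupplier;

// Sketch of the retry rule from 3.1: keep resending the event to the PrimaryReplica
// until it is acknowledged or the sender learns that it is no longer the leader.
interface EventSender {
    boolean trySend(String event); // true when the PrimaryReplica acknowledged the event
}

class RetryingNotifier {
    /** Returns the number of send attempts actually made. */
    static int notifyPrimaryReplica(EventSender sender, String event,
                                    BooleanSupplier leaderIsOutdated, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts && !leaderIsOutdated.getAsBoolean()) {
            attempts++;
            if (sender.trySend(event)) {
                return attempts; // delivered
            }
            // PrimaryReplica unavailable: retry (a real implementation would back off here).
        }
        // Either the leader is outdated (the new leader will send leaderElected itself)
        // or the attempts are exhausted.
        return attempts;
    }
}
```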
+
+### 4. Stop the redundant replicas and update replica clients
+Here we need to:
+- Stop the redundant replicas, which are not in the current stable assignments.
+  - We can accidentally stop the PrimaryReplica here, so we need to use the graceful PrimaryReplica transfer algorithm, if needed.
+- Update the replication protocol clients (RaftGroupService, for example) on each Replica.
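Under the (illustrative) assumption that assignments are just sets of node names, choosing the replicas to stop is a set difference between what runs locally and the new stable assignments:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch for step 4: the replicas to stop are those that are currently running
// but are absent from the new stable assignments. Names are illustrative.
class RedundantReplicas {
    static Set<String> toStop(Set<String> running, Set<String> stableAssignments) {
        Set<String> redundant = new HashSet<>(running);
        redundant.removeAll(stableAssignments); // keep only members missing from the stable set
        return redundant;
    }
}
```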
+
+### Failover logic
+The main idea of the failover process: every rebalance request (PlacementDriver->PrimaryReplica or PrimaryReplica->ReplicationGroup) must be idempotent. So, in the worst case, a redundant request should simply receive a positive answer (as if the rebalance were already done).
+
+With that in place, we can prepare the following logic:
+- When a new PD leader is elected, it must check the direct value (not the locally cached one) of the `zoneId.assignments.pending` keys, send a RebalanceRequest to the needed PrimaryReplicas, and then listen to updates from the last revision.
+- On every PrimaryReplica re-election by the PD, it must send a new RebalanceRequest to the PrimaryReplica if the pending key is not empty.
+- On every leader re-election inside the replication group (for leader-oriented protocols), the leader sends a leaderElected event to the PrimaryReplica, which forces the PrimaryReplica to send a RebalanceRequest to the replication group leader again.
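The idempotency requirement can be sketched as follows. This is a toy model in which a rebalance "completes" instantly; the class, enum, and method names are invented for illustration and are not the real Ignite API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the idempotency rule: a repeated RebalanceRequest for assignments
// that were already applied just gets a positive answer instead of redoing work.
class IdempotentRebalanceHandler {
    enum Response { STARTED, ALREADY_DONE }

    private final Map<String, String> stable = new ConcurrentHashMap<>();

    Response onRebalanceRequest(String groupId, String targetAssignments) {
        if (targetAssignments.equals(stable.get(groupId))) {
            return Response.ALREADY_DONE; // redundant request: positive answer, no work
        }
        stable.put(groupId, targetAssignments); // model: the rebalance finishes immediately
        return Response.STARTED;
    }
}
```

Because a retried request cannot cause a second rebalance, any of the three failover rules above may fire spuriously without harming correctness.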
+
+
diff --git a/modules/distribution-zones/tech-notes/src/flow.puml
b/modules/distribution-zones/tech-notes/src/flow.puml
new file mode 100644
index 0000000000..6d85c22d06
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/src/flow.puml
@@ -0,0 +1,29 @@
+@startuml flow
+title General Rebalance Flow (diagram 1)
+
+skinparam maxMessageSize 400
+skinparam defaultFontSize 12
+
+User -> DistributionConfiguration : Change the number of replicas
+
+DistributionConfiguration --> DistributionZoneManager : Receives an update
through the distribution configuration listener.
+
+DistributionZoneManager -> Metastore : Calculate new assignments based on the current data nodes and put the result of the calculation into the **zoneId.assignments.pending**/**planned** key [see 1]
+note left
+this put must be guarded by logic similar to
+the one we have in the current rebalance,
+to prevent the metastore call from multiple nodes,
+because each node has a DistributionZoneManager
+and listens to configuration updates
+end note
+
+Metastore --> DistributionZoneManager : Receives an update of the **zoneId.assignments.pending** key and starts the replica server if needed
+Metastore --> PlacementDriver : Receives an update of the **zoneId.assignments.pending** key.
+PlacementDriver -> PartitionPrimaryReplica : Send a RebalanceRequest to the PrimaryReplica for the rebalance of its group
+PartitionPrimaryReplica -> PartitionPrimaryReplica : Await the start of all replicas [see 2]
+PartitionPrimaryReplica -> PartitionPrimaryReplica : Process the replication group update [see 3 and separate diagram 2]
+PartitionPrimaryReplica -> PlacementDriver : Notify about rebalance done. The PlacementDriver updates its cache for the rebalanced group with the addresses of the new members.
+PlacementDriver -> Metastore : Put the **zoneId.assignments.stable** key
+Metastore --> DistributionZoneManager : Receives the **zoneId.assignments.stable** update and stops the unneeded replication group members on the current node, if needed [see 4]
+DistributionZoneManager -> DistributionZoneManager : Check if the **zoneId.assignments.planned** key is empty and start a new rebalance if not
+@enduml
diff --git a/modules/distribution-zones/tech-notes/src/primaryReplica.puml
b/modules/distribution-zones/tech-notes/src/primaryReplica.puml
new file mode 100644
index 0000000000..7a658d3a33
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/src/primaryReplica.puml
@@ -0,0 +1,34 @@
+@startuml primaryReplica
+title PrimaryReplica and Replication Group communication (diagram 2)
+
+skinparam maxMessageSize 400
+skinparam defaultFontSize 12
+
+participant PlacementDriver
+participant PrimaryReplica
+
+participant Replica1 [
+Replica1
+Leader for term 1
+]
+
+participant Replica2 [
+Replica2
+Leader for term 2
+]
+
+PrimaryReplica -> Replica1 : Send a changePeersAsync request (this node is the leader at the moment)
+Replica1 -> Replica1 : The leader has stepped down and a new leader election starts.
+Replica2 -> Replica2 : The current node is elected as the leader.
+Replica2 -> PrimaryReplica : Send a message about the newly elected leader [see 3.1 for details]
+PrimaryReplica -> PrimaryReplica : Check the local state for an ongoing rebalance of the replica group.
+note left
+we can use local state here,
+because if the PrimaryReplica fails,
+the PlacementDriver will choose another one
+and start the rebalance again by itself
+end note
+PrimaryReplica -> Replica2 : Send changePeers to the new leader again
+Replica2 -> PrimaryReplica : Rebalance done message
+PrimaryReplica -> PlacementDriver : Rebalance done message. The PD then performs the other operations from the general rebalance diagram.
+@enduml