This is an automated email from the ASF dual-hosted git repository.
sk0x50 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
The following commit(s) were added to refs/heads/main by this push:
new c276a334c4 IGNITE-17056 Added design documents for rebalance
cancellation. Fixes #1676
c276a334c4 is described below
commit c276a334c4c3742520494bccdc957f4530c8ed7a
Author: Kirill Gusakov <[email protected]>
AuthorDate: Wed Feb 22 15:48:41 2023 +0200
IGNITE-17056 Added design documents for rebalance cancellation. Fixes #1676
Signed-off-by: Slava Koptilin <[email protected]>
---
.../tech-notes/images/cancelRebalance.svg | 1 +
.../distribution-zones/tech-notes/images/flow.svg | 2 +-
modules/distribution-zones/tech-notes/rebalance.md | 56 +++++++++++++++++++---
.../tech-notes/src/cancelRebalance.puml | 18 +++++++
.../distribution-zones/tech-notes/src/flow.puml | 5 +-
5 files changed, 72 insertions(+), 10 deletions(-)
diff --git a/modules/distribution-zones/tech-notes/images/cancelRebalance.svg
b/modules/distribution-zones/tech-notes/images/cancelRebalance.svg
new file mode 100644
index 0000000000..8388de8e45
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/images/cancelRebalance.svg
@@ -0,0 +1 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?><svg
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"
contentStyleType="text/css" height="620px" preserveAspectRatio="none"
style="width:1804px;height:620px;background:#FFFFFF;" version="1.1" viewBox="0
0 1804 620" width="1804px" zoomAndPan="magnify"><defs/><g><line
style="stroke:#181818;stroke-width:0.5;stroke-dasharray:5.0,5.0;" x1="67"
x2="67" y1="36.4883" y2="584.6992"/><line style="stroke:#181818;stro [...]
\ No newline at end of file
diff --git a/modules/distribution-zones/tech-notes/images/flow.svg
b/modules/distribution-zones/tech-notes/images/flow.svg
index 06bb3d5a18..d846da8bed 100644
--- a/modules/distribution-zones/tech-notes/images/flow.svg
+++ b/modules/distribution-zones/tech-notes/images/flow.svg
@@ -1 +1 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?><svg
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"
contentStyleType="text/css" height="648px" preserveAspectRatio="none"
style="width:2216px;height:648px;background:#FFFFFF;" version="1.1" viewBox="0
0 2216 648" width="2216px" zoomAndPan="magnify"><defs/><g><text fill="#000000"
font-family="sans-serif" font-size="14" font-weight="bold"
lengthAdjust="spacing" textLength="258" x="978.5" y="28.5352">Genera [...]
\ No newline at end of file
+<?xml version="1.0" encoding="UTF-8" standalone="no"?><svg
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"
contentStyleType="text/css" height="607px" preserveAspectRatio="none"
style="width:2270px;height:607px;background:#FFFFFF;" version="1.1" viewBox="0
0 2270 607" width="2270px" zoomAndPan="magnify"><defs/><g><text fill="#000000"
font-family="sans-serif" font-size="14" font-weight="bold"
lengthAdjust="spacing" textLength="258" x="1005.5" y="28.5352">Gener [...]
\ No newline at end of file
diff --git a/modules/distribution-zones/tech-notes/rebalance.md
b/modules/distribution-zones/tech-notes/rebalance.md
index 3f33f8869f..f55cdf0b98 100644
--- a/modules/distribution-zones/tech-notes/rebalance.md
+++ b/modules/distribution-zones/tech-notes/rebalance.md
@@ -55,18 +55,62 @@ But for the new one we have an idea, which doesn't need the
metastore at all:
- On rebalanceDone/rebalanceError/leaderElected events the local event
listener sends a message to PrimaryReplica with the description of the event
- If PrimaryReplica is not available, we should retry sending until the leader
finds itself outdated (in this case, the new leader will send a leaderElected
event to PrimaryReplica and receive the rebalance request again).
-### 4. Stop the redundant replicas and update replicas clients
+### 4. Update the rebalance state after successful rebalance
+Within a single atomic metastore invoke we must update the keys according to
the following pseudo-code:
+```
+ metastoreInvoke: \\ atomic
+ zoneId.assignment.stable = newPeers
+ remove(zoneId.assignment.cancel)
+ if empty(zoneId.assignment.planned):
+ zoneId.assignment.pending = empty
+ else:
+ zoneId.assignment.pending = zoneId.assignment.planned
+ remove(zoneId.assignment.planned)
+```
+You can read about the `*.cancel` key
[below](#cancel-an-ongoing-rebalance-process-if-needed)
+
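The invoke above can be modeled in a few lines of Python (a sketch only: a plain dict stands in for the metastore, and the real metastore invoke performs all of these updates atomically):

```python
def finish_rebalance(metastore, zone, new_peers):
    """Model of the atomic post-rebalance invoke over a plain dict."""
    a = f"{zone}.assignment"
    # The just-finished rebalance target becomes the stable assignment.
    metastore[f"{a}.stable"] = new_peers
    # Any stale cancel intent is dropped.
    metastore.pop(f"{a}.cancel", None)
    # If another rebalance is already planned, promote it to pending;
    # otherwise the pending key is emptied (None models "empty").
    planned = metastore.pop(f"{a}.planned", None)
    metastore[f"{a}.pending"] = planned if planned else None
    return metastore
```

The key names mirror the pseudo-code; the atomicity guarantee itself is what the real metastore invoke provides and is not captured by this model.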
+### 5. Stop the redundant replicas and update replicas clients
Here we need to:
- Stop the redundant replicas, which are not in the current stable assignments
- We can accidentally stop PrimaryReplica, so we need to use the algorithm
of a graceful PrimaryReplica transfer, if needed.
- Update the replication protocol clients (RaftGroupService, for example) on
each Replica.
-### Failover logic
-The main idea of failover process: every rebalance request
PlacementDriver->PrimaryReplica or PrimaryReplica->ReplicationGroup must be
idempotent. So, redundant request in the worst case should be just answered by
positive answer (just like rebalance is already done).
+## Failover logic
+The main idea of the failover process: every rebalance request and cancel
rebalance request (PlacementDriver->PrimaryReplica or
PrimaryReplica->ReplicationGroup) must be idempotent. So, in the worst case a
redundant request is simply answered positively (as if the rebalance were
already done).
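The idempotency requirement can be illustrated with a toy handler (a sketch with assumed names, not Ignite code): a repeated request for an already-applied topology is answered positively instead of failing.

```python
def handle_rebalance_request(group_state, target_peers):
    """Idempotent toy handler: redundant requests are harmless."""
    if group_state["peers"] == list(target_peers):
        return "DONE"  # already rebalanced: answer positively
    # ... actual data movement would happen here ...
    group_state["peers"] = list(target_peers)
    return "DONE"
```

Sending the same request twice leaves the group state unchanged and both calls succeed, which is exactly what lets the failover logic below blindly resend requests.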
After that we can prepare the following logic:
-- On every new PD leader elected - it must check the direct value (not the
locally cached one) of `zoneId.assignment.pending` keys and send
RebalanceRequest to needed PrimaryReplicas and then listen updates from the
last revision.
-- On every PrimaryReplica reelection by PD it must send the new
RebalanceRequest to PrimaryReplica, if pending key is not empty.
-- On every leader reelection (for the leader oriented protocols) inside the
replication group - leader send leaderElected event to PrimaryReplica, which
force PrimaryReplica to send RebalanceRequest to the replication group leader
again.
+- On every new PD leader election, the leader must check the direct value
(not the locally cached one) of the
`zoneId.assignment.pending`/`zoneId.assignment.cancel` keys
(the latter always wins, if it exists), send
`RebalanceRequest`/`CancelRebalanceRequest` to the needed PrimaryReplicas, and
then listen for updates from the last revision of this key.
+- On every PrimaryReplica reelection by the PD, it must send a new
`RebalanceRequest`/`CancelRebalanceRequest` to the PrimaryReplica if the
pending/cancel key is not empty (cancel always wins, if filled).
+- On every leader reelection (for leader-oriented protocols) inside the
replication group, the leader sends a leaderElected event to PrimaryReplica,
which forces PrimaryReplica to send `RebalanceRequest`/`CancelRebalanceRequest`
to the replication group leader again.
+
+Moreover:
+- `RebalanceRequest`/`CancelRebalanceRequest` must include the revision of
its trigger.
+- PrimaryReplica must persist the last seen revision locally.
+- When a new PrimaryReplica is elected, PlacementDriver must initialize the
last seen revision of the PrimaryReplica to `currentRevision-1`. So, after
that, PlacementDriver must send the *Request with the current actual revision.
+- PrimaryReplica must skip any request whose revision < lastSeenRevision.
+These actions protect PrimaryReplica from processing messages from an
inactive PlacementDriver.
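The revision rules above can be sketched as a small Python model (assumed names, not Ignite APIs; persistence of the last seen revision is reduced to an instance field):

```python
class PrimaryReplicaModel:
    """Toy model of the revision guard on PrimaryReplica."""

    def __init__(self, last_seen_revision):
        # PlacementDriver initializes this to currentRevision - 1
        # when the replica is elected.
        self.last_seen_revision = last_seen_revision
        self.processed = []

    def on_request(self, request, revision):
        if revision < self.last_seen_revision:
            # Stale request, possibly from an inactive PlacementDriver.
            return False
        self.last_seen_revision = revision
        self.processed.append(request)
        return True
```

A request carrying a revision older than the last seen one is dropped, so an outdated PlacementDriver cannot roll the replica back to an earlier decision.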
+## Cancel an ongoing rebalance process if needed
+Sometimes we must cancel an ongoing rebalance:
+- We can receive an unrecoverable error from the replication group during the
current rebalance
+- We can decide to cancel it manually
+
+### 1. Put rebalance intent to *.cancel key
+To persist the cancel intent, we must save the (oldTopology, newTopology)
pair of peer lists to the `zoneId.assignment.cancel` key.
+Also, every invoke that updates the `*.cancel` key must be enriched with the
revision of the pending key that must be cancelled:
+```
+ if(zoneId.assignment.pending.revision == inputRevision):
+ zoneId.assignment.cancel = cancelValue
+ return true
+ else:
+ return false
+```
+This is needed to prevent a race between rebalance completion and cancel
persisting; otherwise, we could try to cancel the wrong rebalance process.
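The guarded invoke can be modeled like this (a sketch over a plain dict, not the real metastore; the pending key's revision is stored under an assumed `.revision` suffix for illustration):

```python
def put_cancel(metastore, zone, cancel_value, input_revision):
    """Write the cancel intent only if the pending key still has the
    revision the cancel was issued against."""
    pending_rev = metastore.get(f"{zone}.assignment.pending.revision")
    if pending_rev == input_revision:
        metastore[f"{zone}.assignment.cancel"] = cancel_value
        return True
    # Pending changed under us: this cancel targets a rebalance that is
    # already gone, so it must not be persisted.
    return False
```

If the pending assignment was replaced between reading it and issuing the cancel, the revision check fails and the stale cancel is rejected.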
+### 2. PrimaryReplica->ReplicationGroup cancel protocol
+When PrimaryReplica sends `CancelRebalanceRequest(oldTopology, newTopology)`
to the ReplicationGroup, the following cases are possible:
+- The replication group has an ongoing rebalance oldTopology->newTopology. It
must be cancelled, and a cleanup that rolls the replication group's
configuration state back to oldTopology must be executed.
+- The replication group has no ongoing rebalance and
currentTopology==oldTopology. So, there is nothing to cancel; return a success
response.
+- The replication group has no ongoing rebalance and
currentTopology==newTopology. So, the cancel request can't be executed; return
a response saying so. The final recipient of this response (the placement
driver) must log this fact and run the same routine as for a usual
rebalanceDone.
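The three cases can be captured in a small dispatch function (assumed names and a dict-based group model, not Ignite code):

```python
def handle_cancel(group, old_topology, new_topology):
    """Toy model of the ReplicationGroup side of CancelRebalanceRequest."""
    if group.get("ongoing") == (old_topology, new_topology):
        # Case 1: abort the rebalance and roll the configuration back.
        group["ongoing"] = None
        group["topology"] = old_topology
        return "CANCELLED"
    if group["topology"] == old_topology:
        # Case 2: the rebalance never took effect; nothing to cancel.
        return "NOTHING_TO_CANCEL"
    if group["topology"] == new_topology:
        # Case 3: the rebalance already finished; the placement driver
        # must log this and run the usual rebalanceDone routine.
        return "ALREADY_DONE"
    return "UNEXPECTED_TOPOLOGY"
```

The `"UNEXPECTED_TOPOLOGY"` branch is an addition of this sketch for totality; the document itself enumerates only the three cases above.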
diff --git a/modules/distribution-zones/tech-notes/src/cancelRebalance.puml
b/modules/distribution-zones/tech-notes/src/cancelRebalance.puml
new file mode 100644
index 0000000000..ace3e72a5c
--- /dev/null
+++ b/modules/distribution-zones/tech-notes/src/cancelRebalance.puml
@@ -0,0 +1,18 @@
+@startuml
+skinparam maxMessageSize 400
+
+PlacementDriver -> PrimaryReplica: Send RebalanceRequest
+PrimaryReplica -> ReplicationGroup : Start rebalance process
+ReplicationGroup -> PrimaryReplica : Send rebalanceError message
+PrimaryReplica -> PlacementDriver : Send rebalanceError message
+PlacementDriver -> PlacementDriver: Decides to give up on this rebalance
entirely
+PlacementDriver -> Metastore : Put (oldTopology, newTopology) to
**zoneId.assignment.cancel** key [see 1]
+PlacementDriver <- Metastore : Receives the notification about ***.cancel**
update
+PlacementDriver -> PrimaryReplica : Send CancelRebalanceRequest(oldTopology,
newTopology)
+PrimaryReplica -> ReplicationGroup : Send CancelRebalanceRequest(oldTopology,
newTopology) [see 2]
+ReplicationGroup -> ReplicationGroup : Process cancellation request and invoke
onCancelDone listener
+ReplicationGroup -> PrimaryReplica : Send CancelRebalanceDone message
+PrimaryReplica -> PlacementDriver : Send CancelRebalanceDone message
+PlacementDriver -> Metastore : Invoke the general rebalance metastore
multi-invoke to update ***.planned/*.pending/*.cancel** and set the current
stable topology to the same **zoneId.assignment.stable** (needed to force the
metastore update)
+DistributionZoneManager <- Metastore : Receives the
**zoneId.assignment.stable** update, stops unneeded replicas, prepares a new
rebalance if needed
+@enduml
diff --git a/modules/distribution-zones/tech-notes/src/flow.puml
b/modules/distribution-zones/tech-notes/src/flow.puml
index 6d85c22d06..64831d6d38 100644
--- a/modules/distribution-zones/tech-notes/src/flow.puml
+++ b/modules/distribution-zones/tech-notes/src/flow.puml
@@ -23,7 +23,6 @@ PlacementDriver -> PartitionPrimaryReplica : Send a
RebalanceRequest to PrimaryR
PartitionPrimaryReplica -> PartitionPrimaryReplica: Await for all replicas
start [see 2]
PartitionPrimaryReplica -> PartitionPrimaryReplica : Process replication group
update [see 3 and separate diagram 2]
PartitionPrimaryReplica -> PlacementDriver : Notify about rebalance done.
PlacementDriver updates its cache for the rebalanced group with the addresses
of the new members.
-PlacementDriver -> Metastore : Put the **zoneId.assignments.stable** key
-Metastore --> DistributionZoneManager : Receives the update about
**zoneId.assignments.stable** update and stop the unneeded replication group
members on the current node, if needed [see 4]
-DistributionZoneManager -> DistributionZoneManager : Check if
**zoneId.assignments.planned** key is empty and start new rebalance if not
+PlacementDriver -> Metastore : Update the
**zoneId.assignments.stable/planned/pending/cancel** keys [see 4]
+Metastore --> DistributionZoneManager : Receives the update about
**zoneId.assignments.stable** update and stop the unneeded replication group
members on the current node, if needed [see 5]
@enduml