kgusakov commented on code in PR #894:
URL: https://github.com/apache/ignite-3/pull/894#discussion_r904956328
##########
modules/table/tech-notes/rebalance.md:
##########
@@ -1,92 +1,80 @@
# How to read this doc
Every algorithm phase has the following main sections:
-- Trigger - how current phase will be invoked
-- Steps/Pseudocode - the main logical steps of the current phase
-- Result (optional, if pseudocode provided) - events and system state changes,
which this phase produces
+- Trigger – how current phase will be invoked
+- Steps/Pseudocode – the main logical steps of the current phase
+- Result (optional, if pseudocode provided) – events and system state changes,
which this phase produces
# Rebalance algorithm
## Short algorithm description
- Operations, which can trigger rebalance occurred:
- Write new baseline to metastore (effectively from 1 node in cluster)
-
- OR
-
Write new replicas configuration number to table config (effectively from
1 node)
OR
Write new partitions configuration number to table config (effectively
from 1 node)
- Write new assignments' intention to metastore (effectively from 1 node in
cluster)
-- Start new raft nodes. Initiate/update change peer request to raft group
(effectively from 1 node per partition)
+- Start new raft nodes. Initiate/update asynchronous change peer request to
raft group (effectively from 1 node per partition)
- Stop all redundant nodes. Change stable partition assignment to the new one
and finish rebalance process.
## New metastore keys
For further steps, we should introduce some new metastore keys:
-- `partition.assignments.stable` - the list of peers, which process operations
for partition at the current moment.
+- `partition.assignments.stable` - the list of peers, which process operations
for a partition at the current moment.
- `partition.assignments.pending` - the list of peers to which the current
rebalance moves the partition.
- `partition.assignments.planned` - the list of peers that will be used for the
next rebalance once the current one has finished.
Also, we will need the utility key:
-- `partition.assignments.change.trigger.revision` - the key, needed for
processing the event about assignments' update trigger only once.
+- `partition.change.trigger.revision` - the key, needed for processing the
event about assignments' update trigger only once.
## Operations, which can trigger rebalance
Three types of events can trigger the rebalance:
-- Change of baseline metastore key (1 for all tables for now, but maybe it
should be separate per table in future)
- Configuration change through
`org.apache.ignite.configuration.schemas.table.TableChange.changeReplicas`
produce metastore update event
- Configuration change through
`org.apache.ignite.configuration.schemas.table.TableChange.changePartitions`
produce metastore update event (IMPORTANT: this type of trigger has additional
difficulties because of cross raft group data migration and it is out of scope
of this document)
-**Result**: So, one of three metastore keys' changes will trigger rebalance:
+**Result**: So, one of two metastore keys' changes will trigger rebalance:
```
-<global>.baseline
<tableScope>.replicas
<tableScope>.partitions // out of scope
```
## Write new pending assignments (1)
**Trigger**:
-- Metastore event about change in `<global>.baseline`
-- Metastore event about changes in `<tableScope>.replicas`
+- Metastore event about changes in `<tableScope>.replicas` (See
`org.apache.ignite.internal.table.distributed.TableManager.onUpdateReplicas`)
**Pseudocode**:
-```
-onBaselineEvent:
- for table in tableCfg.tables():
- for partition in table.partitions:
- <inline metastoreInvoke>
-
+```
onReplicaNumberChange:
with table as event.table:
for partition in table.partitions:
<inline metastoreInvoke>
metastoreInvoke: // atomic metastore call through multi-invoke api
- if empty(partition.assignments.change.trigger.revision) ||
partition.assignments.change.trigger.revision < event.revision:
+ if empty(partition.change.trigger.revision) ||
partition.change.trigger.revision < event.revision:
if empty(partition.assignments.pending) &&
partition.assignments.stable != calcPartAssignments():
partition.assignments.pending = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else:
if partition.assignments.pending != calcPartAssignments():
partition.assignments.planned = calcPartAssignments()
- partition.assignments.change.trigger.revision = event.revision
+ partition.change.trigger.revision = event.revision
else:
remove(partition.assignments.planned)
else:
skip
```
## Start new raft nodes and initiate change peers (2)
-**Trigger**: Metastore event about new `partition.assignments.pending` received
+**Trigger**: Metastore event about new `partition.assignments.pending`
received (See corresponding listener for pending key in
`org.apache.ignite.internal.table.distributed.TableManager.registerRebalanceListeners`)
**Steps**:
- Start all new needed nodes `partition.assignments.pending /
partition.assignments.stable`
-- After successful starts - check if current node is the leader of raft group
(leader response must be updated by current term) and `changePeers(leaderTerm,
peers)`. `changePeers` from old terms must be skipped.
+- After successful starts - check if current node is the leader of raft group
(leader response must be updated by current term) and run
`RaftGroupService#changePeersAsync(leaderTerm, peers)`.
`RaftGroupService#changePeersAsync` from old terms must be skipped.
**Result**:
- New needed raft nodes started
- Change peers state initiated for every raft group
-## When changePeers done inside the raft group - stop all redundant nodes
-**Trigger**: When leader applied new Configuration with list of resulting
peers `<applied peer>`, it calls `onChangePeersCommitted(<applied peers>)`
+## When RaftGroupService#changePeersAsync done inside the raft group - update
stable key and stop all redundant nodes
Review Comment:
Could you add a few words about the assignments' configuration update and the
subsequent updates of the table's raft clients?
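
For reference, this is how I read the `metastoreInvoke` pseudocode from step (1) of the hunk above. It is only a sketch: `PartitionKeys` and `on_replicas_change` are hypothetical names, the metastore is replaced by an in-memory object, and the real call is a single atomic multi-invoke rather than plain assignments.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class PartitionKeys:
    """In-memory stand-in (hypothetical) for the per-partition metastore keys."""
    stable: FrozenSet[str]
    pending: Optional[FrozenSet[str]] = None
    planned: Optional[FrozenSet[str]] = None
    change_trigger_revision: Optional[int] = None


def on_replicas_change(keys: PartitionKeys, event_revision: int,
                       calculated: FrozenSet[str]) -> str:
    """Apply one trigger event, following the pseudocode branch by branch."""
    rev = keys.change_trigger_revision
    if rev is not None and rev >= event_revision:
        # The event with this revision (or a newer one) was already handled.
        return "skip"

    if keys.pending is None and keys.stable != calculated:
        # No rebalance in progress: start one.
        keys.pending = calculated
        keys.change_trigger_revision = event_revision
        return "pending"

    if keys.pending != calculated:
        # Otherwise queue the new assignments for the next rebalance.
        keys.planned = calculated
        keys.change_trigger_revision = event_revision
        return "planned"

    # The current rebalance already targets these assignments.
    keys.planned = None
    return "planned-removed"
```

With `stable = {A, B}`, an event at revision 5 that calculates `{A, B, C}` fills `pending`; a repeated delivery of revision 5 is skipped; a later event at revision 6 that calculates `{A}` goes to `planned` while the first rebalance is still running.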