sanpwc commented on code in PR #1446: URL: https://github.com/apache/ignite-3/pull/1446#discussion_r1048637665
########## modules/table/tech-notes/rebalance_redesign.md:
##########
@@ -0,0 +1,87 @@
+# Introduction
+TBD
+
+# Rebalance triggers
+- Table configuration changes: number of partitions, number of replicas
+- Logical topology changes: nodes appearing/disappearing
+- Distribution zone changes

Review Comment:
   It's a bit confusing, because some of the aforementioned items are already included in "distribution zone changes". I'd rephrase it a bit, in a similar manner to:

   # Rebalance triggers
   - Table configuration changes: number of replicas.
   - Data nodes changes triggered by logical topology events and distribution zone filter updates.
########## modules/table/tech-notes/rebalance_redesign.md:
##########
@@ -0,0 +1,87 @@
+# Introduction
+TBD
+
+# Rebalance triggers
+- Table configuration changes: number of partitions, number of replicas
+- Logical topology changes: nodes appearing/disappearing
+- Distribution zone changes
+- In future: uneven data node load distribution (CPU, IO, memory)
+
+Different rebalance causes can produce concurrent rebalance requests, so we must have a way to linearize them and prepare an optimal plan for the received triggers.
+
+Before diving into the abyss, let's review a simple example to understand the main issues in the rebalance scope.
+
+## Example of triggers' races
+
+Let's take:
+- Topology with 4 nodes A, B, C, D
+- Table t1 with 1 partition p1 on (A, B) with replica factor 2
+
+And imagine:
+- t1 configuration is updated with replica factor 3 and a rebalance for p1 is started: (A, B) -> (A, B, C)
+- While the rebalance is in progress, node C leaves the logical topology, so the rebalance fails and is retried again and again with the old task (A, B) -> (A, B, C), while the new available configuration should be (A, B) -> (A, B, D). So, which component should be aware of this type of failure and produce a new rebalance task according to the new knowledge? In general, this component can cancel all outdated rebalance tasks that include the node that left. And the sooner it does so, the better the chances to stabilize the cluster before further node failures, for example.
+
+It looks like we need a single component which prepares rebalance schedules and updates them when needed. This coordinator must have enough power and vision to calculate optimal cross-partition and even cross-table plans.
+
+To be honest, it could be not a single component but a group of distributed algorithms around a general distributed storage (like the metastorage). But the complexity of this approach grows exponentially with more and more rebalance triggers. The current rebalance design shows us that even for 1 simple trigger (replica number changes) + cancellation + in-memory cluster needs, it is not easy to prepare a simple, correct mechanism.
+
+# Rebalance component
+According to the previous chapter, we need a single (but distributed) component (let's name it RebalanceCoordinator) to coordinate the optimal schedule for all rebalance triggers received at the moment.
+
+In general, different approaches are available:
+- Create a new RAFT group and use its leader as the RebalanceCoordinator
+- Use one of the existing groups: the metastorage group or the CMG group
+
+**Subtotals: The rebalance coordinator must be a single per-cluster component, but distributed across some nodes.**
+
+## Choose the right one
+- A separate RAFT group is an option, but it looks like it will bring additional cluster configuration complexity: Which nodes should be in this topology? How will users configure it, and how do we explain this?
+- Among the existing groups, the CMG group looks suitable for this case both by its name semantics and by its current functions:
+  - It has a separate state machine, which we can use for storing the rebalance schedule
+  - It already has requirements for proper user configuration
+  - We can use it even for rebalancing metastorage nodes
+
+**Subtotals: RebalanceCoordinator can be a part of the current CMG leader.**
+
+## Data flow between RebalanceCoordinator and partition groups
+First, we need to create the simplest possible data flow between the RebalanceCoordinator and the partition groups to handle the issues connected with:
+- Partition rebalance failures
+- Dropping the rebalance process after leader re-election inside a partition group (if the rebalance process was in STAGE_CATCHING_UP)
+
+The main idea of this flow: new rebalance tasks go only from the RebalanceCoordinator to the partition groups. At the same time, partition groups report failures/completions and the current progress (optional) to the RebalanceCoordinator.
+
+In a pseudo event-based OOP-like language:
+```
+RebalanceCoordinator:
+  onNewTrigger:
+    calculateNewPlan()
+    pushPartitionPlans()
+
+  onNewPartitionLeaderElectedReceived:
+    pushCurrentPartitionTask()
+
+  onErrorReceived:

Review Comment:
   What about CMG leader unavailability during onError and similar events?
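For illustration, the flow from the quoted pseudocode could look roughly like the Java sketch below. All names here (`RebalanceCoordinatorSketch`, `PartitionTask`, `PartitionGroupClient`, `PlanCalculator`) are hypothetical and are not taken from the Ignite codebase; the sketch only shows the direction of the flow, with tasks pushed down from the coordinator and leader-change/error events reported up by the partition groups.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative-only skeleton of the coordinator described by the quoted pseudocode. */
public class RebalanceCoordinatorSketch {
    /** Hypothetical per-partition task: the peer set the partition should end up on. */
    public record PartitionTask(String partitionId, Set<String> targetPeers) {}

    /** Hypothetical transport towards the partition RAFT groups. */
    public interface PartitionGroupClient {
        void pushTask(PartitionTask task);
    }

    /** Hypothetical planner fed by the triggers (topology, zones, table configuration). */
    public interface PlanCalculator {
        List<PartitionTask> calculateNewPlan();
    }

    private final Map<String, PartitionTask> currentPlan = new ConcurrentHashMap<>();

    private final PartitionGroupClient client;

    private final PlanCalculator planCalculator;

    public RebalanceCoordinatorSketch(PartitionGroupClient client, PlanCalculator planCalculator) {
        this.client = client;
        this.planCalculator = planCalculator;
    }

    /** onNewTrigger: recalculate the plan and push one task per partition. */
    public void onNewTrigger() {
        for (PartitionTask task : planCalculator.calculateNewPlan()) {
            currentPlan.put(task.partitionId(), task);
            client.pushTask(task);
        }
    }

    /** onNewPartitionLeaderElectedReceived: re-push the current task to the freshly elected leader. */
    public void onPartitionLeaderElected(String partitionId) {
        PartitionTask task = currentPlan.get(partitionId);
        if (task != null) {
            client.pushTask(task);
        }
    }

    /** onErrorReceived: recalculate against the latest cluster state and push the updated plan,
     * which supersedes tasks that still reference a node that has left. */
    public void onRebalanceError(String partitionId) {
        onNewTrigger();
    }
}
```

Note that a plan kept only in the leader's memory, as in this sketch, would be lost on a CMG leader change; per the "Choose the right one" section, the schedule would rather live in the CMG state machine, which is also relevant to the review question above.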
########## modules/table/tech-notes/rebalance_redesign.md:
##########
@@ -0,0 +1,87 @@
+RebalanceCoordinator:
+  onNewTrigger:
+    calculateNewPlan()
+    pushPartitionPlans()

Review Comment:
   What do you mean by pushing partition plans?
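For illustration only, one possible shape of such a plan is a single target peer set per partition, derived from the current assignments, the data nodes, and the replica factor. The sketch below uses hypothetical names and a naive placement rule; it is not the actual Ignite 3 placement logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/**
 * Hypothetical illustration of what a "partition plan" could contain and how it could be
 * derived from a trigger (replica factor change, data nodes change).
 */
public final class PartitionPlanExample {
    /** One entry of a plan: a partition and the peer set it should end up on. */
    public record PartitionAssignment(int partitionId, Set<String> targetNodes) {}

    /**
     * Naive deterministic placement: keep the current owners that are still data nodes,
     * then fill up to the replica factor from the remaining data nodes in lexicographic order.
     */
    public static Map<Integer, PartitionAssignment> calculatePlan(
            Map<Integer, Set<String>> currentAssignments,
            Set<String> dataNodes,
            int replicaFactor) {
        Map<Integer, PartitionAssignment> plan = new TreeMap<>();

        for (Map.Entry<Integer, Set<String>> entry : currentAssignments.entrySet()) {
            Set<String> target = new LinkedHashSet<>();

            // Keep surviving owners first to minimize data movement.
            for (String node : entry.getValue()) {
                if (dataNodes.contains(node) && target.size() < replicaFactor) {
                    target.add(node);
                }
            }

            // Top up from the remaining data nodes.
            List<String> candidates = new ArrayList<>(dataNodes);
            candidates.sort(String::compareTo);
            for (String node : candidates) {
                if (target.size() >= replicaFactor) {
                    break;
                }
                target.add(node);
            }

            plan.put(entry.getKey(), new PartitionAssignment(entry.getKey(), target));
        }

        return plan;
    }
}
```

With the document's example (data nodes {A, B, C, D}, replica factor 2 -> 3), p1 on (A, B) becomes (A, B, C); after C leaves and the data nodes shrink to {A, B, D}, recalculating yields (A, B, D), exactly the corrected task from the races example. "Pushing" the plan would then mean delivering each per-partition entry to its partition group, or writing it to whatever shared storage the groups watch.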
########## modules/table/tech-notes/rebalance_redesign.md:
##########
@@ -0,0 +1,87 @@
+# Rebalance triggers
+- Table configuration changes: number of partitions, number of replicas

Review Comment:
   Well, I don't think that we will support partition number changes, because that requires split and merge logic.

########## modules/table/tech-notes/rebalance_redesign.md:
##########
@@ -0,0 +1,87 @@
+- Among the existing groups, the CMG group looks suitable for this case both by its name semantics and by its current functions:
+  - It has a separate state machine, which we can use for storing the rebalance schedule
+  - It already has requirements for proper user configuration
+  - We can use it even for rebalancing metastorage nodes
+
+**Subtotals: RebalanceCoordinator can be a part of the current CMG leader.**

Review Comment:
   How will the CMG group see distribution zone events built on top of the metastorage?
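Purely as a sketch of one possible wiring, and with every type below being hypothetical rather than a real Ignite 3 API, a coordinator hosted on the CMG leader could subscribe to updates of the zone data nodes kept in the metastorage and turn them into rebalance triggers:

```java
import java.util.Set;
import java.util.function.Consumer;

/**
 * Hypothetical wiring only: how a coordinator hosted on the CMG leader could observe
 * distribution zone changes that live in the metastorage. None of these types correspond
 * to real Ignite 3 APIs.
 */
public class ZoneEventsToTriggersSketch {
    /** Illustrative event: the recalculated data nodes of a distribution zone. */
    public record ZoneDataNodesEvent(int zoneId, Set<String> dataNodes) {}

    /** Illustrative facade over metastorage watches on the zone data-nodes keys. */
    public interface MetastorageWatches {
        void onZoneDataNodesChanged(Consumer<ZoneDataNodesEvent> listener);
    }

    /** Illustrative entry point of the rebalance coordinator. */
    public interface TriggerSink {
        void onNewTrigger(ZoneDataNodesEvent event);
    }

    /**
     * Installs the subscription. In this sketch it would be (re)installed whenever the local
     * node becomes the CMG leader, so that only the current leader converts metastorage
     * updates into rebalance triggers.
     */
    public static void wire(MetastorageWatches watches, TriggerSink coordinator) {
        watches.onZoneDataNodesChanged(coordinator::onNewTrigger);
    }
}
```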
