This is an automated email from the ASF dual-hosted git repository.

jiajunwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/helix.wiki.git


The following commit(s) were added to refs/heads/master by this push:
     new 0195fcd  Created Weight-Aware Globally-Even Distribute Rebalancer 
Design Proposal (markdown)
0195fcd is described below

commit 0195fcdfd326dca72c589e41562260a0a153b3e5
Author: jiajunwang <[email protected]>
AuthorDate: Thu Jul 11 12:23:29 2019 -0700

    Created Weight-Aware Globally-Even Distribute Rebalancer Design Proposal 
(markdown)
---
 ...y-Even-Distribute-Rebalancer-Design-Proposal.md | 215 +++++++++++++++++++++
 1 file changed, 215 insertions(+)

diff --git 
a/Weight-Aware-Globally-Even-Distribute-Rebalancer-Design-Proposal.md 
b/Weight-Aware-Globally-Even-Distribute-Rebalancer-Design-Proposal.md
new file mode 100644
index 0000000..27869c8
--- /dev/null
+++ b/Weight-Aware-Globally-Even-Distribute-Rebalancer-Design-Proposal.md
@@ -0,0 +1,215 @@
+# Introduction
+We propose a new weight-aware globally-even distribute rebalancer (the WAGED rebalancer) to better meet application requirements. Compared with the existing CRUSH-based Full-Auto rebalancers, the WAGED rebalancer will have the following major improvements.
+
+* Partition weight-aware
+
+The WAGED rebalancer calculates the partition assignment according to each partition's resource usage.
+Helix users can configure multi-dimensional weights for a partition's resource usage. In addition, they will set up the corresponding instance capacities.
+
+* Globally-even distribution
+
+The WAGED rebalancer optimizes the partition assignment in terms of the total partition weight across all the instances. This means all the partitions (replicas) in different Helix resources are rebalanced together. Since these partitions share the same physical resources, the rebalancer should assign partitions based on a global resource usage view.
+
+* The trade-off between partition movements and even distribution
+
+The WAGED rebalancer minimizes unnecessary partition movements. In addition, it allows Helix users to specify a preference between evenness and partition movement. In short, the rebalancer can enforce strictly even distribution at the cost of more partition movements, or it can reduce partition movements at the cost of a less even distribution. In a production environment, a stateless system would prefer evenness, since its partition movement cost is very low. On the other hand, [...]
+Moreover, the new rebalancer will support cluster reconfiguration gracefully. It minimizes the effort required for migration. The WAGED rebalancer remembers the rebalance state and adjusts the assignment incrementally.
+
+* Flexibility
+
+The WAGED rebalancer is extensible. It can be configured or extended to fit different rebalance requirements. The interfaces of the WAGED rebalancer will be more generic. Developers can even extend the rebalance algorithm without modifying the rebalancer's main logic.
+
+# Proposed Design
+
+## Objectives
+
+### Partition Weight-Aware Globally-Even Distribute
+
+As the name suggests, "Weight-Aware" and "Globally-Even" are the most important goals for the new rebalancer.
+Firstly, partition weight-awareness enables the rebalancer to understand the real system workload. The partition weight should be customizable and multi-dimensional. Helix users need to specify per-partition weights based on the expected system resource usage, such as CPU, memory, or storage usage. In addition, corresponding to the weights, the users are required to specify the instance capacity. The weight/capacity configurations may look like the example below.
+
+> Capacity / weight example
+> Instance Capacity:
+>   - CPU: 1000
+>   - MEMORY: 5000
+>   - STORAGE: 500
+>
+> DB Partition Weight:
+>   - CPU: 50
+>   - MEMORY: 100
+>   - STORAGE: 10
+
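+To illustrate how these multi-dimensional configurations interact, here is a minimal Python sketch (not Helix code; the dimension names simply mirror the example above and the helper functions are hypothetical). A partition fits on an instance only if its weight fits the remaining capacity on every dimension:
+
```python
# Minimal sketch: multi-dimensional capacity/weight matching.
instance_capacity = {"CPU": 1000, "MEMORY": 5000, "STORAGE": 500}
partition_weight = {"CPU": 50, "MEMORY": 100, "STORAGE": 10}

def fits(remaining_capacity, weight):
    """True only if the weight fits the remaining capacity on every dimension."""
    return all(remaining_capacity[key] >= weight[key] for key in weight)

def assign(remaining_capacity, weight):
    """Deduct the partition weight from the instance's remaining capacity."""
    for key, value in weight.items():
        remaining_capacity[key] -= value

# How many such partitions can one instance hold?
remaining = dict(instance_capacity)
count = 0
while fits(remaining, partition_weight):
    assign(remaining, partition_weight)
    count += 1
print(count)  # CPU is the bottleneck dimension: 1000 // 50 = 20 partitions
```
+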
+As for globally-even distribution, we expect all the instances in the cluster to have a uniform workload distribution. Global evenness is increasingly important when the rebalancer calculates based on multi-dimensional usage. For example, if two Helix resources have very different usage requirements, the rebalancer can co-locate their partitions so that the overall system utilization is improved.
+
+### Avoid Extra Partition Movements
+
+We want a sticky and consistent partition assignment. To Helix, this means minimal management cost. However, strictly forbidding movement is not optimal either. First of all, the partitions on deactivated instances have to be moved out. Secondly, the rebalancer might need to move some partitions to keep the distribution even.
+
+To be more accurate, the goal is to minimize EXTRA partition movements. By "extra", we mean partition movements between two unchanged and load-balanced instances. In theory, these movements are not necessary.
+
+Note the difference between an extra movement and an optional movement. For example, once a resource is removed, should the rebalancer trigger partition movements to fill the instances that become idle? These partition movements are optional but might benefit the even distribution of the workload. The WAGED rebalancer will avoid extra partition movements, but it will allow optional movements if evenness is required.
+
+### Backward Compatibility
+
+It is very important that the WAGED rebalancer keeps the same assumptions and capabilities as the existing rebalancers, for example rack-awareness, delayed rebalancing, partition movement throttling, etc. If any of these features are missing, applications might need to change their logic in order to migrate.
+
+Moreover, besides the rebalance logic, the WAGED rebalancer should calculate the result as fast as the existing ones. If an emergency happens, the rebalancer needs to calculate the new mapping within 100ms or less. Otherwise, there is not enough time to proceed with the Best Possible assignment and send state transition messages. Considering that the WAGED rebalancer needs to process more information, our goal is not to shorten the calculation time but to keep it the same as it is now.
+
+Last but not least, the WAGED rebalancer should be easy to use and fully integrated with the Helix Full-Auto rebalance mode. There should be only minor user application logic changes, and the additional steps required to activate the new rebalancer happen only once, during the initial configuration.
+
+## Design Overview
+
+We propose to implement the WAGED rebalancer based on a constraint-based rebalance algorithm. The basic idea of the constraint-based algorithm is to evaluate each possible partition allocation against a set of constraints. Then, the partition is assigned to the instance with the highest evaluated score.
+One of the reasons to choose the constraint-based rebalance algorithm is that it calculates fast. The time complexity is O(number_constraint * number_total_replication * number_instance). It helps Helix rebalance fast.
+In addition, the constraint-based algorithm has a flexible and extensible framework. For example, to make the rebalancer aware of partition weight, we just need to add a new constraint about partition weight.
+
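+The core loop of such an algorithm can be sketched as follows (a simplified Python illustration under our own assumptions, not the actual Helix implementation). Evaluating every constraint for every replica/instance pair is what yields the complexity bound above:
+
```python
# Sketch of the constraint-based greedy loop. Every replica is evaluated
# against every instance by every constraint, hence
# O(number_constraint * number_total_replication * number_instance).
def rebalance(replicas, instances, hard_constraints, soft_constraints):
    assignment = {}
    for replica in replicas:
        # Hard constraints filter out invalid candidate instances.
        candidates = [
            node for node in instances
            if all(ok(node, replica, assignment) for ok in hard_constraints)
        ]
        if not candidates:
            continue  # no valid allocation exists for this replica
        # Soft constraints score the candidates; pick the highest score.
        assignment[replica] = max(
            candidates,
            key=lambda node: sum(score(node, replica, assignment)
                                 for score in soft_constraints),
        )
    return assignment

# Example: a trivial always-true hard constraint plus one soft constraint
# that prefers the instance currently holding fewer replicas.
def less_loaded(node, replica, assignment):
    return -list(assignment.values()).count(node)

result = rebalance(["p0", "p1", "p2", "p3"], ["host1", "host2"],
                   [lambda node, replica, assignment: True], [less_loaded])
```
+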
+However, the constraint-based rebalance algorithm alone is not enough. Our current rebalancer's workflow has several limitations.
+For example, the pipeline triggers a rebalance for every single resource repeatedly and separately. Although the input of the rebalance call contains the complete Cluster Data Cache, the cache object does not track the ongoing partition assignment. So it is not feasible to do a globally-even rebalance. As another example, the rebalance pipeline is executed on any type of cluster change event. Many of those events, such as a Current State change, should not impact the assignment. However, since th [...]
+Because of these concerns, we propose to implement a new rebalancer instead of 
developing a new Full-Auto rebalance strategy. The Helix event processing 
pipeline will also be refactored accordingly.
+
+## Quick Tutorial
+
+The minimal set of steps to activate the WAGED rebalancer is as follows.
+
+1. Configuring the capacity keys in the Cluster Config. For example, DISK_USAGE_GB. Note that Helix does not understand the meaning of these capacity keys; they are only used to match the instance capacity against the partition weight.
+1. Configuring the instance capacity in the Instance Config. For example, DISK_USAGE_GB = 1000.
+1. Configuring the partition weight in the Resource Config. For example, DISK_USAGE_GB = 20.
+1. Modifying the Helix resource IdealStates to use the WAGED rebalancer classname.
+
+For example,
+
+> {
+>   "id" : "DBName_1",
+>   "simpleFields" : {
+>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
+>     "MIN_ACTIVE_REPLICAS" : "2",
+>     "NUM_PARTITIONS" : "1024",
+>     "REBALANCER_CLASS_NAME" : 
"org.apache.helix.controller.rebalancer.WagedRebalancer",
+>     "REBALANCE_MODE" : "FULL_AUTO",
+>     "REPLICAS" : "3",
+>     "STATE_MODEL_DEF_REF" : "MasterSlave"
+>   },
+>   "mapFields" : {},
+>   "listFields" : {}
+> }
+
+## Workflow
+To understand at a high level how the WAGED rebalancer works, please check the following workflow diagram. Note that this workflow combines multiple components' functionality.
+
+![](https://github.com/jiajunwang/Helix-design-pics/blob/master/Core%20algorithm.png?raw=true)
+
+## Architecture
+
+![](https://github.com/jiajunwang/Helix-design-pics/blob/master/Arch.png?raw=true)
+
+### Rebalance Coordinator
+The Rebalance Coordinator controls the workflow of the rebalance process. It depends on multiple independent modules to calculate the Best Possible assignment. As shown in the architecture diagram, after the BestPossibleState Calculation Stage makes a rebalance call, the coordinator initializes and calls the related components.
+
+**Rebalance Workflow**
+
+The following flowchart demonstrates how the Rebalance Coordinator conducts a 
rebalance process.
+As shown in the diagram, we have two different branches, both calculating a partition assignment. This special design achieves a fast rebalance speed while ensuring that the assignment is eventually optimized.
+
+![](https://github.com/jiajunwang/Helix-design-pics/blob/master/Coordinator.png?raw=true)
+ 
+**Global Baseline Calculation**
+
+Calculates a globally optimized partition assignment. The Global Baseline Calculation does not consider any temporary status, such as participants being offline or disabled, so it reduces the randomness of the partition assignment. The calculation result is named the Baseline assignment. It is used as the anchor of partition allocation.
+
+The Global Baseline Calculation is only triggered when a substantial permanent cluster change happens, for example, when the cluster topology is changed or new resources are added. For each calculation, the algorithm will re-assign all the partitions. Please note that this does not imply that all the existing partitions will be shuffled. The Global Baseline Calculation takes the previous Baseline assignment as an input. The intention is to give the algorithm a chance to re-allocate the partitions if [...]
+
+We have discussed leveraging a more advanced rebalance algorithm for the Baseline calculation, for example, machine learning technology. In that case, the Baseline calculation is very likely to be delayed. The WAGED rebalancer should then rely on the Partial Rebalance to calculate an intermediate result before the Baseline is ready. The rebalancer can always update the Best Possible assignment to the optimal one once the Baseline calculation is done.
+Overall, the main obstacle is the advanced rebalance algorithm itself. Once we have a good candidate, it would be easy to plug it into the workflow.
+
+The Baseline is not directly propagated to the final output. It is consumed by 
the Partial Rebalance as an important parameter.
+
+**Partial Rebalance**
+
+Calculates the Best Possible assignment output based on the Baseline and the previous Best Possible assignment.
+
+The Partial Rebalance is triggered on all the substantial cluster changes, which include cluster topology changes, resource config changes, instance state changes, and Baseline assignment changes. For the other trivial system changes, such as a Current State change, the Best Possible assignment should stay the same, so there is no need to run the rebalance algorithm. The rebalancer directly returns the previously calculated result. We propose to leverage the constraint-based rebalance algori [...]
+
+As the name suggests, the Partial Rebalance is done within a certain rebalance scope. The coordinator compares the previous Best Possible assignment with the current cluster state so as to derive a minimal rebalance scope. In short, the rebalance scope only contains the following two types of partitions.
+
+* The partition's current assignment becomes invalid.
+* The Baseline contains some new partition assignments that do not exist in the current assignment.
+
+Note that given that multiple changes can happen during the rebalance interval, the rebalancer will merge the corresponding rebalance scopes and finish the calculation in one rebalance process.
+
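+The scope derivation can be sketched as follows (a minimal Python illustration; the assignment maps and the active-instance set are hypothetical data shapes, not Helix classes):
+
```python
# Sketch: derive the minimal rebalance scope by comparing the previous
# Best Possible assignment with the current cluster state and the Baseline.
def rebalance_scope(prev_best_possible, baseline, active_instances):
    scope = set()
    # Type 1: the partition's current assignment has become invalid,
    # e.g. it is placed on an instance that is no longer active.
    for partition, instance in prev_best_possible.items():
        if instance not in active_instances:
            scope.add(partition)
    # Type 2: the Baseline contains assignments that do not exist in the
    # current assignment, e.g. newly added partitions.
    for partition in baseline:
        if partition not in prev_best_possible:
            scope.add(partition)
    return scope

prev = {"p0": "host1", "p1": "host2"}
base = {"p0": "host1", "p1": "host1", "p2": "host3"}
# host2 is gone, so p1 must move; p2 is new in the Baseline.
print(sorted(rebalance_scope(prev, base, {"host1", "host3"})))  # ['p1', 'p2']
```
+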
+**About Persisting Assignments**
+
+One concern about persisting the assignment is the additional latency and the extra ZK throughput. However, considering that persisting is only required when a new Best Possible assignment has been calculated, the extra cost should be minor. Moreover, we plan to further optimize the persisting mechanism with compression.
+
+### Cluster Change Detector
+
+The Cluster Change Detector is responsible for determining what has been changed. The result is used to choose the correct rebalance approach. In general, the detector keeps a snapshot of the previous Cluster Data Cache. Then it compares the old snapshot with the new one to find the difference. The comparison is done based on Zookeeper node versions, as demonstrated in the following diagram. A content-based comparison might be necessary for some frequently modified ZNodes.
+
+ 
+**Input**
+
+The Cluster Data Cache. Optionally, the paths of interest can also be input when initializing the detector.
+
+**Output**
+
+The ZNode paths that have been updated after the previous detect call.
+
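+The version-based detection described above can be sketched like this (Python; the path-to-version maps are hypothetical stand-ins for two Cluster Data Cache snapshots):
+
```python
# Sketch: detect changed ZNode paths by comparing Zookeeper node versions
# between the previous snapshot and the new one.
def detect_changes(old_snapshot, new_snapshot):
    """Return the paths that were added, removed, or updated."""
    changed = set()
    for path, version in new_snapshot.items():
        if old_snapshot.get(path) != version:
            changed.add(path)  # a new path, or a bumped version
    changed.update(set(old_snapshot) - set(new_snapshot))  # removed paths
    return changed

old = {"/cluster/CONFIGS/host1": 3, "/cluster/IDEALSTATES/db1": 7}
new = {"/cluster/CONFIGS/host1": 4, "/cluster/IDEALSTATES/db2": 1}
# host1's config was updated, db1 was removed, db2 was added.
print(sorted(detect_changes(old, new)))
```
+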
+Note that during the very first rebalance after a Controller acquires leadership, the detector won't have an old snapshot, so the rebalancer will always trigger a global rebalance. One potential problem is whether a controller leadership switch could cause large-scale partition shuffling. To prevent this, we need the Baseline assignment and the previous Best Possible assignment to be persisted in Zookeeper. Moreover, the constraint-based rebalance algorithm only moves partitions whene [...]
+
+### Cluster Data Provider
+
+The Cluster Data Provider generates a Cluster Model based on the Cluster Data Cache.
+The main reason we cannot use the cache directly is that we need a runtime object to track the pending assignment changes. Moreover, most of the information in the Cluster Data Cache is irrelevant to the rebalancer.
+Besides the necessary cluster status information, the Cluster Model contains additional transient states for optimizing the algorithm. For example, it keeps track of the partitions in each fault zone, so when the rebalance algorithm searches for a potential fault zone conflict, it does not need to iterate over all the instances for the partition lists. In addition, even the cluster information in the Cluster Model might be altered according to the rebalancer's requirements. For instance, the del [...]
+
+**Input**
+
+Cluster Data Cache.
+
+**Output**
+
+A Cluster Model which contains the following objects.
+
+* **Assignable Node**
+
+Each active instance will be recorded as an assignable node. An assignable 
node contains information such as the instance Domain and capacity.
+
+* **Assignable Replica**
+
+Each replica of a partition is considered an assignable object. The replica object contains the weight of the partition. We assume all the replicas of one partition have the same weight. The weight fields should match the capacity fields of the Assignable Node.
+Note that unlike the existing rebalancers, the WAGED rebalancer generates the partition assignment with the state already assigned.
+
+* **Cluster Context**
+
+Besides the node and replica information, we also need to record some global information, for example, per-fault zone partition lists. All these global states go into the Cluster Context.
+
+* **Assignment History**
+
+This includes the Baseline assignment and the previous Best Possible assignment.
+Note that unlike the other records, these two states are not available from the Cluster Data Cache. The rebalancer uses the Assignment Metadata Datastore to access the assignments.
+
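+Put together, the Cluster Model output could be sketched as a set of plain data holders like the following (Python dataclasses used purely for illustration; all class and field names are assumptions, not the actual Helix classes):
+
```python
from dataclasses import dataclass, field

@dataclass
class AssignableNode:
    """An active instance that replicas can be assigned to."""
    instance_name: str
    fault_zone: str           # derived from the instance Domain
    capacity: dict            # e.g. {"CPU": 1000, "MEMORY": 5000}

@dataclass
class AssignableReplica:
    """One replica of a partition; all replicas share the same weight."""
    resource: str
    partition: str
    state: str                # the assignment is generated state-aware
    weight: dict              # keys must match the node capacity keys

@dataclass
class ClusterContext:
    """Global transient states, e.g. per-fault zone partition lists."""
    partitions_by_fault_zone: dict = field(default_factory=dict)

@dataclass
class ClusterModel:
    nodes: list               # AssignableNode entries
    replicas: list            # AssignableReplica entries
    context: ClusterContext
    baseline: dict            # assignment history, loaded from the
    prev_best_possible: dict  # Assignment Metadata Datastore
```
+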
+### Assignment Metadata Datastore
+
+Conceptually, the Assignment Metadata Datastore is a write-through cache that persists the Baseline assignment and the Best Possible assignment. For a large cluster, the persisted data size might be large. We need to evaluate the performance impact and optimize the datastore accordingly.
+
+### Rebalance Algorithm Adapter
+
+The Rebalance Coordinator calls the rebalance algorithm through a generic interface. Our goal is to decouple the rebalancer from the algorithm details. Based on the interface, the adapter helps the rebalancer use different algorithms in different scenarios.
+
+**Input of the rebalance interface**
+
+The Cluster Model.
+
+**Output of the rebalance interface**
+
+The partition assignment.
+
+### Constraint-based Rebalance Algorithm
+The constraint-based rebalance algorithm is a greedy algorithm. The basic idea is to search all the possible assignments for a good enough one by using a set of constraints.
+
+**What is a Constraint?**
+
+Hard Constraint - Evaluates a partition allocation and returns YES or NO. The hard constraints are used as filters. Any proposal that fails one or more hard constraints will be rejected.
+
+Soft Constraint - Evaluates a partition allocation and returns a score within a normalized range.
+
+Constraint Importance Factor (CIF) - The rebalance algorithm aggregates the soft constraint results according to their CIFs. A CIF is basically the weight of a constraint; we rename it CIF to avoid confusion between the constraint weight and the partition weight. Note that CIF is not applicable to the hard constraints, because their results are binary (YES or NO).
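+To make the interplay between the constraint types and the CIF concrete, here is a minimal Python sketch (the constraint classes, the score normalization, and all values are illustrative assumptions, not the actual Helix constraints):
+
```python
# Sketch: hard constraints act as filters; soft constraint scores are
# normalized to [0, 1] and aggregated using each constraint's CIF.
class FaultZoneConflict:
    """Hard constraint: reject placing two replicas in the same fault zone."""
    def is_valid(self, node, replica, zones_used):
        return node["zone"] not in zones_used

class LeastUsedCapacity:
    """Soft constraint: prefer instances with more remaining capacity."""
    def score(self, node, replica):
        return node["remaining"] / node["capacity"]  # already within [0, 1]

def evaluate(node, replica, zones_used, hard, soft_with_cif):
    """Return None if any hard constraint rejects, else the CIF-weighted score."""
    if not all(c.is_valid(node, replica, zones_used) for c in hard):
        return None                                   # rejected outright
    total_cif = sum(cif for _, cif in soft_with_cif)
    return sum(c.score(node, replica) * cif
               for c, cif in soft_with_cif) / total_cif

node_a = {"zone": "z1", "remaining": 800, "capacity": 1000}
node_b = {"zone": "z2", "remaining": 200, "capacity": 1000}
hard = [FaultZoneConflict()]
soft = [(LeastUsedCapacity(), 5)]                     # (constraint, CIF)
print(evaluate(node_a, "p0", {"z2"}, hard, soft))     # 0.8
print(evaluate(node_b, "p0", {"z2"}, hard, soft))     # None: fault zone conflict
```
+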
