This is an automated email from the ASF dual-hosted git repository.
jiajunwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/helix.wiki.git
The following commit(s) were added to refs/heads/master by this push:
new 0195fcd Created Weight-Aware Globally-Even Distribute Rebalancer
Design Proposal (markdown)
0195fcd is described below
commit 0195fcdfd326dca72c589e41562260a0a153b3e5
Author: jiajunwang <[email protected]>
AuthorDate: Thu Jul 11 12:23:29 2019 -0700
Created Weight-Aware Globally-Even Distribute Rebalancer Design Proposal
(markdown)
---
...y-Even-Distribute-Rebalancer-Design-Proposal.md | 215 +++++++++++++++++++++
1 file changed, 215 insertions(+)
diff --git
a/Weight-Aware-Globally-Even-Distribute-Rebalancer-Design-Proposal.md
b/Weight-Aware-Globally-Even-Distribute-Rebalancer-Design-Proposal.md
new file mode 100644
index 0000000..27869c8
--- /dev/null
+++ b/Weight-Aware-Globally-Even-Distribute-Rebalancer-Design-Proposal.md
@@ -0,0 +1,215 @@
+# Introduction
+We propose a new Weight-Aware Globally-Even Distribute rebalancer (the WAGED rebalancer) to better meet applications' requirements. Compared with the existing CRUSH-based Full-Auto rebalancers, the WAGED rebalancer will have the following major improvements.
+
+* Partition weight-aware
+
+The WAGED rebalancer calculates the partition assignment according to each partition's resource usage.
+Helix users can configure multi-dimensional weights that describe a partition's resource usage. In addition, they set up the corresponding instance capacities.
+
+* Globally-even distribution
+
+The WAGED rebalancer optimizes the partition assignment in terms of the total partition weight across all the instances. This means all the partitions (replicas) in different Helix resources are rebalanced together. Since these partitions share the same physical resources, the rebalancer should assign partitions based on a global resource usage view.
+
+* The trade-off between partition movements and even distribution
+
+The WAGED rebalancer minimizes unnecessary partition movements. In addition, it allows Helix users to specify a preference between evenness and partition movement. In short, the rebalancer can enforce a strictly even distribution at the cost of more partition movements, or it can reduce partition movements at the cost of a less even distribution. In a production environment, a stateless system would prefer evenness since its partition movement cost is low. On the other hand, [...]
+Moreover, the new rebalancer will support cluster reconfiguration gracefully and minimize the migration effort: the WAGED rebalancer remembers the rebalance state and adjusts the assignment incrementally.
+
+* Flexibility
+
+The WAGED rebalancer is extensible. It can be configured or extended to fit different rebalance requirements, and its interfaces will be more generic. Developers can even extend the rebalance algorithm without modifying the rebalancer's main logic.
+
+# Proposed Design
+
+## Objectives
+
+### Partition Weight-Aware Globally-Even Distribute
+
+As the name suggests, "Weight-Aware" and "Globally-Even" are the most important goals for the new rebalancer.
+Firstly, partition weight-awareness enables the rebalancer to understand the real system workload. The partition weight should be customizable and multi-dimensional. Helix users need to specify a per-partition weight based on the expected system resource usage, such as CPU, memory, and storage usage. In addition, corresponding to the weights, the users are required to specify the instance capacities. As an example, the weight/capacity configurations would look as shown below.
+
+> Capacity / weight example
+> Instance Capacity:
+> - CPU: 1000
+> - MEMORY: 5000
+> - STORAGE: 500
+>
+> DB Partition Weight:
+> - CPU: 50
+> - MEMORY: 100
+> - STORAGE: 10
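+
+For instance, with the example numbers above, a single instance could host at most min(1000/50, 5000/100, 500/10) = 20 such partitions, with CPU as the limiting dimension. The following is a minimal sketch of the per-dimension capacity check; the class and method names here are illustrative, not the actual rebalancer API.
+
+> import java.util.Map;
+>
+> public class CapacityCheckSketch {
+>   // A partition fits on an instance only if every weight dimension
+>   // (CPU, MEMORY, STORAGE, ...) fits into the remaining capacity.
+>   static boolean canFit(Map<String, Integer> remainingCapacity,
+>       Map<String, Integer> partitionWeight) {
+>     for (Map.Entry<String, Integer> dim : partitionWeight.entrySet()) {
+>       if (remainingCapacity.getOrDefault(dim.getKey(), 0) < dim.getValue()) {
+>         return false; // this dimension overflows, reject the allocation
+>       }
+>     }
+>     return true;
+>   }
+> }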
+
+As for globally-even distribution, we expect all the instances in the cluster to have a uniform workload distribution. Global evenness becomes even more important when the rebalancer calculates based on multi-dimensional usage. For example, if two Helix resources have very different usage requirements, the rebalancer can co-locate their partitions so that the overall system utilization is improved.
+
+### Avoid Extra Partition Movements
+
+We want a sticky and consistent partition assignment. To Helix, this means minimal management cost. However, allowing strictly no movement is not optimal either. First of all, the partitions on deactivated instances have to be moved out. Secondly, the rebalancer might need to move some partitions to keep the distribution even.
+
+To be more accurate, the goal is to minimize the EXTRA partition movements. By "extra", we mean the partition movements that happen between two unchanged and load-balanced instances. In theory, these movements are not necessary.
+
+Note the difference between extra movements and optional movements. For example, once a resource is removed, should the rebalancer trigger partition movements to fill the instances that become idle? These partition movements are optional, but they might benefit the even distribution of the workload. The WAGED rebalancer will avoid extra partition movements, but it will allow optional movements if evenness is required.
+
+### Backward Compatibility
+
+It is very important that the WAGED rebalancer keeps the same assumptions and capabilities as the existing rebalancers, for example, rack awareness, delayed rebalancing, partition movement throttling, etc. If any of these features were missing, applications might need to change their logic for the migration.
+
+Moreover, besides the rebalance logic, the WAGED rebalancer should calculate the result as fast as the existing ones. If an emergency happens, the rebalancer needs to calculate the new mapping within 100ms or less. Otherwise, there is not enough time to proceed with the Best Possible assignment and send state transition messages. Considering that the WAGED rebalancer needs to process more information, our goal is not to shorten the calculation time but to keep it the same as it is now.
+
+Last but not least, the WAGED rebalancer should be easy to use and fully integrated with the Helix Full-Auto rebalance mode. It should require only minor changes to user application logic, and the additional steps required to activate the new rebalancer should only happen once, during the initial configuration.
+
+## Design Overview
+
+We propose to implement the WAGED rebalancer based on the constraint-based rebalance algorithm. The basic idea of the constraint-based algorithm is to evaluate each possible partition allocation against a set of constraints. Then, the partition is assigned to the instance that has the highest evaluated score.
+One of the reasons to choose the constraint-based rebalance algorithm is that it calculates fast. The time complexity is O(number_constraints * number_total_replicas * number_instances). For example, 10 constraints, 10,000 replicas in total, and 100 instances require about 10 million constraint evaluations. This helps Helix rebalance fast.
+In addition, the constraint-based algorithm provides a flexible and extendable framework. For example, to make the rebalancer aware of partition weight, we just need to add a new constraint about partition weight.
+
+However, the constraint-based rebalance algorithm alone is not enough. Our current rebalancer's workflow has several limitations.
+For example, the pipeline triggers a rebalance for every single resource repeatedly and separately. Although the input of the rebalance call contains the complete Cluster Data Cache, the cache object does not track the ongoing partition assignment. So it is not feasible to do a globally-even rebalance. As another example, the rebalance pipeline is executed on any type of cluster change event. Many of those events, such as a Current State change, should not impact the assignment. However, since th [...]
+Because of these concerns, we propose to implement a new rebalancer instead of
developing a new Full-Auto rebalance strategy. The Helix event processing
pipeline will also be refactored accordingly.
+
+## Quick Tutorial
+
+The following is the minimal set of steps to activate the WAGED rebalancer.
+
+1. Configuring the capacity keys in the Cluster Config. For example, DISK_USAGE_GB, etc. Note that Helix does not understand the meaning of these capacity keys; they are only used to match the instance capacity against the partition weight.
+1. Configuring the instance capacity in the Instance Config. For example, DISK_USAGE_GB = 1000.
+1. Configuring the partition weight in the Resource Config. For example, DISK_USAGE_GB = 20.
+1. Modifying the Helix resource IdealState to use the WAGED rebalancer class name.
+
+For example,
+
+> {
+> "id" : "DBName_1",
+> "simpleFields" : {
+> "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
+> "MIN_ACTIVE_REPLICAS" : "2",
+> "NUM_PARTITIONS" : "1024",
+> "REBALANCER_CLASS_NAME" :
"org.apache.helix.controller.rebalancer.WagedRebalancer",
+> "REBALANCE_MODE" : "FULL_AUTO",
+> "REPLICAS" : "3",
+> "STATE_MODEL_DEF_REF" : "MasterSlave"
+> },
+> "mapFields" : {},
+> "listFields" : {}
+> }
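+
+As a rough illustration of the configuration steps above, the capacity and weight settings might be applied programmatically as sketched below. The capacity-related setter names are placeholders for this proposal, not a finalized API; only ConfigAccessor itself is an existing Helix class, and zkAddress, clusterName, and instanceName are assumed variables.
+
+> import java.util.Arrays;
+> import java.util.Collections;
+>
+> ConfigAccessor configAccessor = new ConfigAccessor(zkAddress);
+>
+> // Step 1: declare the capacity keys in the Cluster Config.
+> ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
+> clusterConfig.setInstanceCapacityKeys(Arrays.asList("DISK_USAGE_GB"));
+> configAccessor.setClusterConfig(clusterName, clusterConfig);
+>
+> // Step 2: set the instance capacity in the Instance Config.
+> InstanceConfig instanceConfig = configAccessor.getInstanceConfig(clusterName, instanceName);
+> instanceConfig.setInstanceCapacityMap(Collections.singletonMap("DISK_USAGE_GB", 1000));
+> configAccessor.setInstanceConfig(clusterName, instanceName, instanceConfig);
+>
+> // Step 3: the partition weight (e.g. DISK_USAGE_GB = 20) would be set on the
+> // Resource Config in the same manner.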
+
+## Workflow
+For a high-level understanding of how the WAGED rebalancer works, please check the following workflow diagram. Note that this workflow combines the functionality of multiple components.
+
+
+
+## Architecture
+
+
+
+### Rebalance Coordinator
+The Rebalance Coordinator controls the workflow of the rebalance process. It depends on multiple independent modules to calculate the Best Possible assignment. As shown in the architecture diagram, after the BestPossibleState Calculation Stage makes a rebalance call, the coordinator initializes and calls the related components.
+
+**Rebalance Workflow**
+
+The following flowchart demonstrates how the Rebalance Coordinator conducts a rebalance process.
+As shown in the diagram, there are two different branches that both calculate a partition assignment. This design achieves a fast rebalance speed while ensuring that the assignment is eventually optimized.
+
+
+
+**Global Baseline Calculation**
+
+Calculates a globally optimized partition assignment. The Global Baseline Calculation does not consider any temporary status, such as participants being offline or disabled, so it reduces the randomness of the partition assignment. The calculation result is called the Baseline assignment. It is used as the anchor of the partition allocation.
+
+The Global Baseline Calculation is only triggered when a substantial permanent cluster change happens, for example, when the cluster topology is changed or new resources are added. For each calculation, the algorithm will re-assign all the partitions. Please note that this does not imply that all the existing partitions will be shuffled. The Global Baseline Calculation takes the previous Baseline assignment as an input. The intention is to give the algorithm a chance to re-allocate the partitions if [...]
+
+We have discussed leveraging a more advanced rebalance algorithm, for example, machine learning technology, for the Baseline calculation. In that case, the Baseline calculation is likely to be delayed. The WAGED rebalancer should then rely on the Partial Rebalance to calculate an intermediate result before the Baseline is ready. The rebalancer can always update the Best Possible assignment to the optimal one once the Baseline calculation is done.
+Overall, the main obstacle is the advanced rebalance algorithm itself. Once we have a good candidate, it will be easy to plug it into the workflow.
+
+The Baseline is not directly propagated to the final output. It is consumed by
the Partial Rebalance as an important parameter.
+
+**Partial Rebalance**
+
+Calculates the Best Possible assignment based on the Baseline and the previous Best Possible assignment.
+
+The Partial Rebalance is triggered on all the substantial cluster changes, which include cluster topology changes, resource config changes, instance state changes, and Baseline assignment changes. For other trivial system changes, such as a Current State change, the Best Possible assignment should be kept the same, so there is no need to run the rebalance algorithm; the rebalancer directly returns the previously calculated result. We propose to leverage the constraint-based rebalance algori [...]
+
+As the name suggests, the Partial Rebalance is done within a certain rebalance scope. The coordinator compares the previous Best Possible assignment with the current cluster state so as to derive a minimal rebalance scope. In short, the rebalance scope only contains the following two types of partitions (a sketch of the scope derivation follows this list).
+
+* Partitions whose current assignment has become invalid.
+* Partitions for which the Baseline contains new assignments that do not exist in the current assignment.
+
+Note that given that multiple changes can happen during the rebalance interval, the rebalancer will merge the corresponding rebalance scopes and finish the calculation in one rebalance process.
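+
+A minimal sketch of the scope derivation, assuming illustrative types (PartitionAssignment, ClusterState) and an assumed helper isStillValid; this is not the actual implementation.
+
+> import java.util.HashSet;
+> import java.util.Map;
+> import java.util.Set;
+>
+> Set<String> getRebalanceScope(Map<String, PartitionAssignment> bestPossibleAssignment,
+>     Map<String, PartitionAssignment> baselineAssignment, ClusterState currentClusterState) {
+>   Set<String> rebalanceScope = new HashSet<>();
+>   // Type 1: partitions whose current assignment has become invalid,
+>   // e.g. a replica placed on an instance that is now inactive.
+>   for (Map.Entry<String, PartitionAssignment> entry : bestPossibleAssignment.entrySet()) {
+>     if (!isStillValid(entry.getValue(), currentClusterState)) {
+>       rebalanceScope.add(entry.getKey());
+>     }
+>   }
+>   // Type 2: partitions the Baseline assigns but the current assignment misses.
+>   for (String partition : baselineAssignment.keySet()) {
+>     if (!bestPossibleAssignment.containsKey(partition)) {
+>       rebalanceScope.add(partition);
+>     }
+>   }
+>   return rebalanceScope;
+> }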
+
+**About Persisting Assignments**
+
+One concern about persisting the assignments is the additional latency and the extra ZK throughput. However, considering that persisting is only required when a new Best Possible assignment has been calculated, the extra cost should be minor. Moreover, we plan to further optimize the persisting mechanism with compression.
+
+### Cluster Change Detector
+
+The Cluster Change Detector is responsible for determining what has changed. The result is used to choose the correct rebalance approach. In general, the detector keeps a snapshot of the previous Cluster Data Cache and compares the old snapshot with the new one to find the difference. The comparison is done based on ZooKeeper node versions, as demonstrated in the following diagram. A content-based comparison might be necessary for some frequently modified ZNodes.
+
+
+**Input**
+
+The Cluster Data Cache. Optionally, the paths of interest can also be provided when initializing the detector.
+
+**Output**
+
+The ZNode paths that have been updated since the previous detect call.
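+
+A minimal sketch of the version-based comparison, assuming each snapshot is a map from ZNode path to its ZooKeeper version (the names here are illustrative):
+
+> import java.util.HashSet;
+> import java.util.Map;
+> import java.util.Set;
+>
+> Set<String> findChangedPaths(Map<String, Integer> oldSnapshot,
+>     Map<String, Integer> newSnapshot) {
+>   Set<String> changedPaths = new HashSet<>();
+>   // A path counts as changed if it is new or its version was bumped.
+>   for (Map.Entry<String, Integer> entry : newSnapshot.entrySet()) {
+>     Integer oldVersion = oldSnapshot.get(entry.getKey());
+>     if (oldVersion == null || !oldVersion.equals(entry.getValue())) {
+>       changedPaths.add(entry.getKey());
+>     }
+>   }
+>   // A path that disappeared since the old snapshot is also a change.
+>   for (String path : oldSnapshot.keySet()) {
+>     if (!newSnapshot.containsKey(path)) {
+>       changedPaths.add(path);
+>     }
+>   }
+>   return changedPaths;
+> }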
+
+Note that during the very first rebalance after a Controller acquires leadership, the detector won't have an old snapshot, so the rebalancer will always trigger a global rebalance. One potential problem is: will a controller leadership switch cause large-scale partition shuffling? To prevent this, we need the Baseline assignment and the previous Best Possible assignment to be persisted in ZooKeeper. Moreover, the constraint-based rebalance algorithm only moves partitions whene [...]
+
+### Cluster Data Provider
+
+The Cluster Data Provider generates a Cluster Model based on the Cluster Data
Cache.
+The main reason we cannot use the cache directly is that we need a runtime
object to track the pending assignment changes. Moreover, most of the
information in the Cluster Data Cache is irrelevant to the rebalancer.
+Besides the necessary cluster status information, the Cluster Model contains additional transient states for optimizing the algorithm. For example, it keeps track of the partitions in each fault zone, so when the rebalance algorithm searches for potential fault zone conflicts, it does not need to iterate over all the instances for the partition lists. In addition, even the cluster information in the Cluster Model might be altered according to the rebalancer's requirements. For instance, the del [...]
+
+**Input**
+
+Cluster Data Cache.
+
+**Output**
+
+A Cluster Model, which contains the following objects (a rough structural sketch follows the list).
+
+* **Assignable Node**
+
+Each active instance will be recorded as an assignable node. An assignable
node contains information such as the instance Domain and capacity.
+
+* **Assignable Replica**
+
+Each replica of a partition is considered an assignable object. The replica object contains the weight of the partition. We assume all the replicas of one partition have the same weight. The weight fields should match the capacity fields of the Assignable Node.
+Note that unlike the existing rebalancers, the WAGED rebalancer generates the partition assignment with the states already assigned.
+
+* **Cluster Context**
+
+Besides the node and replica information, we also need to record some global information, for example, the per-fault-zone partition lists. All these global states go into the Cluster Context.
+
+* **Assignment History**
+
+This includes the Baseline assignment and the previous Best Possible
assignment.
+Note that unlike the other records, these two states are not available from the Cluster Data Cache. The rebalancer uses the Assignment Metadata Datastore to access the assignments.
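+
+Putting the above objects together, the Cluster Model could be shaped roughly as follows. All class and field names below are illustrative sketches for this proposal, not a final design.
+
+> import java.util.List;
+> import java.util.Map;
+>
+> public class ClusterModel {
+>   static class AssignableNode {
+>     String instanceName;
+>     String faultZone;                        // derived from the instance Domain
+>     Map<String, Integer> remainingCapacity;  // capacity key -> remaining units
+>   }
+>   static class AssignableReplica {
+>     String partitionName;
+>     String state;                            // the WAGED rebalancer assigns states too
+>     Map<String, Integer> weight;             // keys match the node capacity keys
+>   }
+>   static class ClusterContext {
+>     // Global transient states, e.g. the per-fault-zone partition lists.
+>     Map<String, List<String>> faultZoneToPartitions;
+>   }
+>   List<AssignableNode> assignableNodes;
+>   List<AssignableReplica> assignableReplicas;
+>   ClusterContext context;
+>   // Assignment history, loaded from the Assignment Metadata Datastore.
+>   Map<String, Map<String, String>> baselineAssignment;
+>   Map<String, Map<String, String>> bestPossibleAssignment;
+> }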
+
+### Assignment Metadata Datastore
+
+Conceptually, the Assignment Metadata Datastore is a write-through cache that persists the Baseline assignment and the Best Possible assignment. For a large cluster, the persisted data size might be large, so we need to evaluate the performance impact and optimize the datastore.
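+
+A minimal sketch of the write-through behavior combined with the compression idea mentioned above, assuming an underlying ZooKeeper client and hypothetical fields (cachedBaseline, zkClient, baselinePath):
+
+> import java.io.ByteArrayOutputStream;
+> import java.io.IOException;
+> import java.util.zip.GZIPOutputStream;
+>
+> // Write-through: update the in-memory copy first, then persist a compressed
+> // record so reads never need to hit ZooKeeper on the hot path.
+> void persistBaseline(byte[] serializedBaseline) throws IOException {
+>   cachedBaseline = serializedBaseline;
+>   ByteArrayOutputStream out = new ByteArrayOutputStream();
+>   try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
+>     gzip.write(serializedBaseline); // compression cuts the ZK payload size
+>   }
+>   zkClient.writeData(baselinePath, out.toByteArray());
+> }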
+
+### Rebalance Algorithm Adapter
+
+The Rebalance Coordinator calls the rebalance algorithm through a generic interface. Our goal is to decouple the rebalancer from the algorithm details. Based on the interface, the adapter helps the rebalancer use different algorithms in different scenarios.
+
+**Input of the rebalance interface**
+
+The Cluster Model.
+
+**Output of the rebalance interface**
+
+The partition assignment.
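+
+The generic interface could be as small as the following sketch. The names are illustrative, and the ClusterModel is the one sketched earlier in this proposal.
+
+> import java.util.Map;
+>
+> public interface RebalanceAlgorithm {
+>   // Input: the Cluster Model.
+>   // Output: the partition assignment, e.g. resource -> partition -> instance/state.
+>   Map<String, Map<String, String>> calculate(ClusterModel clusterModel);
+> }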
+
+### Constraint-based Rebalance Algorithm
+The constraint-based rebalance algorithm is a greedy algorithm. The basic idea is to search all the possible assignments for a good enough one by using a set of constraints.
+
+**What is a Constraint?**
+
+Hard Constraint - Evaluates a partition allocation and returns YES or NO. The hard constraints are used as filters. Any proposal that fails one or more hard constraints will be rejected.
+
+Soft Constraint - Evaluates a partition allocation and returns a score within a normalized range.
+
+Constraint Importance Factor (CIF) - The rebalance algorithm aggregates the soft constraint results according to their CIFs. A CIF is basically the weight of a constraint; we call it CIF so as to avoid confusion between the constraint weight and the partition weight. Note that CIF is not applicable to the hard constraints, because their results are binary (YES or NO).
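+
+Putting these definitions together, a sketch of the greedy selection for one replica could look like the following. The constraint interfaces and the model types follow the earlier illustrative Cluster Model sketch and are not a final design.
+
+> import java.util.List;
+> import java.util.Map;
+>
+> interface HardConstraint {
+>   boolean isSatisfied(AssignableNode node, AssignableReplica replica);
+> }
+> interface SoftConstraint {
+>   double score(AssignableNode node, AssignableReplica replica); // normalized range
+> }
+>
+> AssignableNode selectNode(AssignableReplica replica, List<AssignableNode> nodes,
+>     List<HardConstraint> hardConstraints, Map<SoftConstraint, Double> cifBySoftConstraint) {
+>   AssignableNode bestNode = null;
+>   double bestScore = -1;
+>   for (AssignableNode node : nodes) {
+>     // Hard constraints are filters: any failure rejects this allocation.
+>     boolean rejected = false;
+>     for (HardConstraint hard : hardConstraints) {
+>       if (!hard.isSatisfied(node, replica)) {
+>         rejected = true;
+>         break;
+>       }
+>     }
+>     if (rejected) {
+>       continue;
+>     }
+>     // Aggregate the soft constraint scores, weighted by their CIFs.
+>     double score = 0;
+>     for (Map.Entry<SoftConstraint, Double> entry : cifBySoftConstraint.entrySet()) {
+>       score += entry.getValue() * entry.getKey().score(node, replica);
+>     }
+>     if (score > bestScore) {
+>       bestScore = score;
+>       bestNode = node;
+>     }
+>   }
+>   return bestNode; // null if no node passes all hard constraints
+> }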