[
https://issues.apache.org/jira/browse/HELIX-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630532#comment-15630532
]
ASF GitHub Bot commented on HELIX-400:
--------------------------------------
GitHub user mkscrg opened a pull request:
https://github.com/apache/helix/pull/58
helix-core: AutoRebalancer should include only numbered states in
`currentMapping`
`AutoRebalancer` constructs a `currentMapping`
(`Map<PartitionId, Map<ParticipantId, State>>`), which it passes to
`AutoRebalanceStrategy#computePartitionAssignment()`. `AutoRebalanceStrategy`
(`ARS`) uses the mapping to sort the live nodes by the number of partitions
they hold.
In `helix-0.6.x`, `currentMapping` includes _all states_, including "null"
states like `DROPPED` or `OFFLINE`. This breaks `ARS`'s node sorting, causing
it to incorrectly move partitions when nodes restart after disconnecting.
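A minimal sketch of the filtering idea, not the code in this pull request: it keeps only replicas whose state is in a given set of numbered states. Plain `String` keys stand in for `PartitionId`, `ParticipantId`, and `State`, and the class name, method name, and the `numberedStates` parameter are hypothetical. Treating `ONLINE` as the only numbered state of the OnlineOffline model is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of the Helix API.
final class CurrentMappingFilter {

    /**
     * Returns a copy of currentMapping that keeps only the replicas whose
     * state is contained in numberedStates. Partition keys are preserved
     * even when all of their replicas are filtered out.
     */
    static Map<String, Map<String, String>> keepNumberedStates(
            Map<String, Map<String, String>> currentMapping,
            Set<String> numberedStates) {
        Map<String, Map<String, String>> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> partition : currentMapping.entrySet()) {
            Map<String, String> replicas = new LinkedHashMap<>();
            for (Map.Entry<String, String> replica : partition.getValue().entrySet()) {
                if (numberedStates.contains(replica.getValue())) {
                    replicas.put(replica.getKey(), replica.getValue());
                }
            }
            filtered.put(partition.getKey(), replicas);
        }
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, String> replicas = new LinkedHashMap<>();
        replicas.put("NODE_0", "OFFLINE");
        replicas.put("NODE_1", "ONLINE");
        Map<String, Map<String, String>> currentMapping = new LinkedHashMap<>();
        currentMapping.put("P", replicas);

        // Assumption: ONLINE is the only numbered state in this model.
        System.out.println(keepNumberedStates(currentMapping, Set.of("ONLINE")));
        // prints {P={NODE_1=ONLINE}}
    }
}
```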
`helix-0.7.x` does not have this issue. The regression was introduced in
`helix-0.6.x` between `0.6.2-incubating` and `0.6.3`, by this commit:
> [[HELIX-400] Remove all references to the old full auto rebalancing code](https://github.com/apache/helix/commit/8d99778a30d10f529ee0757286efa84ea581b5bf)
See also
- the recent port of [HELIX-543] (#56) to `helix-0.6.x`, which was intended to
avoid unnecessary partition movement. That port was ineffective because of
this issue.
- [mailing list](http://mail-archives.apache.org/mod_mbox/helix-user/201610.mbox/%3CCAC56g41ejjcSi1P-Ohp3esyGqemBgFoji2Gy8tZQnJMo156OpA%40mail.gmail.com%3E) thread for more background
### Example
Consider this scenario:
```
OnlineOffline state model
2 nodes "NODE_0" and "NODE_1"
1 resource "P" w/ 1 replica, 1 partition
----------
rebalance
> currentMapping: `{P: {NODE_0: ONLINE}}`
stop NODE_0
> currentMapping: `{P: {NODE_1: ONLINE}}`
start NODE_0
> currentMapping: `{P: {NODE_0: OFFLINE, NODE_1: ONLINE}}`
```
`ARS#computePartitionAssignment()` sorts the live nodes by the number of
partitions they hold according to `currentMapping`, breaking ties by node
name, then reassigns partitions based on that order. After `NODE_0` restarts,
its `OFFLINE` entry is counted as a held partition, so both nodes appear to
hold one partition and the tie is broken by name. The sort is therefore
`[NODE_0, NODE_1]`, and the `ONLINE` partition is incorrectly moved back to
`NODE_0`.
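The effect on the sort can be reproduced with a small standalone sketch. This is not the `ARS` code: the class and method names are hypothetical, plain `String` values stand in for the Helix ID types, and the descending order by partition count is an assumption inferred from the behavior described above. Only the name tie-break is taken directly from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the node sort; not the Helix implementation.
final class NodeSortSketch {

    /** Counts one held partition per (partition, node) entry, whatever its state. */
    static Map<String, Integer> countPerNode(Map<String, Map<String, String>> currentMapping) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> replicas : currentMapping.values()) {
            for (String node : replicas.keySet()) {
                counts.merge(node, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Sorts live nodes by held-partition count, then by node name.
     * Assumption: higher counts sort first.
     */
    static List<String> sortLiveNodes(List<String> liveNodes,
                                      Map<String, Map<String, String>> currentMapping) {
        Map<String, Integer> counts = countPerNode(currentMapping);
        List<String> sorted = new ArrayList<>(liveNodes);
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(counts.getOrDefault(b, 0), counts.getOrDefault(a, 0));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> liveNodes = List.of("NODE_0", "NODE_1");

        // Mapping as built in helix-0.6.x after NODE_0 restarts: OFFLINE is included.
        Map<String, Map<String, String>> allStates =
                Map.of("P", Map.of("NODE_0", "OFFLINE", "NODE_1", "ONLINE"));
        // Mapping with the OFFLINE entry left out.
        Map<String, Map<String, String>> numberedOnly =
                Map.of("P", Map.of("NODE_1", "ONLINE"));

        System.out.println(sortLiveNodes(liveNodes, allStates));    // prints [NODE_0, NODE_1]
        System.out.println(sortLiveNodes(liveNodes, numberedOnly)); // prints [NODE_1, NODE_0]
    }
}
```

With the `OFFLINE` entry present, both nodes count one partition and the name tie-break puts `NODE_0` first. Without it, `NODE_1` is the only node holding a partition and sorts first under the assumed ordering.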
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mkscrg/helix rebalance-numbered-states-only
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/helix/pull/58.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #58
----
commit 131e67bd7d98ae18eb4bbe0356cdd3a088f12c18
Author: Mike Craig <[email protected]>
Date: 2016-11-02T20:22:11Z
helix-core: AutoRebalancer should include only numbered states in
`currentMapping`
----
> 0.6.x still calls the old rebalancing algorithm for no reason
> -------------------------------------------------------------
>
> Key: HELIX-400
> URL: https://issues.apache.org/jira/browse/HELIX-400
> Project: Apache Helix
> Issue Type: Sub-task
> Reporter: Kanak Biscuitwala
> Assignee: Kanak Biscuitwala
> Fix For: 0.6.3
>
>
> After calling the new algorithm, the old algorithm is called. Typically this
> is a no-op, except in the case of disabled partitions, where it might do the
> wrong thing. In any case, this shouldn't exist.