Tom Widmer created HELIX-543:
--------------------------------
Summary: Single partition unnecessarily moved
Key: HELIX-543
URL: https://issues.apache.org/jira/browse/HELIX-543
Project: Apache Helix
Issue Type: Bug
Components: helix-core
Affects Versions: 0.6.4, 0.7.1
Reporter: Tom Widmer
Priority: Minor
(Copied from mailing list)
I have some resources that I use with the OnlineOffline state but which only
have a single partition at the moment (essentially, Helix is just giving me a
simple leader election to decide who controls the resource - I don’t care which
participant has it, as long as only one does). However, with full auto
rebalance, I find that the ‘first’ instance (alphabetically I think) always
gets the resource when it’s up. So if I take down the first node so the
partition transfers to the 2nd node, then bring back up the 1st node, the
resource transfers back unnecessarily.
Note that this issue also affects multi-partition resources; it’s just a bit
less noticeable. With 3 nodes and 4 partitions, say, the partitions are always
allocated 2, 1, 1. So if you have node 1 down, and hence 0, 2, 2, and then bring
node 1 back up, it unnecessarily moves 2 partitions to make 2, 1, 1, rather than
making the minimum move needed to achieve ‘balance’, which would be to move 1
partition from instance 2 or 3 back to instance 1.
I can see the code in question in
AutoRebalanceStrategy.typedComputePartitionAssignment, where the distRemainder
is allocated to the first nodes alphabetically, so that the capacity of all
nodes is not equal.
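To illustrate the behaviour described above, here is a minimal Java sketch. It is not the Helix code itself: the class and method names are made up for this example, and it only models the capacity calculation (an even share per node, plus the remainder handed to the alphabetically first nodes).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual AutoRebalanceStrategy code.
final class AlphabeticalCapacitySketch {

    // Returns the number of partitions each live node is allowed to hold.
    static Map<String, Integer> capacities(List<String> liveNodes, int numPartitions) {
        Map<String, Integer> result = new LinkedHashMap<>();
        if (liveNodes.isEmpty()) {
            return result; // nothing to assign to
        }

        // Nodes are considered in alphabetical order.
        List<String> sorted = new ArrayList<>(liveNodes);
        Collections.sort(sorted);

        int base = numPartitions / sorted.size();          // even share
        int distRemainder = numPartitions % sorted.size(); // leftover partitions

        // The first distRemainder nodes each get one extra slot.
        for (int i = 0; i < sorted.size(); i++) {
            int extra = i < distRemainder ? 1 : 0;
            result.put(sorted.get(i), base + extra);
        }
        return result;
    }
}
```

With nodes node1, node2, node3 and 4 partitions this yields capacities of 2, 1, 1 regardless of where the partitions currently live, and with a single partition the one slot always goes to node1.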
The proposed solution is to sort the nodes by the number of partitions they
already have assigned, most first, which should mean that the nodes already
holding the most partitions are assigned the higher capacity and the problem
goes away.
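A rough Java sketch of that idea follows. Again, this is an assumption about how the fix could look rather than actual Helix code: the names are hypothetical, and the alphabetical tie-break is only there to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual Helix fix.
final class LoadAwareCapacitySketch {

    // currentCounts maps each live node to the number of partitions it holds now.
    static Map<String, Integer> capacities(Map<String, Integer> currentCounts, int numPartitions) {
        Map<String, Integer> result = new LinkedHashMap<>();
        if (currentCounts.isEmpty()) {
            return result; // nothing to assign to
        }

        // Order nodes by current partition count, most first; break ties by name.
        List<String> nodes = new ArrayList<>(currentCounts.keySet());
        nodes.sort((a, b) -> {
            int byCount = Integer.compare(currentCounts.get(b), currentCounts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        int base = numPartitions / nodes.size();
        int distRemainder = numPartitions % nodes.size();

        // The extra slots now go to the nodes that already hold the most partitions.
        for (int i = 0; i < nodes.size(); i++) {
            int extra = i < distRemainder ? 1 : 0;
            result.put(nodes.get(i), base + extra);
        }
        return result;
    }
}
```

For the example above (node1 has just come back with 0 partitions, node2 and node3 hold 2 each), the capacities come out as node2 = 2, node3 = 1, node1 = 1, so only one partition needs to move. In the single-partition case, the node currently holding the partition keeps the only slot, so nothing moves.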