[ https://issues.apache.org/jira/browse/HBASE-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599021#action_12599021 ]
Bryan Duxbury commented on HBASE-615:
-------------------------------------
There are two ways this could go. Either we fix it by widening the margin by
which a server has to be "over" the load average before we start rebalancing
region assignments, or we redo this code to be more sensitive to the load as
it is computed on the assignment side of things.
I'll do the first no matter what as a test because it will be very simple.
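To make the first option concrete, here is a minimal sketch of a widened margin. It is not the HBase balancer code: the function name `is_overloaded`, the `slop` parameter, and the use of plain region counts as "load" are all assumptions made for illustration. The region counts are the ones from the report quoted below.

```python
# Hypothetical sketch, not HBase code: names and the slop value are illustrative.
def is_overloaded(server_load: float, avg_load: float, slop: float = 0.1) -> bool:
    """Return True only if the server exceeds the average by more than the margin."""
    return server_load > avg_load * (1.0 + slop)


# Region counts per server from the report (assumed to stand in for "load").
loads = [10, 13, 13, 15]
avg = sum(loads) / len(loads)  # 12.75

# With no margin, every server above the average is a rebalancing candidate.
narrow = [is_overloaded(load, avg, slop=0.0) for load in loads]

# With a 10% margin, only the server that is clearly over the average is flagged.
wide = [is_overloaded(load, avg, slop=0.1) for load in loads]
```

The tradeoff is that a margin set too wide would also stop flagging the server at 15, so the imbalance would simply persist instead of oscillating.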
However, if the problem is that the definitions of "overloaded" from the
perspective of load reported by the regionserver (this is some combo of regions
and requests right now, yes?) and the math done by the rebalancing code are
inherently different, then we'll need to either make the existing load
balancing on assignment dumber or the rebalancing smarter. In the long run,
we're definitely going to want to do the latter, but it requires us to start
tracking requests and anything else that goes into the balancing computation at
the region level, as well as actually reporting that information when the
workers check in with their short list of reassignable regions. That way, when
we're deciding how many regions to unassign, we can make informed decisions,
rather than just trying to equalize averages.
> Region balancer oscillates during cluster startup
> -------------------------------------------------
>
> Key: HBASE-615
> URL: https://issues.apache.org/jira/browse/HBASE-615
> Project: Hadoop HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.2.0
> Reporter: Jim Kellerman
> Assignee: Bryan Duxbury
> Fix For: 0.2.0
>
>
> When starting a cluster with four region servers and a large table (49
> regions, plus root and meta, for 51 total regions), the region balancer
> oscillates for a very long time and does not seem to reach a steady state.
> Additionally, for whatever reason, it seems reluctant to assign regions to
> the first of four region servers, which may be the root cause. In my test,
> the first server had 10 regions assigned, the second and fourth had 13
> regions assigned, and the master would continually assign 2 regions to the
> third server and then unassign them, so it oscillated between 13 and 15 regions. If
> it assigned the two fluctuating regions to the first server, it would achieve
> the best balance possible: 12, 13, 13, 13.
> After 20 minutes, it had not stopped oscillating. An application trying to
> work against this cluster would run very slowly as it would be continually
> re-finding the two regions in flux.
> When the table was being created, regions were nicely balanced. On restart,
> however, it just would not settle down.
> Perhaps the balancer should set a target number of regions for each server.
> Once a server is within +/- 1 region of that target, the rebalancer would
> not try to change it unless the total number of regions changed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.