[ https://issues.apache.org/jira/browse/HBASE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629060#action_12629060 ]
Andrew Purtell commented on HBASE-862:
--------------------------------------
For cluster startup, how about:
* Master waits a short interval for regionserver start messages to arrive.
* After the initial waiting period has elapsed, begin assigning regions based
on the current count of announced regionservers.
* If additional servers report in, try to get their region count up to the
current average before assigning additional regions to earlier reporters.
* Once all regions have been assigned, wait a dampening period before starting
the balancer.
Combine the above with changes to start-hbase.sh and hbase-daemons.sh to start
the regionservers ahead of the master and start the regionservers in parallel
rather than serially, and I think the startup behavior will improve.
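
The startup steps above can be sketched roughly as follows. This is only an illustration, not HBase code: the function name `assign_regions`, the `late_arrivals` mapping, and the server and region names are all hypothetical. It assumes the initial waiting period has already elapsed and that region count is the only measure of load. Handing each region to the announced server with the fewest regions means a late reporter is filled up to the current average before earlier reporters receive more.

```python
import heapq


def assign_regions(regions, initial_servers, late_arrivals=None):
    """Hypothetical sketch: give each region to the least-loaded announced server.

    late_arrivals maps "number of regions already assigned" to the list of
    servers that report in at that point (an assumption made for this demo).
    """
    late_arrivals = late_arrivals or {}
    # Min-heap of (region_count, server): the emptiest server is popped first.
    heap = [(0, server) for server in initial_servers]
    heapq.heapify(heap)
    assignment = {server: [] for server in initial_servers}

    for assigned_so_far, region in enumerate(regions):
        # Late reporters join with zero regions, so they sort ahead of everyone.
        for server in late_arrivals.get(assigned_so_far, []):
            assignment[server] = []
            heapq.heappush(heap, (0, server))
        if not heap:
            raise ValueError("no regionserver has announced itself yet")
        count, server = heapq.heappop(heap)
        assignment[server].append(region)
        heapq.heappush(heap, (count + 1, server))
    return assignment


if __name__ == "__main__":
    regions = ["r%d" % i for i in range(12)]
    # rs3 reports in after six regions have already been handed out.
    result = assign_regions(regions, ["rs1", "rs2"], {6: ["rs3"]})
    print({server: len(held) for server, held in result.items()})
    # prints {'rs1': 4, 'rs2': 4, 'rs3': 4}
```

In this run rs3 receives the next three regions in a row, which brings it level with rs1 and rs2 before either of them is given a fourth. The dampening period before the balancer starts is not modeled here.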
Balancing regions in steady state, especially if there are late or new arrivals
to the cluster, is a different proposition, I think. Billy had some ideas about
that in an earlier comment. Also, some special-case handling of META is in
order.
Concerning balancing in steady state, 'load balancing' attempts to ensure that
the workload on each host is within a small degree of the workload present on
every other host in the system. 'Load leveling', on the other hand, is a more
relaxed approach that only seeks to avoid congestion on any one host. Balancing
is proactive. Leveling is reactive. I think both achieve the same end over time
(with balancing "trying harder"), and since leveling is simpler and requires
little work or coordination on the part of the master, I thought I'd try that
first.
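
To make the distinction concrete, here is a minimal sketch of the two policies. It is illustrative only and not how the master is implemented. It assumes that "workload" is simply the region count per server, and the names `balance_moves`, `level_moves`, `slop`, and `congestion_threshold` are made up for this example.

```python
def balance_moves(load, slop=1):
    """Proactive: keep moving regions until every server is within `slop`
    regions of every other server. Returns a list of (source, target) moves."""
    if slop < 1:
        raise ValueError("slop must be at least 1 or the loop may never settle")
    load = dict(load)  # work on a copy
    moves = []
    while True:
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        if load[busiest] - load[idlest] <= slop:
            return moves
        load[busiest] -= 1
        load[idlest] += 1
        moves.append((busiest, idlest))


def level_moves(load, congestion_threshold):
    """Reactive: only servers above the threshold shed regions, and only
    until they are back at the threshold. Everything else is left alone."""
    load = dict(load)  # work on a copy
    moves = []
    for server in sorted(load):
        while load[server] > congestion_threshold:
            idlest = min(load, key=load.get)
            # Stop if there is nowhere to put a region without congesting it.
            if idlest == server or load[idlest] >= congestion_threshold:
                break
            load[server] -= 1
            load[idlest] += 1
            moves.append((server, idlest))
    return moves


if __name__ == "__main__":
    load = {"rs1": 10, "rs2": 6, "rs3": 2}
    print(len(balance_moves(load)))   # prints 4 (ends at 6/6/6)
    print(len(level_moves(load, 8)))  # prints 2 (ends at 8/6/4)
```

With the same starting load, leveling makes half as many moves here and needs only a per-server threshold check, at the cost of leaving the cluster less even.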
> region balancing is clumsy
> --------------------------
>
> Key: HBASE-862
> URL: https://issues.apache.org/jira/browse/HBASE-862
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: stack
>
> Daniel Leffel has an install of 500 regions on 4 nodes. He's running 0.2.0.
> On restart, load balancing is running while the 600 regions are being
> initially opened. Makes for churn. Load balancing should wait before it
> cuts in.
> Have also seen on occasion that it will not find equilibrium after a restart.
> Adding a node is catastrophic. >20% of the regions were closed and were
> taking the longest time to show up on the new server. I would think that the
> region balancing would work in a more sophisticated and gradual manner.