[ https://issues.apache.org/jira/browse/HBASE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627907#action_12627907 ]
Billy Pearson commented on HBASE-862:
-------------------------------------
+1, I see this too.
I also see MR jobs fail often if I add a region server to the cluster while a
job is running.
I think this is sometimes caused by closing regions that are in the middle of a
lengthy compaction: they do not close for a while, so they cannot be redeployed
promptly.
What if the request to close a region for rebalancing were distinct from a
normal close call, and the region server had the option to decline it?
For example, say the master asks a region server to close a small group of
regions for redeployment, and the region server has one or more of those
regions queued for compaction. The region server could reply to the master
declining the regions that are in the compaction queue, or that have an open
scanner on them, etc.
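A rough sketch of the decline idea, assuming the region server can see its own
compaction queue and a per-region count of open scanners. All class and method
names here (BalanceCloseDecider, decide, Decision) are hypothetical and are not
part of the HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types exist in HBase.
class BalanceCloseDecider {
    /** Outcome of a balance-driven close request: regions accepted vs. declined. */
    static final class Decision {
        final List<String> accepted = new ArrayList<>();
        final List<String> declined = new ArrayList<>();
    }

    /**
     * Decides which requested regions to close. A region is declined if it is in
     * the compaction queue or has at least one open scanner.
     */
    static Decision decide(List<String> requested,
                           Set<String> compactionQueue,
                           Map<String, Integer> openScanners) {
        Decision d = new Decision();
        for (String region : requested) {
            boolean compacting = compactionQueue.contains(region);
            boolean scanned = openScanners.getOrDefault(region, 0) > 0;
            if (compacting || scanned) {
                d.declined.add(region);  // reported back to the master, which can retry later
            } else {
                d.accepted.add(region);  // safe to close now for redeployment
            }
        }
        return d;
    }
}
```

The master would then only wait on the accepted regions and could offer the
declined ones again on a later balancer cycle.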
I would also slow down redeployment to 1-3 regions per cycle, waiting until all
the moved regions are open again before moving more.
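A minimal sketch of that throttling, assuming the master tracks which moved
regions have not yet been reported open on their new server. ThrottledMover,
nextBatch and regionOpened are hypothetical names for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a throttled balancer cycle; not HBase code.
class ThrottledMover {
    private final int maxPerCycle;                             // e.g. 1-3 regions per cycle
    private final Deque<String> pending = new ArrayDeque<>();  // regions the balancer wants to move
    private final Set<String> inFlight = new HashSet<>();      // closed but not yet reopened

    ThrottledMover(int maxPerCycle, List<String> regionsToMove) {
        this.maxPerCycle = maxPerCycle;
        this.pending.addAll(regionsToMove);
    }

    /** Regions to close this cycle; empty while earlier moves are still reopening. */
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>();
        if (!inFlight.isEmpty()) {
            return batch;  // wait until every moved region is open again
        }
        while (batch.size() < maxPerCycle && !pending.isEmpty()) {
            String region = pending.poll();
            batch.add(region);
            inFlight.add(region);
        }
        return batch;
    }

    /** Called when a region server reports the region open on its new host. */
    void regionOpened(String region) {
        inFlight.remove(region);
    }
}
```

With maxPerCycle = 2 and three regions to move, the first cycle closes two,
the next cycles do nothing until both report open, and only then is the third
region moved.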
We might also build some slack into the per-server region counts, so that a
region is less likely to be moved when a server is only 1-3 regions or 1-5% out
of balance.
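One way to express that slack, as a sketch: a server only sheds regions above
the average plus a tolerance, where the tolerance is the larger of a fixed
region count and a fraction of the average. The class name, method name and
the example thresholds are assumptions, not existing HBase settings.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a balance tolerance; thresholds are illustrative.
class BalanceSlack {
    /**
     * Returns, per server, how many regions exceed avg + tolerance.
     * Servers inside the tolerated band get 0 and are left alone.
     */
    static Map<String, Integer> regionsToShed(Map<String, Integer> regionsPerServer,
                                              int slackRegions, double slackFraction) {
        Map<String, Integer> shed = new TreeMap<>();
        if (regionsPerServer.isEmpty()) {
            return shed;
        }
        int total = 0;
        for (int n : regionsPerServer.values()) total += n;
        double avg = (double) total / regionsPerServer.size();
        // Tolerate whichever is larger: a fixed number of regions or a fraction of the average.
        double tolerance = Math.max(slackRegions, slackFraction * avg);
        int ceiling = (int) Math.floor(avg + tolerance);
        for (Map.Entry<String, Integer> e : regionsPerServer.entrySet()) {
            shed.put(e.getKey(), Math.max(0, e.getValue() - ceiling));
        }
        return shed;
    }
}
```

For example, with 600 regions on 4 servers (average 150), slackRegions = 3 and
slackFraction = 0.05 give a ceiling of 157, so a 150/148/152/150 split moves
nothing, while a server holding 170 would shed 13 regions.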
I would like to see the balancer keep everything even, but I would be OK with it
leaving things a little out of balance.
Maybe we could use something like the lease timeout variable from the config to
define how often the balancer runs a cycle.
Further down the road, my wish list is for region servers to report the load on
each of their regions to the master in the heartbeat, so we can generate
read/write load numbers per region/table/server/cluster/etc.
With this data we could be more sophisticated about which regions to move and
when.
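A small sketch of rolling such per-region numbers up, assuming each heartbeat
carried a read and write count per region. RegionLoad and LoadRollup are
hypothetical names; the real heartbeat format would need to be designed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch: per-region load as it might be reported in a heartbeat.
class LoadRollup {
    static final class RegionLoad {
        final String region, table, server;
        final long reads, writes;

        RegionLoad(String region, String table, String server, long reads, long writes) {
            this.region = region;
            this.table = table;
            this.server = server;
            this.reads = reads;
            this.writes = writes;
        }
    }

    /** Sums reads + writes per grouping key (for example per table or per server). */
    static Map<String, Long> totalBy(List<RegionLoad> loads, Function<RegionLoad, String> key) {
        Map<String, Long> totals = new TreeMap<>();
        for (RegionLoad l : loads) {
            totals.merge(key.apply(l), l.reads + l.writes, Long::sum);
        }
        return totals;
    }
}
```

Grouping by server shows which hosts are actually busy, which a balancer could
weigh alongside plain region counts.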
> region balancing is clumsy
> --------------------------
>
> Key: HBASE-862
> URL: https://issues.apache.org/jira/browse/HBASE-862
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: stack
>
> Daniel Leffel has an install of 500 regions on 4 nodes. He's running 0.2.0.
> On restart, load balancing is running while the 600 regions are being
> initially opened. Makes for churn. Load balancing should wait before it
> cuts in.
> Have also seen on occasion that it will not find equilibrium after a restart.
> Adding a node is catastrophic. >20% of the regions were closed and were
> taking the longest time to show up on the new server. I would think that the
> region balancing would work in a more sophisticated and gradual manner.