[ 
https://issues.apache.org/jira/browse/HBASE-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627907#action_12627907
 ] 

Billy Pearson commented on HBASE-862:
-------------------------------------

+1 I see this also. 

I also see MR jobs fail often if I add a region server to the cluster while the 
job is running.
I think this is sometimes caused by closing regions that are running a lengthy 
compaction and will not close for a while to be redeployed.

What if the request to close a region for redeployment were different from a 
normal close call, giving the region server the option to decline it?
For example, say the master asks a region server to close a small group of 
regions for redeployment, and the region server has one or more of those 
regions queued up for compaction.
The region server could send a response back to the master declining the 
regions that are in the compaction queue, have an open scanner on them, etc.
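
A rough sketch of the decline idea above, written in Python for brevity rather than as HBase code. Every name here (Region, handle_rebalance_close, the two "busy" fields) is hypothetical and only illustrates the proposal: the region server closes the idle regions and hands the busy ones back to the master.

```python
from dataclasses import dataclass


@dataclass
class Region:
    # Hypothetical view of the state a region server tracks per region.
    name: str
    queued_for_compaction: bool = False
    open_scanners: int = 0


def handle_rebalance_close(regions):
    """Split a rebalance close request into (closed, declined) region names.

    A region is declined if it is queued for compaction or has an open scanner.
    """
    closed, declined = [], []
    for region in regions:
        busy = region.queued_for_compaction or region.open_scanners > 0
        (declined if busy else closed).append(region.name)
    return closed, declined


requested = [
    Region("r1"),
    Region("r2", queued_for_compaction=True),
    Region("r3", open_scanners=2),
]
closed, declined = handle_rebalance_close(requested)
print("closed:", closed, "declined:", declined)
```

The master would then have to pick other regions, or retry the declined ones in a later cycle.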

Also, I would slow down the redeployment of regions to 1-3 per cycle, and 
wait until all of those regions are open again before moving more.
We might also build some slack into the per-server numbers, so that a region 
is less likely to be moved if a server is only 1-3 regions or 1-5% out of 
balance.
I would like to see the balancer keep everything even, but I would be OK with 
it leaving things a little out of balance.
Maybe we can use something like the lease timeout variable from the config to 
define how often the balancer runs a cycle.
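
The throttling and slack ideas above could look something like the following Python sketch. It is not the HBase balancer. The function name, its parameters, and the choice to take the larger of the two tolerances are assumptions made for illustration.

```python
import math


def plan_moves(regions_per_server, slop_regions=3, slop_fraction=0.05, max_moves=3):
    """Plan at most max_moves (source, target) region moves for one cycle.

    Servers within the tolerance of the cluster average are left alone.
    The tolerance is the larger of slop_regions and slop_fraction * average.
    """
    counts = dict(regions_per_server)
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    tolerance = max(slop_regions, math.ceil(average * slop_fraction))
    moves = []
    while len(moves) < max_moves:
        busiest = max(counts, key=counts.get)
        idlest = min(counts, key=counts.get)
        over = counts[busiest] - average
        under = average - counts[idlest]
        if over <= tolerance and under <= tolerance:
            break  # close enough to even; do not churn regions
        counts[busiest] -= 1
        counts[idlest] += 1
        moves.append((busiest, idlest))
    return moves


# Four loaded servers plus a freshly added empty one.
print(plan_moves({"a": 150, "b": 150, "c": 150, "d": 150, "e": 0}))
```

The caller would wait for the moved regions to reopen before asking for the next batch, so adding a node drains a few regions per cycle instead of 20% at once.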

My down-the-road wish list: one day be able to report back to the master, in 
the heartbeat, the load on the regions that a region server holds, and 
generate read/write load numbers per region/table/server/cluster/etc.
With this data we could be more sophisticated about which regions to move and 
when.
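
As a sketch of what the master could do with such reports, the Python below rolls per-region read/write counts up to server, table, and cluster totals. The tuple layout of a report is invented for illustration and is not an existing heartbeat format.

```python
from collections import defaultdict


def aggregate_load(reports):
    """Roll per-region load up to server, table, and cluster totals.

    reports: iterable of (server, table, region, reads, writes) tuples,
    a hypothetical payload carried in each heartbeat.
    Returns (per_server, per_table, cluster), each total being [reads, writes].
    """
    per_server = defaultdict(lambda: [0, 0])
    per_table = defaultdict(lambda: [0, 0])
    cluster = [0, 0]
    for server, table, _region, reads, writes in reports:
        for bucket in (per_server[server], per_table[table], cluster):
            bucket[0] += reads
            bucket[1] += writes
    return dict(per_server), dict(per_table), cluster


reports = [
    ("rs1", "users", "users,a", 100, 10),
    ("rs1", "logs", "logs,a", 5, 500),
    ("rs2", "users", "users,m", 40, 4),
]
per_server, per_table, cluster = aggregate_load(reports)
print(per_server, per_table, cluster)
```

With totals like these, a balancer could weigh regions by load rather than only counting them.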


> region balancing is clumsy
> --------------------------
>
>                 Key: HBASE-862
>                 URL: https://issues.apache.org/jira/browse/HBASE-862
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>
> Daniel Leffel has an install of 500 regions on 4 nodes.  He's running 0.2.0.
> On restart, load balancing is running while the 600 regions are being 
> initially opened.  Makes for churn.  Load balancing should wait before it 
> cuts in.
> Have also seen on occasion that it will not find equilibrium after a restart.
> Adding a node is catastrophic.  >20% of the regions were closed and were 
> taking the longest time to show up on the new server.  I would think that the 
> region balancing would work in more sophisticated and gradual manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
