[jira] Commented: (HBASE-3420) Handling a big rebalance, we can queue multiple instances of a Close event; messes up state

stack (JIRA) Wed, 05 Jan 2011 10:13:11 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977890#action_12977890
 ]


stack commented on HBASE-3420:
------------------------------

It's a timeout of a close.  Here is the sequence:

{code}
2011-01-05 00:49:37,670 INFO org.apache.hadoop.hbase.master.HMaster: balance 
hri=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041., 
src=sv2borg181,60020,1294096110452, dest=sv2borg188,60020,1294187735582
2011-01-05 00:49:37,670 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. (offlining)
2011-01-05 00:49:37,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Sent CLOSE to serverName=sv2borg181,60020,1294096110452, load=(requests=0, 
regions=0, usedHeap=0, maxHeap=0) for region 
TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
2011-01-05 00:49:38,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:60000-0x12d3de9e7c60e37 Retrieved 112 byte(s) of data from znode 
/hbase/unassigned/b1fa38bb610943e9eadc604babe4d041 and set watcher; 
region=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041., 
server=sv2borg181,60020,1294096110452, 
state=RS_ZK_REGION_CLOSED
2011-01-05 00:49:38,385 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling new unassigned node: 
/hbase/unassigned/b1fa38bb610943e9eadc604babe4d041 
(region=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041., 
server=sv2borg181,60020,1294096110452, state=RS_ZK_REGION_CLOSED)
2011-01-05 00:49:38,385 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_CLOSED, server=sv2borg181,60020,1294096110452, 
region=b1fa38bb610943e9eadc604babe4d041
2011-01-05 00:50:12,412 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  
TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. 
state=CLOSED, ts=1294188578211
2011-01-05 00:50:12,412 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been CLOSED for too long, retriggering ClosedRegionHandler
{code}
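
One way to keep the timeout monitor from piling up handlers would be to remember which regions already have a close handler queued and skip the resubmit while one is pending. The following is only a minimal sketch of that idea, not the HBase code: the class PendingCloseGuard and its methods are hypothetical names made up for illustration, and the region is identified by its encoded name as a plain string.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical guard (not part of HBase): tracks regions that already
 * have a close handler sitting in the executor queue.
 */
class PendingCloseGuard {
    // Encoded region names with a handler currently queued or running.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only for the first caller for a given region.
     * Later callers get false and should skip submitting another handler.
     */
    boolean tryMarkQueued(String encodedRegionName) {
        return pending.add(encodedRegionName);
    }

    /** Called when the handler finishes, so a later timeout may queue again. */
    void markDone(String encodedRegionName) {
        pending.remove(encodedRegionName);
    }
}
```

Under this assumption the timeout branch would submit a new handler only when tryMarkQueued returns true, and the handler would call markDone in a finally block when it completes.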


> Handling a big rebalance, we can queue multiple instances of a Close event; 
> messes up state
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3420
>                 URL: https://issues.apache.org/jira/browse/HBASE-3420
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.0
>            Reporter: stack
>             Fix For: 0.90.1
>
>
> This is pretty ugly.  In short, on a heavily loaded cluster, we are queuing 
> multiple instances of region close.  They all try to run, confusing state.
> Long version:
> I have a messy cluster.  It's 16k regions on 8 servers.  One node has 5k or so 
> regions on it.  Heaps are 1G all around.  My master had OOME'd.  Not sure why 
> but not too worried about it for now.  So, new master comes up and is trying 
> to rebalance the cluster:
> {code}
> 2011-01-05 00:48:07,385 INFO org.apache.hadoop.hbase.master.LoadBalancer: 
> Calculated a load balance in 14ms. Moving 3666 regions off of 6 overloaded 
> servers onto 3 less loaded servers
> {code}
> The balancer ends up sending so many closes to a single overloaded server 
> that they take too long, and the close times out in RIT.  We then do this:
> {code}
>               case CLOSED:
>                 LOG.info("Region has been CLOSED for too long, " +
>                     "retriggering ClosedRegionHandler");
>                 AssignmentManager.this.executorService.submit(
>                     new ClosedRegionHandler(master, AssignmentManager.this,
>                         regionState.getRegion()));
>                 break;
> {code}
> We queue a new close (Should we?).
> We time out a few more times (9 times) and each time we queue a new close.
> Eventually the close succeeds, the region gets assigned a new location.
> Then the next close pops off the eventhandler queue.
> Here is the telltale signature of stuff gone amiss:
> {code}
> 2011-01-05 00:52:19,379 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. 
> state=OPEN, ts=1294188709030
> {code}
> Notice how state is OPEN when we are forcing offline (It was actually just 
> successfully opened).  We end up assigning to the same server because the 
> plan was still around:
> {code}
> 2011-01-05 00:52:20,705 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted 
> open of TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. 
> but already online on this server
> {code}
> But later, when the plan is cleared, we assign a new server and we have a 
> double assignment.
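
The "Forcing OFFLINE; ... state=OPEN" line suggests a stale handler acting on a region that has since moved on. A queued close handler could re-check the current state when it finally runs and do nothing unless the region is still CLOSED. This is a rough sketch under that assumption only: StaleCloseCheck, its State enum, and the map standing in for regions-in-transition are all hypothetical and do not mirror the real AssignmentManager.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the master's view of region state (not HBase code). */
class StaleCloseCheck {
    enum State { OFFLINE, PENDING_CLOSE, CLOSED, OPENING, OPEN }

    // Region name -> last known state; illustrative only.
    private final Map<String, State> regionStates = new ConcurrentHashMap<>();

    void setState(String region, State state) {
        regionStates.put(region, state);
    }

    /**
     * Returns true if a queued close handler should still act on the region.
     * A region that is unknown, or that has moved past CLOSED (for example
     * already OPEN again), makes the handler a no-op.
     */
    boolean shouldProcessClose(String region) {
        return regionStates.get(region) == State.CLOSED;
    }
}
```

With a check like this, the extra handlers queued by the repeated timeouts would find the region OPEN and return without forcing it offline.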

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
