[
https://issues.apache.org/jira/browse/HBASE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977890#action_12977890
]
stack commented on HBASE-3420:
------------------------------
Its timeout of a close. Here is sequence:
{code}
2011-01-05 00:49:37,670 INFO org.apache.hadoop.hbase.master.HMaster: balance
hri=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.,
src=sv2borg181,60020,1294096110452, dest=sv2borg188,60020,1294187735582
2011-01-05 00:49:37,670 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. (offlining)
2011-01-05 00:49:37,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Sent CLOSE to serverName=sv2borg181,60020,1294096110452, load=(requests=0,
regions=0, usedHeap=0, maxHeap=0) for region
TestTable,0487405776,1294125523541.
b1fa38bb610943e9eadc604babe4d041.
2011-01-05 00:49:38,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
master:60000-0x12d3de9e7c60e37 Retrieved 112 byte(s) of data from znode
/hbase/unassigned/b1fa38bb610943e9eadc604babe4d041 and set watcher;
region=TestTable,0487405776,1294125523541.
b1fa38bb610943e9eadc604babe4d041., server=sv2borg181,60020,1294096110452,
state=RS_ZK_REGION_CLOSED
2011-01-05 00:49:38,385 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling new unassigned node:
/hbase/unassigned/b1fa38bb610943e9eadc604babe4d041
(region=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.,
server=sv2borg181,60020, 1294096110452, state=RS_ZK_REGION_CLOSED)
2011-01-05 00:49:38,385 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_CLOSED, server=sv2borg181,60020,1294096110452,
region=b1fa38bb610943e9eadc604babe4d041
2011-01-05 00:50:12,412 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
state=CLOSED, ts=1294188578211
2011-01-05 00:50:12,412 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been CLOSED for too long, retriggering ClosedRegionHandler
{code}
> Handling a big rebalance, we can queue multiple instances of a Close event;
> messes up state
> -------------------------------------------------------------------------------------------
>
> Key: HBASE-3420
> URL: https://issues.apache.org/jira/browse/HBASE-3420
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.0
> Reporter: stack
> Fix For: 0.90.1
>
>
> This is pretty ugly. In short, on a heavily loaded cluster, we are queuing
> multiple instances of region close. They all try to run confusing state.
> Long version:
> I have a messy cluster. Its 16k regions on 8 servers. One node has 5k or so
> regions on it. Heaps are 1G all around. My master had OOME'd. Not sure why
> but not too worried about it for now. So, new master comes up and is trying
> to rebalance the cluster:
> {code}
> 2011-01-05 00:48:07,385 INFO org.apache.hadoop.hbase.master.LoadBalancer:
> Calculated a load balance in 14ms. Moving 3666 regions off of 6 overloaded
> servers onto 3 less loaded servers
> {code}
> The balancer ends up sending many closes to a single overloaded server are
> taking so long, the close times out in RIT. We then do this:
> {code}
> case CLOSED:
> LOG.info("Region has been CLOSED for too long, " +
> "retriggering ClosedRegionHandler");
> AssignmentManager.this.executorService.submit(
> new ClosedRegionHandler(master, AssignmentManager.this,
> regionState.getRegion()));
> break;
> {code}
> We queue a new close (Should we?).
> We time out a few more times (9 times) and each time we queue a new close.
> Eventually the close succeeds, the region gets assigned a new location.
> Then the next close pops off the eventhandler queue.
> Here is the telltale signature of stuff gone amiss:
> {code}
> 2011-01-05 00:52:19,379 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
> state=OPEN, ts=1294188709030
> {code}
> Notice how state is OPEN when we are forcing offline (It was actually just
> successfully opened). We end up assigning same server because plan was still
> around:
> {code}
> 2011-01-05 00:52:20,705 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted
> open of TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
> but already online on this server
> {code}
> But later when plan is cleared, we assign new server and we have
> dbl-assignment.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.