[ https://issues.apache.org/jira/browse/HBASE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977930#action_12977930 ]
Jonathan Gray commented on HBASE-3420:
--------------------------------------
It's unrelated to the notion of "checkins" (which is almost completely gone
now) so not sure why we would reuse this config param. We could set per-RS
limits but that would probably require significantly more hack-up of the
balancing algo.
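
A rough illustration of what a per-RS limit could look like: a post-processing step that caps how many region moves are issued against any one source server in a single balancer run. This is a hypothetical sketch, not HBase's LoadBalancer code; the class `PerServerMoveCap`, the `cap` method, and the `{region, sourceServer}` string pairs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: limit the moves taken off any single server per run. */
class PerServerMoveCap {
    /**
     * Keeps at most maxPerServer plans per source server, preserving order.
     * Each plan is a pair {regionName, sourceServer}.
     */
    static List<String[]> cap(List<String[]> plans, int maxPerServer) {
        Map<String, Integer> seen = new HashMap<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] plan : plans) {
            // Count this plan against its source server.
            int n = seen.merge(plan[1], 1, Integer::sum);
            if (n <= maxPerServer) {
                kept.add(plan);
            }
        }
        return kept;
    }
}
```

Plans dropped by the cap would simply be picked up by a later balancer run, which is where the extra changes to the balancing algorithm would come in.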
> Handling a big rebalance, we can queue multiple instances of a Close event;
> messes up state
> -------------------------------------------------------------------------------------------
>
> Key: HBASE-3420
> URL: https://issues.apache.org/jira/browse/HBASE-3420
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.0
> Reporter: stack
> Fix For: 0.90.1
>
> Attachments: 3420.txt
>
>
> This is pretty ugly. In short, on a heavily loaded cluster, we are queuing
> multiple instances of region close. They all try to run, confusing state.
> Long version:
> I have a messy cluster. It's 16k regions on 8 servers. One node has 5k or so
> regions on it. Heaps are 1G all around. My master had OOME'd. Not sure why
> but not too worried about it for now. So, new master comes up and is trying
> to rebalance the cluster:
> {code}
> 2011-01-05 00:48:07,385 INFO org.apache.hadoop.hbase.master.LoadBalancer:
> Calculated a load balance in 14ms. Moving 3666 regions off of 6 overloaded
> servers onto 3 less loaded servers
> {code}
> The balancer ends up sending so many closes to a single overloaded server that
> they take too long, and the close times out in RIT (regions in transition).
> We then do this:
> {code}
> case CLOSED:
>   LOG.info("Region has been CLOSED for too long, " +
>       "retriggering ClosedRegionHandler");
>   AssignmentManager.this.executorService.submit(
>       new ClosedRegionHandler(master, AssignmentManager.this,
>           regionState.getRegion()));
>   break;
> {code}
> We queue a new close (Should we?).
> We time out a few more times (9 times) and each time we queue a new close.
> Eventually the close succeeds, the region gets assigned a new location.
> Then the next close pops off the eventhandler queue.
> Here is the telltale signature of stuff gone amiss:
> {code}
> 2011-01-05 00:52:19,379 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
> state=OPEN, ts=1294188709030
> {code}
> Notice how state is OPEN when we are forcing offline (it was actually just
> successfully opened). We end up assigning to the same server because the plan
> was still around:
> {code}
> 2011-01-05 00:52:20,705 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted
> open of TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041.
> but already online on this server
> {code}
> But later, when the plan is cleared, we assign a new server and we have a
> double assignment.
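
One way to avoid the duplicate closes described above is to remember which regions already have a close handler queued and to skip the retrigger for those. The following is a minimal sketch of that idea only. It is not the attached 3420.txt patch and not actual HBase code: `CloseRetryGuard`, `submitClose`, and `runNext` are hypothetical names, and a plain queue of region names stands in for the master's executor service.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical sketch: allow at most one queued close handler per region. */
class CloseRetryGuard {
    // Regions that already have a close handler waiting in the queue.
    private final Set<String> pending = new HashSet<>();
    // Stand-in for the executor's event handler queue.
    private final Queue<String> queue = new ArrayDeque<>();

    /** Queues a close handler unless one is already pending; true if queued. */
    synchronized boolean submitClose(String regionName) {
        if (!pending.add(regionName)) {
            return false; // timeout fired again, but a handler is already queued
        }
        queue.add(regionName);
        return true;
    }

    /** Takes the next handler off the queue and clears its pending mark. */
    synchronized String runNext() {
        String region = queue.poll();
        if (region != null) {
            pending.remove(region);
        }
        return region; // null when the queue is empty
    }

    synchronized int queued() {
        return queue.size();
    }

    public static void main(String[] args) {
        CloseRetryGuard guard = new CloseRetryGuard();
        // Simulate the initial close plus nine timeout retriggers.
        for (int i = 0; i < 10; i++) {
            guard.submitClose("TestTable,0487405776");
        }
        System.out.println(guard.queued()); // prints 1
    }
}
```

With a guard like this, the repeated timeouts leave only one handler in the queue, so no stale close is left to pop off after the region has been reopened. Whether the timeout should retrigger at all is a separate question that the sketch does not answer.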