[jira] Commented: (HBASE-3420) Handling a big rebalance, we can queue multiple instances of a Close event; messes up state

Jonathan Gray (JIRA) Wed, 05 Jan 2011 11:45:10 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977930#action_12977930
 ]


Jonathan Gray commented on HBASE-3420:
--------------------------------------

It's unrelated to the notion of "checkins" (which is almost completely gone 
now) so not sure why we would reuse this config param.  We could set per-RS 
limits but that would probably require significantly more hack-up of the 
balancing algo.

> Handling a big rebalance, we can queue multiple instances of a Close event; 
> messes up state
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3420
>                 URL: https://issues.apache.org/jira/browse/HBASE-3420
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.0
>            Reporter: stack
>             Fix For: 0.90.1
>
>         Attachments: 3420.txt
>
>
> This is pretty ugly.  In short, on a heavily loaded cluster, we are queuing 
> multiple instances of region close.  They all try to run confusing state.
> Long version:
> I have a messy cluster.  Its 16k regions on 8 servers.  One node has 5k or so 
> regions on it.  Heaps are 1G all around.  My master had OOME'd.  Not sure why 
> but not too worried about it for now.  So, new master comes up and is trying 
> to rebalance the cluster:
> {code}
> 2011-01-05 00:48:07,385 INFO org.apache.hadoop.hbase.master.LoadBalancer: 
> Calculated a load balance in 14ms. Moving 3666 regions off of 6 overloaded 
> servers onto 3 less loaded servers
> {code}
> The balancer ends up sending many closes to a single overloaded server are 
> taking so long, the close times out in RIT.  We then do this:
> {code}
>               case CLOSED:
>                 LOG.info("Region has been CLOSED for too long, " +
>                     "retriggering ClosedRegionHandler");
>                 AssignmentManager.this.executorService.submit(
>                     new ClosedRegionHandler(master, AssignmentManager.this,
>                         regionState.getRegion()));
>                 break;
> {code}
> We queue a new close (Should we?).
> We time out a few more times (9 times) and each time we queue a new close.
> Eventually the close succeeds, the region gets assigned a new location.
> Then the next close pops off the eventhandler queue.
> Here is the telltale signature of stuff gone amiss:
> {code}
> 2011-01-05 00:52:19,379 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. 
> state=OPEN, ts=1294188709030
> {code}
> Notice how state is OPEN when we are forcing offline (It was actually just 
> successfully opened).  We end up assigning same server because plan was still 
> around:
> {code}
> 2011-01-05 00:52:20,705 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted 
> open of TestTable,0487405776,1294125523541.b1fa38bb610943e9eadc604babe4d041. 
> but already online on this server
> {code}
> But later when plan is cleared, we assign new server and we have 
> dbl-assignment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3420) Handling a big rebalance, we can queue multiple instances of a Close event; messes up state

Reply via email to