[
https://issues.apache.org/jira/browse/HBASE-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-3039:
-------------------------
Attachment: 3039.txt
Here is the fix: remove the region from regionsInTransition on receipt of the
split message. This will do for now, but I think there are likely other holes in
state transitions, probably around split, since this is the one action the master
does not control. Plugging the holes is easier in the new master. We just have to
find them.
Here is what the patch does. I'm testing it now.
{code}
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
Add region name to warning log message (w/o it message is no good).
M src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Add src of split message else need to deduce where it came from by looking
elsewhere.
M src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
Updated log messages to include the region and, where appropriate, the source
server name; debugging is hard w/o them.
Changed regionOnline and regionOffline to check for unexpected
states and log warnings rather than proceed regardless.
Added in fix for concurrent balance+split; the split message now
updates regionsInTransition where previously it did not.
Removed checkRegion method. It's a reimplementation of
what regionOnline and regionOffline do, only less comprehensive
as regards what gets updated (this.regions + this.servers rather
than this.regions, this.servers and regionsInTransition).
That it was less comprehensive is the root of this bug.
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Make the message about why we are not running balancer richer
(print out how many regions are in transition and more of the
regionsInTransition list).
M src/main/java/org/apache/hadoop/hbase/executor/RegionTransitionData.java
Javadoc and minor formatting.
{code}
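To make the root cause concrete, here is a minimal sketch of the bookkeeping the notes above describe. It is not the patch itself: RegionTracker, startUnassign and handleSplit are hypothetical names, the three maps only stand in for this.regions, this.servers and regionsInTransition, and putting both daughters online on the reporting server is an assumption. The point it illustrates is that a split report routed through regionOffline/regionOnline also clears the parent's PENDING_CLOSE entry, so the region can no longer sit in regionsInTransition until it times out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the master's bookkeeping.
// It is NOT the real AssignmentManager; names and signatures are assumptions.
class RegionTracker {
    // region name -> hosting server (stands in for this.regions)
    private final Map<String, String> regions = new HashMap<>();
    // server name -> regions it hosts (stands in for this.servers)
    private final Map<String, Set<String>> servers = new HashMap<>();
    // region name -> transition state (stands in for regionsInTransition)
    private final Map<String, String> regionsInTransition = new HashMap<>();

    // Updates all three structures, including regionsInTransition.
    void regionOnline(String region, String server) {
        regionsInTransition.remove(region);
        regions.put(region, server);
        servers.computeIfAbsent(server, s -> new HashSet<>()).add(region);
    }

    // Updates all three structures, including regionsInTransition.
    void regionOffline(String region) {
        regionsInTransition.remove(region);
        String server = regions.remove(region);
        if (server != null) {
            Set<String> hosted = servers.get(server);
            if (hosted != null) {
                hosted.remove(region);
            }
        }
    }

    // The balancer asks for a region to be closed; it is now in transition.
    void startUnassign(String region) {
        regionsInTransition.put(region, "PENDING_CLOSE");
    }

    // A split report arrives while the parent may still be PENDING_CLOSE.
    // Going through regionOffline/regionOnline clears the stale entry.
    void handleSplit(String parent, String daughterA, String daughterB, String server) {
        regionOffline(parent);
        regionOnline(daughterA, server);
        regionOnline(daughterB, server);
    }

    boolean isInTransition(String region) {
        return regionsInTransition.containsKey(region);
    }

    String serverOf(String region) {
        return regions.get(region);
    }
}
```

In this model, a handler that touched only the first two maps on split (as the removed checkRegion did) would leave the parent's PENDING_CLOSE entry behind, which is the stuck state seen in the logs of the issue.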
> Stuck in regionsInTransition because rebalance came in at same time as a split
> ------------------------------------------------------------------------------
>
> Key: HBASE-3039
> URL: https://issues.apache.org/jira/browse/HBASE-3039
> Project: HBase
> Issue Type: Bug
> Components: master
> Reporter: stack
> Fix For: 0.90.0
>
> Attachments: 3039.txt
>
>
> Saw this doing cluster tests:
> {code}
> 2010-09-25 21:31:48,212 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
> running balancer because regions in transition:
> {73781e505e452221c9cd0e03585eb5d1=usertable,user800184056,
> 128...
> {code}
> Here's the problem:
> {code}
> 2010-09-25 08:16:48,186 INFO org.apache.hadoop.hbase.master.HMaster: balance
> hri=usertable,user800184056,1285397376525.73781e505e452221c9cd0e03585eb5d1.,
> src=su184,60020,
> 1285371621579, dest=sv2borg189,60020,1285371621577
> 2010-09-25 08:16:48,186 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region usertable,user800184056,1285397376525.
> 73781e505e452221c9cd0e03585eb5d1. (offlining)
> 2010-09-25 08:16:52,656 INFO org.apache.hadoop.hbase.master.ServerManager:
> Received REGION_SPLIT:
> usertable,user800184056,1285397376525.73781e505e452221c9cd0e03585eb5d1.:
>
> Daughters;
> usertable,user800184056,1285402609029.c05825561e7ea3cc6507c70bfb21541a.,
> usertable,user804024623,1285402609029.28f64903a7875bdafc1e7ee344b225b0.
> 2010-09-25 08:17:11,414 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: usertable,user800184056,1285397376525.
> 73781e505e452221c9cd0e03585eb5d1. state=PENDING_CLOSE, ts=1285402608186
> {code}
> ....just as we were doing a balance, the region split.
> Over on RS, I see the split starting up and then in comes the balance 'close'
> message. By the time the close handler runs on regionserver the split is
> well underway and the close handler actually doesn't find an online region to
> close.