[
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068843#comment-13068843
]
gaojinchao commented on HBASE-4064:
-----------------------------------
I reproduced the issue by the above code. and modifed for below code, the issue
can't reproduce.(about 2000 region)
So I think V2 is good..
public void regionOffline(final HRegionInfo regionInfo) {
// remove the region plan as well just in case.
clearRegionPlan(regionInfo);
setOffline(regionInfo);
try{
Thread.sleep(1000);
}catch(Throwable e){
;
}
synchronized(this.regionsInTransition) {
if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null)
{
this.regionsInTransition.notifyAll();
}
}
}
I don't collect the logs. So I produce it again. the logs:
//disable table start.
hbase(main):002:0> disable 'ufdr5'
11/07/21 15:07:25 DEBUG zookeeper.ZKUtil: hconnection-0x314b823b0a0005 Set
watcher on existing znode /hbase/root-region-server
hmaster logs:
2011-07-21 15:10:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. state=CLOSED,
ts=1311232060203
2011-07-21 15:10:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region 32cb070f97e3b7b0d261f427a0fc110a has been CLOSED for too long, waiting
on queued ClosedRegionHandler to run or server shutdown
2011-07-21 15:12:25,672 DEBUG
org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
event for 32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:12:25,672 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Table being disabled so deleting ZK node and removing from regions in
transition, skipping assignment of region
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
2011-07-21 15:12:25,672 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x314b823b0a0000 Deleting existing unassigned node for
32cb070f97e3b7b0d261f427a0fc110a that is in expected state RS_ZK_REGION_CLOSED
2011-07-21 15:12:25,673 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
master:60000-0x314b823b0a0000 Retrieved 107 byte(s) of data from znode
/hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a;
data=region=ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.,
server=C4C2.site,60020,1311231851946, state=RS_ZK_REGION_CLOSED
2011-07-21 15:12:25,688 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
master:60000-0x314b823b0a0000 Received ZooKeeper Event, type=NodeDeleted,
state=SyncConnected, path=/hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:12:25,688 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x314b823b0a0000 Successfully deleted unassigned node for region
32cb070f97e3b7b0d261f427a0fc110a in expected state RS_ZK_REGION_CLOSED
2011-07-21 15:12:26,383 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. (offlining)
2011-07-21 15:12:26,391 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Server serverName=C4C2.site,60020,1311231851946, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. but we are not
serving it for 32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:15:32,300 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
state=PENDING_CLOSE, ts=1311232346385
2011-07-21 15:15:32,301 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_CLOSE for too long, running forced unassign again on
region=ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
2011-07-21 15:15:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
master:60000-0x314b823b0a0000 Set watcher on existing znode
/hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:15:32,303 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. (offlining)
2011-07-21 15:15:32,303 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Attempted to unassign region
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. but it is not
currently assigned anywhere
2011-07-21 15:15:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
state=PENDING_CLOSE, ts=1311232346385
2011-07-21 15:15:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_CLOSE for too long, running forced unassign again on
region=ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
2011-07-21 15:15:42,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
master:60000-0x314b823b0a0000 Set watcher on existing znode
/hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:15:42,303 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. (offlining)
The log after modified code
//I do this several times
hbase(main):004:0> disable 'ufdr5'
11/07/21 15:42:52 DEBUG zookeeper.ZKUtil: hconnection-0x314b967b920003 Set
watcher on existing znode /hbase/root-region-server
2011-07-21 15:50:21,443 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
master:60000-0x1314b9678830000 Received ZooKeeper Event,
type=NodeChildrenChanged, state=SyncConnected, path=/hbase/unassigned
2011-07-21 15:50:21,571 INFO
org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table is
done=true
2011-07-21 15:51:26,908 DEBUG org.apache.hadoop.hbase.master.LoadBalancer:
Server information: C4C3.site,60020,1311233178997=0,
C4C2.site,60020,1311233178981=1, C4C1.site,60020,1311233179039=1
2011-07-21 15:51:26,908 INFO org.apache.hadoop.hbase.master.LoadBalancer:
Skipping load balancing. servers=3 regions=2 average=0.6666667 mostloaded=1
leastloaded=0
2011-07-21 15:51:27,402 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor:
Scanned 2156 catalog row(s) and gc'd 0 unreferenced parent region(s)
> Two concurrent unassigning of the same region caused the endless loop of
> "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-4064
> URL: https://issues.apache.org/jira/browse/HBASE-4064
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.3
> Reporter: Jieshan Bean
> Fix For: 0.90.5
>
> Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in
> regionsInTransition(The RegionState was remained by some exception which
> should be removed, that's why I called it as "rubbish" object), but the
> region is not currently assigned anywhere, TimeoutMonitor will fall into an
> endless loop:
> 2011-06-27 10:32:21,326 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> (offlining)
> 2011-06-27 10:32:21,441 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is
> not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> (offlining)
> 2011-06-27 10:32:31,215 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is
> not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> (offlining)
> 2011-06-27 10:32:41,172 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is
> not currently assigned anywhere
> .....
> 2 In the following scenario, two concurrent unassigning call of the same
> region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event
> of "RS_ZK_REGION_CLOSED", process this event, will create a
> ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in
> "hbase.master.executor.closeregion.threads" thread (A), another unassign call
> of same region run in another thread(B).
> while thread B run "if (!regions.containsKey(region))", this.regions have
> the region info, now cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and
> "regionsInTransition", then switch to thread B. the thread B run continue,
> will throw an exception with the msg of "Server null returned
> java.lang.NullPointerException: Passed server is null for
> 9a6e26d40293663a79523c58315b930f", but without removing the new-adding
> RegionState from "regionsInTransition",and it can not be removed for ever.
> public void unassign(HRegionInfo region, boolean force) {
> LOG.debug("Starting unassignment of region " +
> region.getRegionNameAsString() + " (offlining)");
> synchronized (this.regions) {
> // Check if this region is currently assigned
> if (!regions.containsKey(region)) {
> LOG.debug("Attempted to unassign region " +
> region.getRegionNameAsString() + " but it is not " +
> "currently assigned anywhere");
> return;
> }
> }
> String encodedName = region.getEncodedName();
> // Grab the state of this region and synchronize on it
> RegionState state;
> long stamp = -1;
> synchronized (regionsInTransition) {
> state = regionsInTransition.get(encodedName);
> if (state == null) {
> state = new RegionState(region, RegionState.State.PENDING_CLOSE);
> stamp =state.getStamp();
> regionsInTransition.put(encodedName, state);
> } else if (force && state.isPendingClose()) {
> LOG.debug("Attempting to unassign region " +
> region.getRegionNameAsString() + " which is already pending close
> "
> + "but forcing an additional close");
> state.update(RegionState.State.PENDING_CLOSE);
> } else {
> LOG.debug("Attempting to unassign region " +
> region.getRegionNameAsString() + " but it is " +
> "already in transition (" + state.getState() + ")");
> return;
> }
> }
>
> // Send CLOSE RPC
> HServerInfo server = null;
> synchronized (this.regions) {
> server = regions.get(region);
> }
> }
> 2011-06-27 10:20:59,583 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> (offlining)
> 2011-06-27 10:20:59,585 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
> serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0,
> usedHeap=0, maxHeap=0) for region
> test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:22:15,299 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned
> node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f
> (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.,
> server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED)
> 2011-06-27 10:22:15,299 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292,
> region=9a6e26d40293663a79523c58315b930f
> 2011-06-27 10:25:22,636 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> state=CLOSED, ts=1309141335247
> 2011-06-27 10:25:22,636 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region
> 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on
> queued ClosedRegionHandler to run or server shutdown
> 2011-06-27 10:25:55,022 DEBUG
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
> event for 9a6e26d40293663a79523c58315b930f
> 2011-06-27 10:25:55,022 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so
> deleting ZK node and removing from regions in transition, skipping assignment
> of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:20000-0x230ba8b85230036 Deleting existing unassigned node for
> 9a6e26d40293663a79523c58315b930f that is in expected state
> RS_ZK_REGION_CLOSED
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:20000-0x230ba8b85230036 Successfully deleted unassigned node for
> region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED
> 2011-06-27 10:25:55,301 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> (offlining)
> 2011-06-27 10:25:55,302 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned
> java.lang.NullPointerException: Passed server is null for
> 9a6e26d40293663a79523c58315b930f
> 2011-06-27 10:32:21,326 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> (offlining)
> 2011-06-27 10:32:21,441 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is
> not currently assigned anywhere
> 3 The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test.
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout,
> 'DisableTableHandler.process' will create another BulkDisabler object and
> start its thread pool. The region which was still online will call
> AssignmentManager.unassign again. so the same region
> "AssignmentManager.unassign" could be called concurrentlly more than 1.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira