[ 
https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-4064:
------------------------------------------

    Fix Version/s:     (was: 0.90.6)
                   0.90.7
    
> Two concurrent unassigning of the same region caused the endless loop of 
> "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, 
> disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in 
> regionsInTransition(The RegionState was remained by some exception which 
> should be removed, that's why I called it as "rubbish" object), but the 
> region is not currently assigned anywhere, TimeoutMonitor will fall into an 
> endless loop:
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining)
> 2011-06-27 10:32:21,441 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining)
> 2011-06-27 10:32:31,215 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining)
> 2011-06-27 10:32:41,172 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same 
> region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event 
> of "RS_ZK_REGION_CLOSED", process this event, will create a 
> ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  
> "hbase.master.executor.closeregion.threads" thread (A), another unassign call 
> of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have 
> the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and 
> "regionsInTransition", then switch to thread B. the thread B run continue, 
> will throw an exception with the msg of "Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 9a6e26d40293663a79523c58315b930f", but without removing the new-adding 
> RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close 
> "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining) 
> 2011-06-27 10:20:59,585 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0) for region 
> test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
> node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f 
> (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., 
> server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, 
> region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region 
> 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on 
> queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x230ba8b85230036 Deleting existing unassigned node for 
> 9a6e26d40293663a79523c58315b930f that is in expected state 
> RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x230ba8b85230036 Successfully deleted unassigned node for 
> region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining) 
> 2011-06-27 10:25:55,302 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining) 
> 2011-06-27 10:32:21,441 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 
> 'DisableTableHandler.process' will create another BulkDisabler object and 
> start its thread pool. The region which was still online will call 
> AssignmentManager.unassign again. so the same region 
> "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to