[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210629#comment-13210629 ]

stack commented on HBASE-5422:
------------------------------

I think I understand what is going on.  Help me out Chunhui.  So, on region 
open, we update the RIT timers.  When bulk opening, we are not adding plans 
to this.regionPlans, so when an open comes in from a bulk assign, since we go 
against this.regionPlans, we'll not update the timers of the other 
outstanding RITs?  This seems like a nice bug fix.

I see that BulkReOpen adds to this.regionPlans but it does it one at a time.  
Should it use your putAll?  Maybe we should make an addPlan method that takes a 
Map of plans and have it used by BulkReOpen and by BulkOpen?
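For example, a rough sketch of what such a helper might look like (the name 
addPlans, the key type, and the RegionPlan constructor below are my guesses, 
not taken from the attached patches):

    // Hypothetical AssignmentManager helper: take a whole map of plans at
    // once instead of adding them one at a time.
    void addPlans(final Map<String, RegionPlan> plans) {
      synchronized (this.regionPlans) {
        // this.regionPlans is keyed by encoded region name
        this.regionPlans.putAll(plans);
      }
    }

    // A bulk caller (BulkReOpen or the startup bulk assign) could then build
    // its plans up front and hand them over in one call:
    Map<String, RegionPlan> plans = new HashMap<String, RegionPlan>();
    for (HRegionInfo hri : regions) {
      plans.put(hri.getEncodedName(), new RegionPlan(hri, null, destination));
    }
    addPlans(plans);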
                
> StartupBulkAssigner would cause a lot of timeouts on RIT when assigning 
> large numbers of regions (timeout = 3 mins)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5422
>                 URL: https://issues.apache.org/jira/browse/HBASE-5422
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: chunhui shen
>         Attachments: 5422-90.patch, hbase-5422.patch
>
>
> In our production environment we found a lot of RIT timeouts when the 
> cluster came up; there are about 70,000 regions in the cluster 
> (25 regionservers).
> First, we could see the following log (see the region 
> 33cf229845b1009aa8a3f7b0f85c9bd0):
> Master's log:
> 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:60000-0x348f4a94723da5 Async create of unassigned node for 
> 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state 
> 2012-02-13 18:07:42,560 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
>  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> state=OFFLINE, ts=1329127661409, 
> server=r03f11025.yh.aliyun.com,60020,1329127549907 
> 2012-02-13 18:07:42,996 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
>  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> state=OFFLINE, ts=1329127661409 
> 2012-02-13 18:10:48,072 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> state=PENDING_OPEN, ts=1329127662996
> 2012-02-13 18:10:48,072 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_OPEN for too long, reassigning 
> region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> 2012-02-13 18:11:16,744 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_OPENED, 
> server=r03f11025.yh.aliyun.com,60020,1329127549907, 
> region=33cf229845b1009aa8a3f7b0f85c9bd0 
> 2012-02-13 18:38:07,310 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
> event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node 
> 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:60000-0x348f4a94723da5 Deleting existing unassigned node for 
> 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state 
> RS_ZK_REGION_OPENED 
> 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:60000-0x348f4a94723da5 Successfully deleted unassigned node for region 
> 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED 
> 2012-02-13 18:38:07,573 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on 
> r03f11025.yh.aliyun.com,60020,1329127549907 
> 2012-02-13 18:50:54,428 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
> was found (or we are ignoring an existing plan) for 
> item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so 
> generated a random one; 
> hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., 
> src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, 
> exclude=null) available servers 
> 2012-02-13 18:50:54,428 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to 
> r01b05043.yh.aliyun.com,60020,1329127549041 
> 2012-02-13 19:31:50,514 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> state=PENDING_OPEN, ts=1329132528086 
> 2012-02-13 19:31:50,514 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_OPEN for too long, reassigning 
> region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> Regionserver's log:
> 2012-02-13 18:07:43,537 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
> region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> 2012-02-13 18:11:16,560 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
> open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
> Through the RS's log, we can see that more than 3 minutes elapsed between 
> receiving the openRegion request and starting to process it, causing the RIT 
> timeout on the master for this region.
> Looking at the code of StartupBulkAssigner, we can see that regionPlans are 
> not added when assigning regions; therefore, when one region is opened, the 
> master will not updateTimers of the other regions whose destination is the 
> same server.
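
As a rough illustration of the timer-update path the description refers to 
(simplified; the field and method names below are approximations of the 
AssignmentManager code, not an exact copy):

    // When server sn reports a region as opened, the master walks
    // this.regionPlans and refreshes the RIT timestamp of every other region
    // whose plan targets that same server, so opens still queued on a busy
    // server do not hit the 3 minute RIT timeout.
    void updateTimers(final ServerName sn) {
      for (Map.Entry<String, RegionPlan> e : this.regionPlans.entrySet()) {
        if (!sn.equals(e.getValue().getDestination())) continue;
        RegionState state = this.regionsInTransition.get(e.getKey());
        if (state != null) state.updateTimestampToNow();
      }
    }

Because StartupBulkAssigner never puts its plans into this.regionPlans, the 
loop above finds nothing to refresh, and the regions still waiting in the 
regionserver's open queue time out and are reassigned.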

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
