StartupBulkAssigner would cause a lot of timeout on RIT when assigning large
numbers of regions (timeout = 3 mins)
------------------------------------------------------------------------------------------------------------------
Key: HBASE-5422
URL: https://issues.apache.org/jira/browse/HBASE-5422
Project: HBase
Issue Type: Bug
Components: master
Reporter: chunhui shen
In our production environment, we see many RIT (region-in-transition) timeouts
when the cluster starts up. There are about 70,000 regions in the cluster
(25 regionservers).
First, consider the following logs (see region
33cf229845b1009aa8a3f7b0f85c9bd0):
Master's log:
2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x348f4a94723da5 Async create of unassigned node for
33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state
2012-02-13 18:07:42,560 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
state=OFFLINE, ts=1329127661409,
server=r03f11025.yh.aliyun.com,60020,1329127549907
2012-02-13 18:07:42,996 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
state=OFFLINE, ts=1329127661409
2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
state=PENDING_OPEN, ts=1329127662996
2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_OPEN for too long, reassigning
region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
2012-02-13 18:11:16,744 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_OPENED,
server=r03f11025.yh.aliyun.com,60020,1329127549907,
region=33cf229845b1009aa8a3f7b0f85c9bd0
2012-02-13 18:38:07,310 DEBUG
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED
event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node
2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x348f4a94723da5 Deleting existing unassigned node for
33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state RS_ZK_REGION_OPENED
2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x348f4a94723da5 Successfully deleted unassigned node for region
33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED
2012-02-13 18:38:07,573 DEBUG
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on
r03f11025.yh.aliyun.com,60020,1329127549907
2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
No previous transition plan was found (or we are ignoring an existing plan) for
item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so
generated a random one;
hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., src=,
dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, exclude=null)
available servers
2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Assigning region
item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to
r01b05043.yh.aliyun.com,60020,1329127549041
2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
state=PENDING_OPEN, ts=1329132528086
2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_OPEN for too long, reassigning
region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
Regionserver's log:
2012-02-13 18:07:43,537 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
2012-02-13 18:11:16,560 DEBUG
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open
of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
From the RS's log, we can see that more than 3 minutes passed between receiving
the openRegion request (18:07:43) and starting to process it (18:11:16), which
caused the RIT timeout in the master for this region.
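
To make the delay concrete, here is a small sketch that computes the gap between
the two regionserver timestamps quoted above and compares it with the 3-minute
timeout from the issue title. The class and method names (RitDelayCheck,
between) are made up for illustration and are not part of HBase.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not HBase code: measures how long an open request waited.
class RitDelayCheck {
    // Timestamp layout used in the log lines above, e.g. "2012-02-13 18:07:43,537".
    static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // RIT timeout as stated in the issue title (3 minutes).
    static final Duration RIT_TIMEOUT = Duration.ofMinutes(3);

    // Elapsed time between two log timestamps.
    static Duration between(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, LOG_TS),
                                LocalDateTime.parse(end, LOG_TS));
    }

    public static void main(String[] args) {
        // "Received request to open region" vs. "Processing open of".
        Duration wait = between("2012-02-13 18:07:43,537", "2012-02-13 18:11:16,560");
        System.out.println("waited " + wait.toMillis() + " ms, exceeds timeout: "
            + (wait.compareTo(RIT_TIMEOUT) > 0));
    }
}
```

The request waited about 213 seconds in the regionserver before processing
started, which is longer than the 180-second timeout.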
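Looking at the code of StartupBulkAssigner, we find that regionPlans are not
added when assigning regions. Therefore, when one region is opened, the master
does not update the timers (updateTimers) of the other regions whose destination
is the same server, so those regions can hit the timeout while they are still
waiting to be opened.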
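
The effect described above can be illustrated with a simplified model. This is
only a sketch under stated assumptions, not the actual AssignmentManager code:
the class RitTimerSketch, its maps, and its method names are hypothetical, and
time is passed in explicitly as milliseconds. It shows that resetting the timers
of regions bound for the same server only works if a plan was recorded for them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the master's bookkeeping, not HBase code.
class RitTimerSketch {
    // region name -> destination server (stand-in for the region plans)
    final Map<String, String> regionPlans = new HashMap<>();
    // region name -> time (ms) at which the region's RIT timer was last reset
    final Map<String, Long> ritTimestamps = new HashMap<>();

    // Start assigning a region; recordPlan=false models the reported behaviour.
    void assign(String region, String server, long now, boolean recordPlan) {
        ritTimestamps.put(region, now);
        if (recordPlan) {
            regionPlans.put(region, server);
        }
    }

    // A region finished opening: reset timers of regions planned for that server.
    void regionOpened(String openedRegion, String server, long now) {
        ritTimestamps.remove(openedRegion);
        regionPlans.remove(openedRegion);
        for (Map.Entry<String, String> plan : regionPlans.entrySet()) {
            if (plan.getValue().equals(server)
                    && ritTimestamps.containsKey(plan.getKey())) {
                ritTimestamps.put(plan.getKey(), now);
            }
        }
    }

    boolean timedOut(String region, long now, long timeoutMs) {
        Long ts = ritTimestamps.get(region);
        return ts != null && now - ts > timeoutMs;
    }

    // Two regions go to one server; does the second time out while queued?
    static boolean waitingRegionTimesOut(boolean recordPlan) {
        long timeoutMs = 180_000L; // 3 minutes, as in the issue title
        RitTimerSketch m = new RitTimerSketch();
        m.assign("region-a", "rs1", 0L, recordPlan);
        m.assign("region-b", "rs1", 0L, recordPlan);
        m.regionOpened("region-a", "rs1", 100_000L);
        return m.timedOut("region-b", 200_000L, timeoutMs);
    }

    public static void main(String[] args) {
        System.out.println("without plans: " + waitingRegionTimesOut(false)); // true
        System.out.println("with plans:    " + waitingRegionTimesOut(true));  // false
    }
}
```

In this model, the still-queued region times out only when no plan was recorded,
because nothing links it to the server that has just made progress.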