[ https://issues.apache.org/jira/browse/HBASE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435805#comment-13435805 ]
Zhihong Ted Yu edited comment on HBASE-6587 at 8/17/12 12:37 AM:
-----------------------------------------------------------------

@ram
{code}
2012-08-14 20:42:54,367 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Unable to determine a plan to assign .META.,,1.1028785192 state=OFFLINE, ts=1344948174367, server=null
{code}
After the above log, TimeoutMonitor set allRegionServersOffline to true.
{code}
2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. so generated a random one; hri=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222., src=, dest=dw92.kgb.sqa.cm4,60020,1344948267642; 1 (online=1, available=1) available servers
{code}
At 2012-08-14 20:44:31, one server has come online and region 277b9b6df6de2b9be1353b4fa25f4222 is successfully assigned.
However, at that moment TimeoutMonitor, in its chore(), still acts on timeout for this region, because the following if block returns true:
{code}
if (this.allRegionServersOffline && !allRSsOffline)
  return true;
{code}
So we see the following log:
{code}
2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. state=OPENING, ts=1344948272279, server=dw92.kgb.sqa.cm4,60020,1344948267642
2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222.
{code}
The region is assigned at 2012-08-14 20:44:31, but is timed out by TimeoutMonitor at 2012-08-14 20:44:32.
This causes two assign threads to collide, and as a result the region only comes online after 30 minutes.
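To make the timing above concrete, here is a minimal standalone sketch (not HBase code, and not the attached patch) that evaluates the timeout condition quoted in the issue description with the timestamps from the log. The class name, the State enum, the 30-minute timeout value, and the guarded variant that limits the "servers just came back" shortcut to regions still in OFFLINE state are all assumptions made for illustration.

```java
public class TimeoutMonitorSketch {
    // Hypothetical, simplified set of region states for this sketch.
    public enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN }

    // The condition as quoted in the issue description: act on timeout if the
    // region really timed out, or if servers were all offline and one is back.
    public static boolean actsOnTimeout(long stamp, long timeoutMs, long now,
                                        boolean allRegionServersOffline,
                                        boolean noRSAvailable) {
        return stamp + timeoutMs <= now
            || (allRegionServersOffline && !noRSAvailable);
    }

    // Hypothetical guard (an assumption, not the actual fix): take the
    // shortcut only for regions that nobody has started assigning yet.
    public static boolean actsOnTimeoutGuarded(State state, long stamp,
                                               long timeoutMs, long now,
                                               boolean allRegionServersOffline,
                                               boolean noRSAvailable) {
        boolean timedOut = stamp + timeoutMs <= now;
        boolean serversJustReturned = allRegionServersOffline && !noRSAvailable;
        return timedOut || (serversJustReturned && state == State.OFFLINE);
    }

    public static void main(String[] args) {
        long stamp = 1344948272279L;        // ts of the OPENING state in the log
        long now = 1344948272518L;          // 2012-08-14 20:44:32,518 in the log
        long timeoutMs = 30L * 60L * 1000L; // assumed 30-minute timeout

        // The flag is still true from the earlier run, while a server is now
        // available, so the freshly assigned region is treated as timed out.
        System.out.println(actsOnTimeout(stamp, timeoutMs, now, true, false));
        // prints "true"

        // With the hypothetical guard, the OPENING region is left alone.
        System.out.println(actsOnTimeoutGuarded(State.OPENING, stamp, timeoutMs,
                                                now, true, false));
        // prints "false"
    }
}
```

The region had been in OPENING state for only 239 ms when the check ran, so the elapsed-time clause is false and the stale allRegionServersOffline flag alone triggers the reassignment.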
was (Author: zjushch):
@ram
{code}
2012-08-14 20:42:54,367 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Unable to determine a plan to assign .META.,,1.1028785192 state=OFFLINE, ts=1344948174367, server=null
{code}
After the above log, TimeoutMonitor set allRegionServersOffline to true.
{code}
2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. so generated a random one; hri=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222., src=, dest=dw92.kgb.sqa.cm4,60020,1344948267642; 1 (online=1, available=1) available servers
{code}
At 2012-08-14 20:44:31, one server has come online and region 277b9b6df6de2b9be1353b4fa25f4222 is successfully assigned.
However, at that moment TimeoutMonitor, in its chore(), still acts on timeout for this region, because of the if block
{code}if (this.allRegionServersOffline && !allRSsOffline){code}
return true;
So we see the following log:
{code}
2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. state=OPENING, ts=1344948272279, server=dw92.kgb.sqa.cm4,60020,1344948267642
2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222.
{code}
The region is assigned at 2012-08-14 20:44:31, but is timed out by TimeoutMonitor at 2012-08-14 20:44:32.
This causes two assign threads to collide, and as a result the region only comes online after 30 minutes.
> Region would be assigned twice in the case of all RS offline
> ------------------------------------------------------------
>
>                 Key: HBASE-6587
>                 URL: https://issues.apache.org/jira/browse/HBASE-6587
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.1
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>             Fix For: 0.96.0
>
>         Attachments: 6587.patch, HBASE-6587.patch
>
>
> In the TimeoutMonitor, we act on timeout for a region if (this.allRegionServersOffline && !noRSAvailable) holds.
> The code is as follows:
> {code}
> if (regionState.getStamp() + timeout <= now ||
>     (this.allRegionServersOffline && !noRSAvailable)) {
>   // decide on action upon timeout or, if some RSs just came back online,
>   // we can start the assignment
>   actOnTimeOut(regionState);
> }
> {code}
> But we found a bug here: it also acts on timeout for a region that was assigned just now, which causes the region to be assigned twice.
> Master log for the region 277b9b6df6de2b9be1353b4fa25f4222:
> {code}
> 2012-08-14 20:42:54,367 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Unable to determine a plan to assign .META.,,1.1028785192 state=OFFLINE, ts=1344948174367, server=null
> 2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. so generated a random one; hri=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222., src=, dest=dw92.kgb.sqa.cm4,60020,1344948267642; 1 (online=1, available=1) available servers
> 2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x438f53bbf9b0acd Creating (or updating) unassigned node for 277b9b6df6de2b9be1353b4fa25f4222 with OFFLINE state
> 2012-08-14 20:44:31,643 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. to dw92.kgb.sqa.cm4,60020,1344948267642
> 2012-08-14 20:44:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=dw92.kgb.sqa.cm4,60020,1344948267642, region=277b9b6df6de2b9be1353b4fa25f4222
> // abnormal timeout
> 2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. state=OPENING, ts=1344948272279, server=dw92.kgb.sqa.cm4,60020,1344948267642
> 2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222.
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira