[jira] [Commented] (HBASE-6587) Region would be assigned twice in the case of all RS offline

chunhui shen (JIRA) Thu, 16 Aug 2012 00:10:49 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435805#comment-13435805
 ]


chunhui shen commented on HBASE-6587:
-------------------------------------

@ram
{code}
2012-08-14 20:42:54,367 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Unable to determine a plan to assign .META.,,1.1028785192 state=OFFLINE, ts=1344948174367, server=null
{code}

After the log line above, TimeoutMonitor sets allRegionServersOffline to true.

{code}
2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. so generated a random one; hri=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222., src=, dest=dw92.kgb.sqa.cm4,60020,1344948267642; 1 (online=1, available=1) available servers
{code}

At 2012-08-14 20:44:31 one region server has come online, and region 
277b9b6df6de2b9be1353b4fa25f4222 is successfully assigned to it.

However, at that moment the chore() of TimeoutMonitor still acts on the timeout, 
because the condition 
{code}if (this.allRegionServersOffline && !allRSsOffline){code} evaluates to true.

So we see the following log:
{code}
2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. state=OPENING, ts=1344948272279, server=dw92.kgb.sqa.cm4,60020,1344948267642
2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222.
{code}

The region is assigned at 2012-08-14 20:44:31, but it is timed out by 
TimeoutMonitor at 2012-08-14 20:44:32, only one second later. 
This makes two assign threads collide on the same region, 
and as a result the region only comes online after 30 minutes.
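
To make the numbers concrete, here is a minimal sketch of the check, not the actual HBase code. The condition is copied from the snippet in the issue description; the class name TimeoutCheckSketch and the method shouldActOnTimeout are hypothetical, the 30-minute timeout is an assumed value for illustration, and "now" is derived from the 20:44:32,518 log line on the assumption that it uses the same clock as the ts= field.

```java
// Hypothetical, standalone sketch; it only mirrors the condition quoted in the issue.
class TimeoutCheckSketch {

    // Same boolean expression as in the quoted TimeoutMonitor snippet.
    static boolean shouldActOnTimeout(long stamp, long timeout, long now,
                                      boolean allRegionServersOffline,
                                      boolean noRSAvailable) {
        return stamp + timeout <= now
            || (allRegionServersOffline && !noRSAvailable);
    }

    public static void main(String[] args) {
        long stamp = 1344948272279L;      // ts= from the "timed out" log line
        long now = 1344948272518L;        // assumed: 20:44:32,518 on the same clock
        long timeout = 30L * 60L * 1000L; // assumed 30-minute timeout

        // How long the region had actually been in OPENING state.
        System.out.println("elapsed ms: " + (now - stamp));

        // Stale flag still true while one RS is already available:
        // the second clause fires even though the real timeout has not elapsed.
        System.out.println("flag stale: "
            + shouldActOnTimeout(stamp, timeout, now, true, false));

        // With the flag already cleared, the region is left alone.
        System.out.println("flag cleared: "
            + shouldActOnTimeout(stamp, timeout, now, false, false));
    }
}
```

Under these assumptions the region had been OPENING for well under a second when it was reassigned, so only the allRegionServersOffline clause can explain the "timed out" log line.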
                
> Region would be assigned twice in the case of all RS offline
> ------------------------------------------------------------
>
>                 Key: HBASE-6587
>                 URL: https://issues.apache.org/jira/browse/HBASE-6587
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.1
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>             Fix For: 0.96.0
>
>         Attachments: 6587.patch, HBASE-6587.patch
>
>
> In the TimeoutMonitor, we act on the timeout for a region if 
> (this.allRegionServersOffline && !noRSAvailable).
> The code is as follows:
> {code}
>  if (regionState.getStamp() + timeout <= now ||
>           (this.allRegionServersOffline && !noRSAvailable)) {
>           // decide on action upon timeout or, if some RSs just came back
>           // online, we can start the assignment
>           actOnTimeOut(regionState);
>         }
> {code}
> But we found a bug: it also acts on the timeout for a region that was 
> assigned just now, which causes the region to be assigned twice.
> Master log for the region 277b9b6df6de2b9be1353b4fa25f4222:
> {code}
> 2012-08-14 20:42:54,367 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Unable to determine a plan to assign .META.,,1.1028785192 state=OFFLINE, ts=1344948174367, server=null
> 2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. so generated a random one; hri=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222., src=, dest=dw92.kgb.sqa.cm4,60020,1344948267642; 1 (online=1, available=1) available servers
> 2012-08-14 20:44:31,640 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x438f53bbf9b0acd Creating (or updating) unassigned node for 277b9b6df6de2b9be1353b4fa25f4222 with OFFLINE state
> 2012-08-14 20:44:31,643 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. to dw92.kgb.sqa.cm4,60020,1344948267642
> 2012-08-14 20:44:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=dw92.kgb.sqa.cm4,60020,1344948267642, region=277b9b6df6de2b9be1353b4fa25f4222
> // abnormal timeout
> 2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222. state=OPENING, ts=1344948272279, server=dw92.kgb.sqa.cm4,60020,1344948267642
> 2012-08-14 20:44:32,518 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=writetest,VHXYHJN0BL48HMR4DI1L,1344925649429.277b9b6df6de2b9be1353b4fa25f4222.
> {code}

