[ 
https://issues.apache.org/jira/browse/HBASE-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell resolved HBASE-9665.
----------------------------------------
    Resolution: Incomplete

> Region gets lost when balancer & SSH both trying to assign 
> -----------------------------------------------------------
>
>                 Key: HBASE-9665
>                 URL: https://issues.apache.org/jira/browse/HBASE-9665
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.96.0
>            Reporter: Jeffrey Zhong
>            Priority: Critical
>
> In summary, a server dies and its regions are reassigned by SSH 
> (ServerShutdownHandler). Just before SSH runs, the balancer starts moving 
> one of that server's regions to another server. 
> The balancer assignment was preempted by the SSH assignment:
> {code}
> 2013-09-25 11:55:32,854 INFO Priority.RpcServer.handler=7,port=60020 
> regionserver.HRegionServer: Received CLOSE for the 
> region:6deb1bfefe8cbdb443084efe919fdeb7 , which we are already trying to 
> OPEN. Cancelling OPENING.
> {code}
> The SSH assignment (by GeneralBulkAssigner) also failed, due to:
> {code}
> 2013-09-25 11:55:32,927 WARN  [RS_OPEN_REGION-hor15n09:60020-2] 
> zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to 
> transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from 
> M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to 
> transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the 
> expected hor15n07.gq1.ygridcore.net,60020,1380109890414
> {code}
> In the end, the region 6deb1bfefe8cbdb443084efe919fdeb7 is lost.
> The master log below shows both the balancer and SSH trying to assign the 
> region at around the same time:
> {code}
> 2013-09-25 11:55:32,731 INFO  [MASTER_SERVER_OPERATIONS-hor15n05:60000-4] 
> master.RegionStates: Transitioning {6deb1bfefe8cbdb443084efe919fdeb7 
> state=PENDING_CLOSE, ts=1380110132710, 
> server=hor15n12.gq1.ygridcore.net,60020,1380109596307} will be handled by SSH 
> for hor15n12.gq1.ygridcore.net,60020,1380109596307
> ...
> 2013-09-25 11:55:32,849 INFO  
> [hor15n05.gq1.ygridcore.net,60000,1380108611483-BalancerChore] 
> master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 
> state=OFFLINE, ts=1380110132768, server=null} to 
> {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132849, 
> server=hor15n07.gq1.ygridcore.net,60020,1380109890414}
> ...
> 2013-09-25 11:55:32,898 INFO  
> [hor15n05.gq1.ygridcore.net,60000,1380108611483-GeneralBulkAssigner-1] 
> master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 
> state=OFFLINE, ts=1380110132861, server=null} to 
> {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132898, 
> server=hor15n09.gq1.ygridcore.net,60020,1380109280320}
> {code}
> Because SSH forces the region assignment without recreating the offline 
> znode, the later region open fails with the error below. I suggest 
> recreating the offline znode whenever we force a region assignment 
> (forceNewPlan=true); the change should have low impact.
> {code}
> 2013-09-25 11:55:32,927 WARN  [RS_OPEN_REGION-hor15n09:60020-2] 
> zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to 
> transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from 
> M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to 
> transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the 
> expected hor15n07.gq1.ygridcore.net,60020,1380109890414
> {code}
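
The failure mode and the proposed fix described in the report can be sketched as a small in-memory model. This is only an illustration, not HBase code: the class `OfflineZnodeSketch`, its `Znode` type, and the methods `createOfflineNode`, `transitionToOpening`, and `forceAssign` are all hypothetical names, and a `HashMap` stands in for ZooKeeper. The only behavior taken from the report is that the OFFLINE to OPENING transition is rejected when the server attempting it is not the server recorded in the znode.

```java
import java.util.HashMap;
import java.util.Map;

public class OfflineZnodeSketch {
    /** Hypothetical stand-in for the data stored in an unassigned znode. */
    static final class Znode {
        final String state;   // e.g. "M_ZK_REGION_OFFLINE"
        final String server;  // server the master expects to open the region
        Znode(String state, String server) {
            this.state = state;
            this.server = server;
        }
    }

    // In-memory map standing in for ZooKeeper: region name -> znode data.
    static final Map<String, Znode> znodes = new HashMap<>();

    /** Master side: (re)create the offline znode naming the target server. */
    static void createOfflineNode(String region, String target) {
        znodes.put(region, new Znode("M_ZK_REGION_OFFLINE", target));
    }

    /** Region server side: succeeds only if the znode is OFFLINE and names this server. */
    static boolean transitionToOpening(String region, String server) {
        Znode n = znodes.get(region);
        if (n == null
                || !n.state.equals("M_ZK_REGION_OFFLINE")
                || !n.server.equals(server)) {
            return false; // mirrors the "not the expected ..." warning in the log
        }
        znodes.put(region, new Znode("RS_ZK_REGION_OPENING", server));
        return true;
    }

    /** Forced assignment to a new target, optionally recreating the offline znode first. */
    static boolean forceAssign(String region, String newTarget, boolean recreateOfflineNode) {
        if (recreateOfflineNode) {
            createOfflineNode(region, newTarget);
        }
        return transitionToOpening(region, newTarget);
    }

    public static void main(String[] args) {
        String region = "6deb1bfefe8cbdb443084efe919fdeb7";
        // The balancer's plan leaves an offline znode that names hor15n07.
        createOfflineNode(region, "hor15n07");
        // SSH forces a new plan to hor15n09 without recreating the znode: rejected.
        System.out.println(forceAssign(region, "hor15n09", false));
        // Recreating the offline znode for the new target lets the open proceed.
        System.out.println(forceAssign(region, "hor15n09", true));
    }
}
```

In this model the first forced assignment fails because the znode still names the balancer's target, which matches the warning above; recreating the znode for the new target removes the mismatch.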



--
This message was sent by Atlassian Jira
(v8.20.7#820007)