[ 
https://issues.apache.org/jira/browse/HBASE-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193657#comment-14193657
 ] 

Jimmy Xiang commented on HBASE-12398:
-------------------------------------

The master branch should not have such a problem because only master updates 
the region states (step b won't happen). So I think we don't need a patch for 
master.

> Region isn't assigned in an extreme race condition
> --------------------------------------------------
>
>                 Key: HBASE-12398
>                 URL: https://issues.apache.org/jira/browse/HBASE-12398
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.98.7
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>         Attachments: HBASE-12398.patch
>
>
> In a test, [~enis] has seen a condition which made one of the regions 
> unassigned. 
> The client failed since the region is not online anywhere: 
> {code}
> 2014-10-29 01:51:40,731 WARN  [HBaseReaderThread_13] 
> util.MultiThreadedReader: 
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=35, exceptions:
> Wed Oct 29 01:39:51 UTC 2014, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@cc21330, 
> org.apache.hadoop.hbase.NotServingRegionException: 
> org.apache.hadoop.hbase.NotServingRegionException: Region 
> IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
>  is not online on hor8n08.gq1.ygridcore.net,60020,1414546670414
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2774)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4257)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2906)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29990)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> The root cause of the issue is due to some extreme race condition:
> a) a region is about to open and receives a closeRpc request triggered by a 
> second re-assignment
> b) the second re-assignment updates region state to offline while immediately 
> is overwritten to OPEN from previous region open ZK opened notification
> c) when the region reopened on the same RS by the second assignment, AM force 
> the region to close as the its region state isn't in PendingOpenOrOpening 
> state.  
> d) the region ends up offline & can't server any request
> Region Server Side:
> 1) A region almost opens region 689b77e1bad7e951b0d9ef4663b217e9 while the 
> RS(hor8n10) receives a closeRegion request.
> {noformat}
> 2014-10-29 01:39:43,153 INFO  
> [PriorityRpcServer.handler=2,queue=0,port=60020] regionserver.HRegionServer: 
> Received CLOSE for the region:689b77e1bad7e951b0d9ef4663b217e9 , which we are 
> already trying to OPEN. Cancelling OPENING.
> {noformat}
> 2) Since region 689b77e1bad7e951b0d9ef4663b217e9 was already opened right 
> before some final steps, so the RS logs the following message and close 
> 689b77e1bad7e951b0d9ef4663b217e9 immediately after the RS update ZK node 
> state to 'OPENED'.
> {noformat}
> 2014-10-29 01:39:43,198 ERROR [RS_OPEN_REGION-hor8n10:60020-0] 
> handler.OpenRegionHandler: Race condition: we've finished to open a region, 
> while a close was requested  on 
> region=IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9..
>  It can be a critical error, as a region that should be closed is now opened. 
> Closing it now
> {noformat}
> In Master Server Side:
> {noformat}
> 2014-10-29 01:39:43,177 DEBUG [AM.ZK.Worker-pool2-t55] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=hor8n10.gq1.ygridcore.net,60020,1414546531945, 
> region=689b77e1bad7e951b0d9ef4663b217e9, 
> current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPENING, 
> ts=1414546783152, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> ....
> 2014-10-29 01:39:43,255 DEBUG [AM.-pool1-t16] master.AssignmentManager: 
> Offline 
> IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.,
>  it's not any more on hor8n10.gq1.ygridcore.net,60020,1414546531945
> ....
> 2014-10-29 01:39:43,942 DEBUG [AM.ZK.Worker-pool2-t58] 
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
> server=hor8n10.gq1.ygridcore.net,60020,1414546531945, 
> region=689b77e1bad7e951b0d9ef4663b217e9, 
> current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, ts=1414546783387, 
> server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> 2014-10-29 01:39:43,942 WARN  [AM.ZK.Worker-pool2-t58] 
> master.AssignmentManager: Received OPENED for 
> 689b77e1bad7e951b0d9ef4663b217e9 from 
> hor8n10.gq1.ygridcore.net,60020,1414546531945 but the region isn't 
> PENDING_OPEN/OPENING here: {689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, 
> ts=1414546783387, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> 2014-10-29 01:39:43,948 DEBUG [AM.ZK.Worker-pool2-t58] 
> master.AssignmentManager: Sent CLOSE to 
> hor8n10.gq1.ygridcore.net,60020,1414546531945 for region 
> IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to