[
https://issues.apache.org/jira/browse/HBASE-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeffrey Zhong updated HBASE-12398:
----------------------------------
Attachment: HBASE-12398.patch
The patch makes sure that we transition a region's state to open in checkAndPut
fashion so it won't overlap with another operation.
> Region isn't assigned in an extreme race condition
> --------------------------------------------------
>
> Key: HBASE-12398
> URL: https://issues.apache.org/jira/browse/HBASE-12398
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 0.98.7
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Attachments: HBASE-12398.patch
>
>
> In a test, [~enis] has seen a condition which made one of the regions
> unassigned.
> The client failed since the region is not online anywhere:
> {code}
> 2014-10-29 01:51:40,731 WARN [HBaseReaderThread_13]
> util.MultiThreadedReader:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=35, exceptions:
> Wed Oct 29 01:39:51 UTC 2014,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@cc21330,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region
> IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
> is not online on hor8n08.gq1.ygridcore.net,60020,1414546670414
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2774)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4257)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2906)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29990)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> at java.lang.Thread.run(Thread.java:722)
> {code}
> The root cause of the issue is due to some extreme race condition:
> a) a region is about to open and receives a closeRpc request triggered by a
> second re-assignment
> b) the second re-assignment updates region state to offline while immediately
> is overwritten to OPEN from previous region open ZK opened notification
> c) when the region reopened on the same RS by the second assignment, AM force
> the region to close as the its region state isn't in PendingOpenOrOpening
> state.
> d) the region ends up offline & can't server any request
> Region Server Side:
> 1) A region almost opens region 689b77e1bad7e951b0d9ef4663b217e9 while the
> RS(hor8n10) receives a closeRegion request.
> {noformat}
> 2014-10-29 01:39:43,153 INFO
> [PriorityRpcServer.handler=2,queue=0,port=60020] regionserver.HRegionServer:
> Received CLOSE for the region:689b77e1bad7e951b0d9ef4663b217e9 , which we are
> already trying to OPEN. Cancelling OPENING.
> {noformat}
> 2) Since region 689b77e1bad7e951b0d9ef4663b217e9 was already opened right
> before some final steps, so the RS logs the following message and close
> 689b77e1bad7e951b0d9ef4663b217e9 immediately after the RS update ZK node
> state to 'OPENED'.
> {noformat}
> 2014-10-29 01:39:43,198 ERROR [RS_OPEN_REGION-hor8n10:60020-0]
> handler.OpenRegionHandler: Race condition: we've finished to open a region,
> while a close was requested on
> region=IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9..
> It can be a critical error, as a region that should be closed is now opened.
> Closing it now
> {noformat}
> In Master Server Side:
> {noformat}
> 2014-10-29 01:39:43,177 DEBUG [AM.ZK.Worker-pool2-t55]
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED,
> server=hor8n10.gq1.ygridcore.net,60020,1414546531945,
> region=689b77e1bad7e951b0d9ef4663b217e9,
> current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPENING,
> ts=1414546783152, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> ....
> 2014-10-29 01:39:43,255 DEBUG [AM.-pool1-t16] master.AssignmentManager:
> Offline
> IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.,
> it's not any more on hor8n10.gq1.ygridcore.net,60020,1414546531945
> ....
> 2014-10-29 01:39:43,942 DEBUG [AM.ZK.Worker-pool2-t58]
> master.AssignmentManager: Handling RS_ZK_REGION_OPENED,
> server=hor8n10.gq1.ygridcore.net,60020,1414546531945,
> region=689b77e1bad7e951b0d9ef4663b217e9,
> current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, ts=1414546783387,
> server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> 2014-10-29 01:39:43,942 WARN [AM.ZK.Worker-pool2-t58]
> master.AssignmentManager: Received OPENED for
> 689b77e1bad7e951b0d9ef4663b217e9 from
> hor8n10.gq1.ygridcore.net,60020,1414546531945 but the region isn't
> PENDING_OPEN/OPENING here: {689b77e1bad7e951b0d9ef4663b217e9 state=OPEN,
> ts=1414546783387, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
> 2014-10-29 01:39:43,948 DEBUG [AM.ZK.Worker-pool2-t58]
> master.AssignmentManager: Sent CLOSE to
> hor8n10.gq1.ygridcore.net,60020,1414546531945 for region
> IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
> {noformat}
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)