Jeffrey Zhong created HBASE-12398:
-------------------------------------

             Summary: Region isn't assigned in a extreme race condition
                 Key: HBASE-12398
                 URL: https://issues.apache.org/jira/browse/HBASE-12398
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
    Affects Versions: 0.98.7
            Reporter: Jeffrey Zhong
            Assignee: Jeffrey Zhong


In a test, [~enis] has seen a condition which made one of the regions 
unassigned. 

The client failed since the region is not online anywhere: 
{code}
2014-10-29 01:51:40,731 WARN  [HBaseReaderThread_13] util.MultiThreadedReader: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=35, exceptions:
Wed Oct 29 01:39:51 UTC 2014, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@cc21330, 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region 
IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
 is not online on hor8n08.gq1.ygridcore.net,60020,1414546670414
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2774)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4257)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2906)
        at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29990)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
        at java.lang.Thread.run(Thread.java:722)
{code}

The root cause of the issue is due to some extreme race condition:

a) a region is about to open and receives a closeRpc request triggered by a 
second re-assignment
b) the second re-assignment updates region state to offline while immediately 
is overwritten to OPEN from previous region open ZK opened notification
c) when the region reopened on the same RS by the second assignment, AM force 
the region to close as the its region state isn't in PendingOpenOrOpening 
state.  
d) the region ends up offline & can't server any request

Region Server Side:
1) A region almost opens region 689b77e1bad7e951b0d9ef4663b217e9 while the 
RS(hor8n10) receives a closeRegion request.

{noformat}
2014-10-29 01:39:43,153 INFO  [PriorityRpcServer.handler=2,queue=0,port=60020] 
regionserver.HRegionServer: Received CLOSE for the 
region:689b77e1bad7e951b0d9ef4663b217e9 , which we are already trying to OPEN. 
Cancelling OPENING.
{noformat}

2) Since region 689b77e1bad7e951b0d9ef4663b217e9 was already opened right 
before some final steps, so the RS logs the following message and close 
689b77e1bad7e951b0d9ef4663b217e9 immediately after the RS update ZK node state 
to 'OPENED'.

{noformat}
2014-10-29 01:39:43,198 ERROR [RS_OPEN_REGION-hor8n10:60020-0] 
handler.OpenRegionHandler: Race condition: we've finished to open a region, 
while a close was requested  on 
region=IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9..
 It can be a critical error, as a region that should be closed is now opened. 
Closing it now
{noformat}

In Master Server Side:
{noformat}
2014-10-29 01:39:43,177 DEBUG [AM.ZK.Worker-pool2-t55] 
master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
server=hor8n10.gq1.ygridcore.net,60020,1414546531945, 
region=689b77e1bad7e951b0d9ef4663b217e9, 
current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPENING, 
ts=1414546783152, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
....
2014-10-29 01:39:43,255 DEBUG [AM.-pool1-t16] master.AssignmentManager: Offline 
IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.,
 it's not any more on hor8n10.gq1.ygridcore.net,60020,1414546531945
....
2014-10-29 01:39:43,942 DEBUG [AM.ZK.Worker-pool2-t58] 
master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
server=hor8n10.gq1.ygridcore.net,60020,1414546531945, 
region=689b77e1bad7e951b0d9ef4663b217e9, 
current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, ts=1414546783387, 
server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
2014-10-29 01:39:43,942 WARN  [AM.ZK.Worker-pool2-t58] 
master.AssignmentManager: Received OPENED for 689b77e1bad7e951b0d9ef4663b217e9 
from hor8n10.gq1.ygridcore.net,60020,1414546531945 but the region isn't 
PENDING_OPEN/OPENING here: {689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, 
ts=1414546783387, server=hor8n10.gq1.ygridcore.net,60020,1414546531945}
2014-10-29 01:39:43,948 DEBUG [AM.ZK.Worker-pool2-t58] 
master.AssignmentManager: Sent CLOSE to 
hor8n10.gq1.ygridcore.net,60020,1414546531945 for region 
IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.
{noformat}
 






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to