[ 
https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908673#comment-13908673
 ] 

Enis Soztutar commented on HBASE-10355:
---------------------------------------

Agreed, we need HBASE-10351 before this. I'll address Sergey's comments there. 
I am continuing to test this using HBASE-10572. One issue was that, after we 
fire the requests to the replicas and the first response comes back, we 
interrupt the remaining RPCs. If an interrupted RPC happens to be in the 
getLocationsFromMeta() phase, we end up removing all the cache entries pointing 
to that server with MetaCache.clearCache(ServerName), although the server did 
not fail. This happens quite often: 
{code}
2014-02-21 06:19:46,230 DEBUG [htable-pool68-t6] client.RegionServerCallable: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack.getRegionLocations(RpcRetryingCallerWithFallBack.java:253)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack.access$0(RpcRetryingCallerWithFallBack.java:242)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack$ReplicaRegionServerCallable.prepare(RpcRetryingCallerWithFallBack.java:106)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:120)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack$RetryingRPC.call(RpcRetryingCallerWithFallBack.java:148)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack$RetryingRPC.call(RpcRetryingCallerWithFallBack.java:1)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.InterruptedIOException
        at org.apache.hadoop.hbase.util.ExceptionUtil.asInterrupt(ExceptionUtil.java:62)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:297)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1528)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1165)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1038)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack.getRegionLocations(RpcRetryingCallerWithFallBack.java:246)
        ... 10 more
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1470)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1740)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:29240)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1524)
        ... 13 more
{code}

The following fragment solves the issue for me. Basically, we just rethrow 
InterruptedIOException instead of wrapping it. Can you take a look?
{code}
   private RegionLocations getRegionLocations(boolean useCache)
-      throws RetriesExhaustedException, DoNotRetryIOException {
+      throws RetriesExhaustedException, DoNotRetryIOException, InterruptedIOException {
     RegionLocations rl;
     try {
       rl = cConnection.locateRegion(tableName, get.getRow(), useCache, true);
+    } catch (DoNotRetryIOException e) {
+      throw e;
+    } catch (RetriesExhaustedException e) {
+      throw e;
+    } catch (InterruptedIOException e) {
+      throw e;
     } catch (IOException e) {
-      if (e instanceof DoNotRetryIOException) {
-        throw (DoNotRetryIOException) e;
-      } else if (e instanceof RetriesExhaustedException) {
-        throw (RetriesExhaustedException) e;
-      } else {
-        throw new RetriesExhaustedException("Can't get the location", e);
-      }
+      throw new RetriesExhaustedException("Can't get the location", e);
     }
     if (rl == null) {
       throw new RetriesExhaustedException("Can't get the locations");
{code}
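For reference, here is a minimal standalone sketch of the catch ordering in the fragment above. It assumes stand-in types: the nested RetriesExhaustedException and DoNotRetryIOException classes, the Locator interface, and the classify() helper are hypothetical placeholders, not the HBase classes. It only illustrates that an InterruptedIOException now passes through unwrapped, while any other IOException is still wrapped in a RetriesExhaustedException:

```java
import java.io.IOException;
import java.io.InterruptedIOException;

final class LocateErrorMapping {

    // Hypothetical stand-ins for the HBase exception types (both are IOExceptions).
    static class RetriesExhaustedException extends IOException {
        RetriesExhaustedException(String msg) { super(msg); }
        RetriesExhaustedException(String msg, Throwable cause) { super(msg, cause); }
    }

    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String msg) { super(msg); }
    }

    // Hypothetical stand-in for the region lookup done by the connection.
    interface Locator {
        String locate() throws IOException;
    }

    // Same shape as the patched method: the three specific types are rethrown
    // as-is, and every other IOException is wrapped.
    static String getRegionLocations(Locator locator)
            throws RetriesExhaustedException, DoNotRetryIOException, InterruptedIOException {
        String rl;
        try {
            rl = locator.locate();
        } catch (DoNotRetryIOException | RetriesExhaustedException | InterruptedIOException e) {
            throw e; // an interrupt is no longer reported as a lookup failure
        } catch (IOException e) {
            throw new RetriesExhaustedException("Can't get the location", e);
        }
        if (rl == null) {
            throw new RetriesExhaustedException("Can't get the locations");
        }
        return rl;
    }

    // Reports which exception type (if any) reaches the caller.
    static String classify(Locator locator) {
        try {
            getRegionLocations(locator);
            return "ok";
        } catch (InterruptedIOException e) {
            return "interrupted";
        } catch (DoNotRetryIOException e) {
            return "do-not-retry";
        } catch (RetriesExhaustedException e) {
            return "retries-exhausted";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> { throw new InterruptedIOException("cancelled"); }));
        System.out.println(classify(() -> { throw new IOException("connection closing"); }));
        System.out.println(classify(() -> "some-location"));
    }
}
```

With this shape, a caller can tell a cancelled lookup apart from a lookup that really failed, and can skip any cache cleanup in the interrupted case.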
I am also running into another problem: *a lot* of meta cache entries are 
being evicted although there is no CM running: 
{code}
2014-02-21 06:55:29,678 DEBUG [htable-pool89-t3] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740, hostname=hor8n04.gq1.ygridcore.net,60020,1392965421232, seqNum=0}, attempt=4 of 35 failed; retrying after sleep of 1005 because: IPC Client (1480703560) connection to hor8n04.gq1.ygridcore.net/68.142.245.215:60020 from hrt_qa is closing
{code}
Still looking into the root cause : ) 


> Failover RPC's from client using region replicas
> ------------------------------------------------
>
>                 Key: HBASE-10355
>                 URL: https://issues.apache.org/jira/browse/HBASE-10355
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Client
>            Reporter: Enis Soztutar
>            Assignee: Nicolas Liochon
>             Fix For: 0.99.0
>
>         Attachments: 10355.v1.patch, 10355.v2.patch, 10355.v3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
