[
https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908673#comment-13908673
]
Enis Soztutar commented on HBASE-10355:
---------------------------------------
Agreed, we need HBASE-10351 before this. I'll address Sergey's comments there.
I am continuing to test this using HBASE-10572. One issue is that after we
fire the requests for the replicas and the first one returns, we interrupt
the outstanding RPCs. If an interrupted RPC happens to be in the
getLocationsFromMeta() phase, we end up removing all the cache entries that
point to that server with MetaCache.clearCache(ServerName), although the
server did not fail. This happens quite often:
{code}
2014-02-21 06:19:46,230 DEBUG [htable-pool68-t6] client.RegionServerCallable:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack.getRegionLocations(RpcRetryingCallerWithFallBack.java:253)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack.access$0(RpcRetryingCallerWithFallBack.java:242)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack$ReplicaRegionServerCallable.prepare(RpcRetryingCallerWithFallBack.java:106)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:120)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack$RetryingRPC.call(RpcRetryingCallerWithFallBack.java:148)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack$RetryingRPC.call(RpcRetryingCallerWithFallBack.java:1)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.InterruptedIOException
    at org.apache.hadoop.hbase.util.ExceptionUtil.asInterrupt(ExceptionUtil.java:62)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:297)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1528)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1165)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1038)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithFallBack.getRegionLocations(RpcRetryingCallerWithFallBack.java:246)
    ... 10 more
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1470)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1740)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:29240)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1524)
    ... 13 more
{code}
The following fragment solves the issue for me. Basically, we just rethrow
InterruptedIOException instead of wrapping it. Can you take a look?
{code}
   private RegionLocations getRegionLocations(boolean useCache)
-      throws RetriesExhaustedException, DoNotRetryIOException {
+      throws RetriesExhaustedException, DoNotRetryIOException,
+      InterruptedIOException {
     RegionLocations rl;
     try {
       rl = cConnection.locateRegion(tableName, get.getRow(), useCache, true);
+    } catch (DoNotRetryIOException e) {
+      throw e;
+    } catch (RetriesExhaustedException e) {
+      throw e;
+    } catch (InterruptedIOException e) {
+      throw e;
     } catch (IOException e) {
-      if (e instanceof DoNotRetryIOException) {
-        throw (DoNotRetryIOException) e;
-      } else if (e instanceof RetriesExhaustedException) {
-        throw (RetriesExhaustedException) e;
-      } else {
-        throw new RetriesExhaustedException("Can't get the location", e);
-      }
+      throw new RetriesExhaustedException("Can't get the location", e);
     }
     if (rl == null) {
       throw new RetriesExhaustedException("Can't get the locations");
{code}
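To illustrate the idea outside of HBase, here is a minimal, self-contained sketch of the same rethrow pattern. The class LocationLookupSketch, the Locator interface, the classify() helper, and the nested RetriesExhaustedException are hypothetical stand-ins written for this example; they are not the real HBase classes. The sketch only shows that an InterruptedIOException passes through unchanged while any other IOException is still wrapped, so a caller can tell "we were cancelled" apart from "the lookup failed".

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class LocationLookupSketch {
    /** Hypothetical stand-in for HBase's RetriesExhaustedException. */
    static class RetriesExhaustedException extends IOException {
        RetriesExhaustedException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    /** Hypothetical stand-in for the region location lookup. */
    interface Locator {
        String locate() throws IOException;
    }

    /** Rethrows interrupts as-is; wraps every other IOException. */
    static String getRegionLocations(Locator locator) throws IOException {
        try {
            return locator.locate();
        } catch (RetriesExhaustedException | InterruptedIOException e) {
            // Already meaningful to the caller: do not wrap.
            throw e;
        } catch (IOException e) {
            throw new RetriesExhaustedException("Can't get the location", e);
        }
    }

    /** Shows how a caller can now distinguish the two outcomes. */
    static String classify(Locator locator) {
        try {
            return "ok:" + getRegionLocations(locator);
        } catch (InterruptedIOException e) {
            return "interrupted";        // cancelled call, server is not at fault
        } catch (RetriesExhaustedException e) {
            return "retries-exhausted";  // genuine lookup failure
        } catch (IOException e) {
            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> { throw new InterruptedIOException("cancelled"); }));
        System.out.println(classify(() -> { throw new IOException("connection closing"); }));
    }
}
```

With the old instanceof chain, the first call would have ended up in the "retries-exhausted" branch as well, which is what makes the interrupted lookup look like a server failure.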
I am also running into another problem: *a lot* of meta cache entries are
being evicted although there is no CM running:
{code}
2014-02-21 06:55:29,678 DEBUG [htable-pool89-t3]
client.ConnectionManager$HConnectionImplementation: locateRegionInMeta
parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740,
hostname=hor8n04.gq1.ygridcore.net,60020,1392965421232, seqNum=0}, attempt=4 of
35 failed; retrying after sleep of 1005 because: IPC Client (1480703560)
connection to hor8n04.gq1.ygridcore.net/68.142.245.215:60020 from hrt_qa is
closing
{code}
Still looking into the root cause : )
> Failover RPC's from client using region replicas
> ------------------------------------------------
>
> Key: HBASE-10355
> URL: https://issues.apache.org/jira/browse/HBASE-10355
> Project: HBase
> Issue Type: Sub-task
> Components: Client
> Reporter: Enis Soztutar
> Assignee: Nicolas Liochon
> Fix For: 0.99.0
>
> Attachments: 10355.v1.patch, 10355.v2.patch, 10355.v3.patch
>
>