[
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670062#comment-13670062
]
stack commented on HBASE-6364:
------------------------------
[~nkeywal] What is supposed to happen when the FailedServerException is thrown?
The below is hard to read -- it is out of our loadtesttool, which is used a lot in
hbase-it.
A loadtesttool thread is failing with a FailedServerException: "This server
is in the failed servers list". Reading the above, I thought we were supposed
to pause and retry, but if you notice, we are doing connection setup using
withoutRetries... because we are inside a Process.
Any input would help. Sorry for being short. Have to run (I'm trying to
figure out failing hbase-it tests up on ec2 and on internal jenkins). Thanks.
{code}
2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_3] util.MultiThreadedWriter(191): Failed to insert: 48349; region information: cached: region=IntegrationTestDataIngestWithChaosMonkey,8cccccc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd., hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for [979402d0d20fb0f8ded281a8b8687ab9-48349]java.io.IOException: Call to a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
    at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1340)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1540)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1597)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:21403)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:102)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:43)
    at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:250)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1993)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1988)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2473)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
    at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
    ... 15 more
2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_0] util.MultiThreadedWriter(191): Failed to insert: 48366; region information: cached: region=IntegrationTestDataIngestWithChaosMonkey,8cccccc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd., hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for [92873a55c54f98db38508ba065852cc5-48366]java.io.IOException: Call to a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
    at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1340)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1540)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1597)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:21403)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:102)
    at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:43)
    at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:250)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1993)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1988)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2473)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
    at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
    ... 15 more
{code}
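For what it's worth, the fast-fail behaviour the trace shows can be pictured with a minimal sketch of an expiring failed-servers list (illustrative names only, not the actual RpcClient internals): once a connect attempt to an address fails, further attempts within the expiry window throw immediately instead of waiting out the socket timeout again.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an expiring "failed servers" list. After a connect
// attempt to an address fails, later attempts within the expiry window are
// told to fail fast rather than block on another socket timeout.
public class FailedServersSketch {
  private final Map<String, Long> failedUntil = new ConcurrentHashMap<>();
  private final long expiryMs;

  public FailedServersSketch(long expiryMs) {
    this.expiryMs = expiryMs;
  }

  /** Record a failed connection attempt to hostPort. */
  public void addToFailedServers(String hostPort) {
    failedUntil.put(hostPort, System.currentTimeMillis() + expiryMs);
  }

  /** True if hostPort failed recently enough that callers should fail fast. */
  public boolean isFailedServer(String hostPort) {
    Long until = failedUntil.get(hostPort);
    if (until == null) {
      return false;
    }
    if (System.currentTimeMillis() > until) {
      failedUntil.remove(hostPort); // entry expired, allow a fresh attempt
      return false;
    }
    return true;
  }
}
{code}
With connection setup running under withoutRetries (inside Process, as in the trace), an exception thrown from a check like that would surface straight to the caller rather than triggering a pause-and-retry.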
> Powering down the server host holding the .META. table causes HBase Client to
> take excessively long to recover and connect to reassigned .META. table
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 0.90.6, 0.92.1, 0.94.0
> Reporter: Suraj Varma
> Assignee: Nicolas Liochon
> Labels: client
> Fix For: 0.94.2
>
> Attachments: 6364.94.v2.nolargetest.patch,
> 6364.94.v2.nolargetest.security-addendum.patch,
> 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch,
> 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch,
> 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch,
> 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch,
> stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered
> down on a live cluster, the HBase cluster itself detects and reassigns
> the .META. table, but connected HBase clients take an excessively long time to
> detect this and re-discover the reassigned .META.
> Workaround: Decrease the ipc.socket.timeout on the HBase client side to a low
> value (the default of 20s led to a 35 minute recovery time; we got acceptable
> results with 100ms, giving a 3 minute recovery).
> This was found during some hardware failure testing scenarios.
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ...
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to
> recover (i.e. for the thread count to go back to normal) - no client calls
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this
> synchronized method was blocked on NetUtils.connect(this.socket,
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock tries to connect to the
> dead RS (until the socket times out after 20s), retries, and then the next
> thread gets in, and so forth, in a serial manner.
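To make that serial recovery concrete, here is a minimal, hypothetical sketch of the pattern described above (illustrative names, not the actual oahh.ipc.HBaseClient code): a single lock guards connection setup, so each queued thread in turn blocks in connect() for the full timeout against the dead host.
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration of the bottleneck: connection setup is
// serialized behind one lock, and each thread that acquires the lock blocks
// in connect() for up to the full connect timeout before it fails against
// the powered-off host.
public class SerializedConnectSketch {
  private final Object connectLock = new Object();
  private final int connectTimeoutMs = 20000; // default ipc.socket.timeout

  public Socket setupIOstreams(InetSocketAddress deadServer) throws IOException {
    synchronized (connectLock) {
      Socket socket = new Socket();
      // With N threads queued on connectLock, the worst case is roughly
      // N * 20 s before all of them have failed over to the reassigned .META.
      socket.connect(deadServer, connectTimeoutMs);
      return socket;
    }
  }
}
{code}
At the default 20 s connect timeout, a queue of N waiting threads drains in roughly N x 20 s, which is consistent with the long recovery reported above.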
> Workaround:
> -------------------
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number
> (1000 ms, 100 ms, etc) in the client-side hbase-site.xml. With this setting,
> the client threads recovered in a couple of minutes by failing fast and
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial
> "HConnection" setup via NetUtils.connect and should only come into play
> when connectivity to a region server is lost and needs to be re-established,
> i.e. it does not affect normal "RPC" activity, as this is just the connect
> timeout.
> During RS GC pauses, any _new_ clients trying to connect will fail and will
> require .META. table re-lookups.
> The above timeout workaround applies only to the HBase client side.
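For reference, the workaround described above amounts to something like the following on the client side. This is a minimal sketch using the standard Hadoop/HBase Configuration API; overriding the property programmatically is equivalent to setting it in the client's hbase-site.xml.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Illustrative client-side override of the connect timeout. Per the report
// above, the 20000 ms default gave a ~35 minute recovery, while 100 ms
// brought recovery down to roughly 3 minutes.
public class LowConnectTimeoutClientConf {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("ipc.socket.timeout", 100); // milliseconds
    return conf;
  }
}
{code}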
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira