[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670062#comment-13670062 ]

stack commented on HBASE-6364:
------------------------------

[~nkeywal] What is supposed to happen when the FailedServerException is thrown?

The below is hard to read -- it is out of our loadtesttool, which is used a lot 
in hbase-it.

A loadtesttool thread is failing with a FailedServerException... "This server 
is in the failed servers list". Reading the above, I thought we were supposed 
to pause and retry, but if you notice, we are doing connection setup using 
withoutRetries... because we are inside a Process.
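
For context on what I expected: here is a minimal, illustrative sketch of how I 
read the failed-servers-list idea -- a connection failure puts the address in a 
short-lived list, and any thread that tries to connect to that address inside 
the window fails immediately with FailedServerException instead of paying the 
connect timeout again. The class and field names below are my own for 
illustration, not the actual RpcClient internals.

{code}
// Illustrative only: names and the window length are assumptions, not the
// actual RpcClient/FailedServers implementation.
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

class FailedServersSketch {
  private final Map<String, Long> failed = new HashMap<String, Long>();
  private final long windowMs;

  FailedServersSketch(long windowMs) {
    this.windowMs = windowMs;
  }

  // Record a connection failure so other threads can fail fast for a while.
  synchronized void addFailedServer(InetSocketAddress addr) {
    failed.put(addr.toString(), System.currentTimeMillis() + windowMs);
  }

  // Checked before setting up a new connection: if the address is still in
  // the window, the caller throws FailedServerException right away instead
  // of waiting out the connect timeout again.
  synchronized boolean isFailedServer(InetSocketAddress addr) {
    Long expiry = failed.get(addr.toString());
    if (expiry == null) {
      return false;
    }
    if (System.currentTimeMillis() > expiry) {
      failed.remove(addr.toString());
      return false;
    }
    return true;
  }
}
{code}

If that reading is right, the question above stands: when the exception comes 
up through withoutRetries, what is supposed to do the pausing and retrying?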

Any input would help.  Sorry for being short.  Have to run (I'm trying to 
figure out failing hbase-it tests up on ec2 and on internal jenkins).  Thanks.

{code}
2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_3] util.MultiThreadedWriter(191): Failed to insert: 48349; region information: cached: region=IntegrationTestDataIngestWithChaosMonkey,8cccccc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd., hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for [979402d0d20fb0f8ded281a8b8687ab9-48349]java.io.IOException: Call to a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
        ...
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
        at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
        ... 15 more

2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_0] util.MultiThreadedWriter(191): Failed to insert: 48366; region information: cached: region=IntegrationTestDataIngestWithChaosMonkey,8cccccc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd., hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for [92873a55c54f98db38508ba065852cc5-48366]java.io.IOException: Call to a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1340)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1540)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1597)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:21403)
        at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:102)
        at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:43)
        at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:250)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1993)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1988)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2473)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
        at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
        ... 15 more


{code}
                
> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6364
>                 URL: https://issues.apache.org/jira/browse/HBASE-6364
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.90.6, 0.92.1, 0.94.0
>            Reporter: Suraj Varma
>            Assignee: Nicolas Liochon
>              Labels: client
>             Fix For: 0.94.2
>
>         Attachments: 6364.94.v2.nolargetest.patch, 
> 6364.94.v2.nolargetest.security-addendum.patch, 
> 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
> 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
> 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
> 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
> stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. table.
> Workaround: Decrease ipc.socket.timeout on the HBase client side to a low 
> value (the default of 20s leads to a roughly 35 minute recovery time; we got 
> acceptable results with 100ms, giving a roughly 3 minute recovery).
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. table (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for the cluster to reassign the .META. table and 
> for client threads to re-lookup and re-orient to the lesser cluster (minus 
> the RS and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to 
> the dead RS (until the socket times out after 20s), retry, and then the next 
> thread gets in, and so forth in a serial manner.
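
(To make the serialization described above concrete, here is a minimal, 
illustrative Java sketch; the method and field names are assumptions, not the 
actual HBaseClient code. Each queued thread holds the lock for up to the full 
connect timeout before the next one gets its turn.)

{code}
// Illustrative sketch of the observed behavior, not the real HBaseClient:
// every thread needing the connection funnels through one synchronized
// method, and each one can spend the full connect timeout inside it.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ConnectionSetupSketch {
  private Socket socket;

  synchronized void setupIOstreams(InetSocketAddress remote, int connectTimeoutMs)
      throws IOException {
    if (socket != null) {
      return; // some earlier thread already connected
    }
    Socket s = new Socket();
    // With a 20s timeout and a powered-off host, each queued thread waits up
    // to 20s here in turn, so total recovery time grows with the thread count.
    s.connect(remote, connectTimeoutMs);
    socket = s;
  }
}
{code}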
> Workaround:
> -------------------
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via NetUtils.connect and should only ever come into play 
> when connectivity to a region server is lost and needs to be re-established, 
> i.e. it does not affect normal "RPC" activity, as this is just the connect 
> timeout.
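
(As an illustration of the workaround above, the sketch below builds a client 
Configuration with a lowered ipc.socket.timeout, using the 100 ms figure quoted 
in this report; whether such a low value is safe on a given network is an 
assumption to verify, and the same property can equally be set in the 
client-side hbase-site.xml.)

{code}
// Hedged sketch of the client-side workaround described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LowConnectTimeoutConf {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Default is 20000 ms; 100 ms is the value quoted in this report.
    conf.setInt("ipc.socket.timeout", 100);
    return conf;
  }
}
{code}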
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
