[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670062#comment-13670062 ]

stack commented on HBASE-6364:
------------------------------

[~nkeywal] What is supposed to happen when the FailedServerException is thrown?

The below is hard to read -- it is out of our loadtesttool, which is used a lot 
in hbase-it.

A loadtesttool thread is failing with a FailedServerException... "This server 
is in the failed servers list". Reading the above, I thought we were supposed 
to pause and retry, but if you notice, we are doing connection setup using 
withoutRetries... because we are inside a Process.
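
For context on what I expected: here is a minimal, illustrative sketch of how I 
read the failed-servers-list idea -- a connection failure puts the address in a 
short-lived list, and any thread that tries to connect to that address inside 
the window fails immediately with FailedServerException instead of paying the 
connect timeout again. The class and field names below are my own for 
illustration, not the actual RpcClient internals.

{code}
// Illustrative only: names and the window length are assumptions, not the
// actual RpcClient/FailedServers implementation.
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

class FailedServersSketch {
  private final Map<String, Long> failed = new HashMap<String, Long>();
  private final long windowMs;

  FailedServersSketch(long windowMs) {
    this.windowMs = windowMs;
  }

  // Record a connection failure so other threads can fail fast for a while.
  synchronized void addFailedServer(InetSocketAddress addr) {
    failed.put(addr.toString(), System.currentTimeMillis() + windowMs);
  }

  // Checked before setting up a new connection: if the address is still in
  // the window, the caller throws FailedServerException right away instead
  // of waiting out the connect timeout again.
  synchronized boolean isFailedServer(InetSocketAddress addr) {
    Long expiry = failed.get(addr.toString());
    if (expiry == null) {
      return false;
    }
    if (System.currentTimeMillis() > expiry) {
      failed.remove(addr.toString());
      return false;
    }
    return true;
  }
}
{code}

If that reading is right, the question above stands: when the exception comes 
up through withoutRetries, what is supposed to do the pausing and retrying?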

Any input would help.  Sorry for being short.  Have to run (I'm trying to 
figure out failing hbase-it tests up on ec2 and on internal jenkins).  Thanks.

{code}
2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_3] util.MultiThreadedWriter(191): Failed to insert: 48349; region information: cached: region=IntegrationTestDataIngestWithChaosMonkey,8cccccc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd., hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for [979402d0d20fb0f8ded281a8b8687ab9-48349]java.io.IOException: Call to a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
        ...
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
        at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
        ... 15 more

2013-05-27 17:08:53,193 ERROR [HBaseWriterThread_0] util.MultiThreadedWriter(191): Failed to insert: 48366; region information: cached: region=IntegrationTestDataIngestWithChaosMonkey,8cccccc4,1369699538187.6b2be2c004e633f8eed7de8aff6f7cfd., hostname=a1007.halxg.cloudera.com,53752,1369699532487, seqNum=214684; cache is up to date; errors: Error from [a1007.halxg.cloudera.com:53752] for [92873a55c54f98db38508ba065852cc5-48366]java.io.IOException: Call to a1007.halxg.cloudera.com/10.20.184.107:53752 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1368)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1340)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1540)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1597)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:21403)
        at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:102)
        at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:43)
        at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:250)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1993)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$7.call(HConnectionManager.java:1988)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2473)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$Process$1.call(HConnectionManager.java:2463)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1007.halxg.cloudera.com/10.20.184.107:53752
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:798)
        at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1422)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1314)
        ... 15 more


{code}
                
> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6364
>                 URL: https://issues.apache.org/jira/browse/HBASE-6364
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.90.6, 0.92.1, 0.94.0
>            Reporter: Suraj Varma
>            Assignee: Nicolas Liochon
>              Labels: client
>             Fix For: 0.94.2
>
>         Attachments: 6364.94.v2.nolargetest.patch, 
> 6364.94.v2.nolargetest.security-addendum.patch, 
> 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 
> 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 
> 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 
> 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, 
> stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. table.
> Workaround: Decrease ipc.socket.timeout on the HBase client side to a low 
> value (the default of 20s leads to a roughly 35 minute recovery time; we got 
> acceptable results with 100ms, giving a roughly 3 minute recovery).
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. table (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for the cluster to reassign the .META. table and 
> for client threads to re-lookup and re-orient to the lesser cluster (minus 
> the RS and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to 
> the dead RS (until the socket times out after 20s), retry, and then the next 
> thread gets in, and so forth in a serial manner.
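
(To make the serialization described above concrete, here is a minimal, 
illustrative Java sketch; the method and field names are assumptions, not the 
actual HBaseClient code. Each queued thread holds the lock for up to the full 
connect timeout before the next one gets its turn.)

{code}
// Illustrative sketch of the observed behavior, not the real HBaseClient:
// every thread needing the connection funnels through one synchronized
// method, and each one can spend the full connect timeout inside it.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ConnectionSetupSketch {
  private Socket socket;

  synchronized void setupIOstreams(InetSocketAddress remote, int connectTimeoutMs)
      throws IOException {
    if (socket != null) {
      return; // some earlier thread already connected
    }
    Socket s = new Socket();
    // With a 20s timeout and a powered-off host, each queued thread waits up
    // to 20s here in turn, so total recovery time grows with the thread count.
    s.connect(remote, connectTimeoutMs);
    socket = s;
  }
}
{code}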
> Workaround:
> -------------------
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via NetUtils.connect and should only ever come into play 
> when connectivity to a region server is lost and needs to be re-established, 
> i.e. it does not affect normal "RPC" activity, as this is just the connect 
> timeout.
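
(As an illustration of the workaround above, the sketch below builds a client 
Configuration with a lowered ipc.socket.timeout, using the 100 ms figure quoted 
in this report; whether such a low value is safe on a given network is an 
assumption to verify, and the same property can equally be set in the 
client-side hbase-site.xml.)

{code}
// Hedged sketch of the client-side workaround described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LowConnectTimeoutConf {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Default is 20000 ms; 100 ms is the value quoted in this report.
    conf.setInt("ipc.socket.timeout", 100);
    return conf;
  }
}
{code}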
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
