[jira] Updated: (HBASE-1815) HBaseClient can get stuck in an infinite loop while attempting to contact a failed regionserver

stack (JIRA) Thu, 17 Sep 2009 17:14:22 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-1815:
-------------------------

    Attachment: ipctimeout.patch

This patch might be a bit radical, but here it goes.

The high-level motivation is to undo the retrying and sleeps down in ipc and 
let retrying be done at a higher level, up in the hbase client.

In ipc, socket setup had a timeout of 20 seconds.  Ipc then retries the socket 
setup ten times with a 1 second sleep in between.  That's 210 seconds or so 
before we time out down in the guts of RPC.  We then go up to the retry logic in 
hbase (usually, not always), and do ten retries with a 2 second pause in 
between.  (If we got a SocketTimeoutException setting up the connection, we'd 
retry a hard-coded 45 times; i.e. 15 minutes.)
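
To make the arithmetic above concrete, here is a rough sketch in Java.  The 
class and method names are made up for illustration and are not part of the 
patch or of hbase; only the numbers (20 second timeout, ten attempts, 1 second 
sleep, 45 retries, 2 second pause) come from the description above.  The 
"stacked" figure assumes every upper-layer retry hits the full ipc worst case, 
which is a pessimistic guess rather than something measured.

```java
public class RetryBudget {
    // Worst case for ipc socket setup: every attempt times out, and each
    // attempt is followed by a sleep.
    static long ipcWorstCaseSeconds(long connectTimeoutSec, int attempts, long sleepSec) {
        return attempts * (connectTimeoutSec + sleepSec);
    }

    // Worst case for the hard-coded SocketTimeoutException path: each retry
    // waits out the full connect timeout.
    static long socketTimeoutWorstCaseSeconds(long connectTimeoutSec, int retries) {
        return retries * connectTimeoutSec;
    }

    // Assumption: each upper-layer (hbase client) retry pays the full ipc
    // worst case plus the client pause.
    static long stackedWorstCaseSeconds(long ipcWorstCaseSec, int clientRetries, long clientPauseSec) {
        return clientRetries * (ipcWorstCaseSec + clientPauseSec);
    }

    public static void main(String[] args) {
        long ipc = ipcWorstCaseSeconds(20, 10, 1);               // 210 seconds
        long socketTimeout = socketTimeoutWorstCaseSeconds(20, 45); // 900 seconds, i.e. 15 minutes
        long stacked = stackedWorstCaseSeconds(ipc, 10, 2);      // 2120 seconds
        System.out.println("ipc worst case (s): " + ipc);
        System.out.println("SocketTimeoutException path (s): " + socketTimeout);
        System.out.println("stacked worst case (s): " + stacked);
    }
}
```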

In Justin's case, I don't think we were hitting SocketTimeoutException, going by 
the stack trace.  It was more the 210 seconds per thread, but my guess is that 
his thrift client had probably timed out already.

This patch turns off retry down in the ipc client (let the upper layers do 
retry), changes hard-coded sleep times to be hbase.client.pause time (2 
seconds), and removes the hard-coded 45 retries.  It also adds an hbase prefix 
to the ipc configuration parameters in case we want different values from hadoop.
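
The shape of "single attempt below, retry above" might look roughly like the 
following.  This is only a sketch of the idea, not the code in 
ipctimeout.patch: UpperLayerRetry and callWithRetries are hypothetical names, 
the Callable stands in for one ipc call that makes no retries of its own, and 
pauseMillis stands in for the hbase.client.pause value.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class UpperLayerRetry {
    // Runs singleAttempt up to maxRetries times, sleeping pauseMillis between
    // failed attempts.  The lower layer is assumed to fail fast with an
    // IOException instead of retrying internally.
    static <T> T callWithRetries(Callable<T> singleAttempt, int maxRetries, long pauseMillis)
            throws Exception {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be at least 1");
        }
        IOException last = null;
        for (int tries = 0; tries < maxRetries; tries++) {
            try {
                return singleAttempt.call();
            } catch (IOException e) {
                last = e;
                // No point sleeping after the final failed attempt.
                if (tries < maxRetries - 1) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        // Every attempt failed; surface the last failure to the caller.
        throw last;
    }
}
```

With ten retries and a 2 second pause, the time spent sleeping is bounded at 
about 18 seconds plus whatever each single attempt costs, and the whole policy 
lives in one place.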

Let me try out this patch.  My guess is that there are places in hbase where we 
don't retry because we were dependent on ipc doing retry for us.  Let me find 
those.

> HBaseClient can get stuck in an infinite loop while attempting to contact a 
> failed regionserver
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1815
>                 URL: https://issues.apache.org/jira/browse/HBASE-1815
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.20.0
>         Environment: Ubuntu Linux (Linux <elided> 2.6.24-23-generic #1 SMP 
> Wed Apr 1 21:43:24 UTC 2009 x86_64 GNU/Linux), java version "1.6.0_06", 
> Java(TM) SE Runtime Environment (build 1.6.0_06-b02), Java HotSpot(TM) 64-Bit 
> Server VM (build 10.0-b22, mixed mode)
>            Reporter: Justin Lynn
>             Fix For: 0.20.1
>
>         Attachments: ipctimeout.patch, thrift_server_log_excerpt, 
> thrift_server_threaddump, thrift_server_threaddump_1
>
>
> While using HBase Thrift server, if a regionserver goes down due to shutdown 
> or failure clients will timeout because the thrift server cannot contact the 
> dead regionserver.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
