[ https://issues.apache.org/jira/browse/HBASE-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1815:
-------------------------

    Attachment: ipctimeout.patch

This patch might be a bit radical, but here it goes.

The high-level motivation is to undo the retrying and sleeping down in ipc and let retrying be done at a higher level, up in the hbase client.

In ipc, socket setup had a timeout of 20 seconds. Ipc then retries the socket setup ten times with a 1 second sleep in between. That's 210 seconds or so before we time out down in the guts of RPC. We then go up to the retry logic in hbase (usually, not always) and do ten retries with a 2 second pause in between. (If a SocketTimeoutException was thrown while setting up the connection, we'd retry a hard-coded 45 times; i.e. 15 minutes.) In Justin's case, I don't think we were hitting SocketTimeoutException, going by the stack trace. It was more the 210 seconds per thread, but my guess is that his thrift client had probably timed out already.

This patch turns off retry down in the ipc client (let the upper layers do retry), changes hard-coded sleep times to be the hbase.client.pause time (2 seconds), and removes the hard-coded 45 retries. It also adds an hbase prefix to the ipc configuration parameters in case we want different values from hadoop.

Let me try out this patch. My guess is that there are places in hbase where we don't retry because we were dependent on ipc doing the retry for us. Let me find those.
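
To make the numbers in the comment above concrete, here is a rough Java sketch. It is not taken from ipctimeout.patch: the class RetryBudget and its methods are hypothetical names used only for illustration. The first method reproduces the timeout arithmetic (it counts one sleep after every attempt, which matches the "210 seconds or so" figure), and the second shows the general shape of doing all retrying in a single upper layer with a fixed pause, on the assumption that the lower layer makes exactly one attempt per call.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of HBase or of the attached patch. */
final class RetryBudget {

    /** Worst-case seconds when every attempt times out and is followed by a sleep. */
    static long worstCaseSeconds(long timeoutSeconds, int attempts, long sleepSeconds) {
        return attempts * (timeoutSeconds + sleepSeconds);
    }

    /**
     * Single retry layer: calls op up to maxAttempts times, pausing between
     * failed attempts. Assumes op itself does no retrying of its own.
     */
    static <T> T callWithRetries(Callable<T> op, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // stands in for hbase.client.pause
                }
            }
        }
        throw last; // every attempt failed; surface the last failure
    }

    public static void main(String[] args) throws Exception {
        // 10 socket setups at 20 s each, with a 1 s sleep per attempt.
        System.out.println(worstCaseSeconds(20, 10, 1)); // prints 210
        // 45 hard-coded retries at 20 s each is 900 s, i.e. 15 minutes.
        System.out.println(worstCaseSeconds(20, 45, 0)); // prints 900

        // A simulated call that fails twice, then succeeds on the third attempt.
        int[] calls = {0};
        String result = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection refused");
            }
            return "ok";
        }, 10, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
    }
}
```

With retrying kept in one place like this, the worst-case wait is bounded by one attempt count and one pause value rather than by the product of two nested retry loops.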

> HBaseClient can get stuck in an infinite loop while attempting to contact a failed regionserver
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1815
>                 URL: https://issues.apache.org/jira/browse/HBASE-1815
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.20.0
>         Environment: Ubuntu Linux (Linux <elided> 2.6.24-23-generic #1 SMP Wed Apr 1 21:43:24 UTC 2009 x86_64 GNU/Linux), java version "1.6.0_06", Java(TM) SE Runtime Environment (build 1.6.0_06-b02), Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)
>            Reporter: Justin Lynn
>             Fix For: 0.20.1
>
>         Attachments: ipctimeout.patch, thrift_server_log_excerpt, thrift_server_threaddump, thrift_server_threaddump_1
>
>
> While using HBase Thrift server, if a regionserver goes down due to shutdown or failure clients will timeout because the thrift server cannot contact the dead regionserver.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.