Suraj Varma created HBASE-6364:
----------------------------------

             Summary: Powering down the server host holding the .META. table 
causes HBase Client to take excessively long to recover and connect to 
reassigned .META. table
                 Key: HBASE-6364
                 URL: https://issues.apache.org/jira/browse/HBASE-6364
             Project: HBase
          Issue Type: Bug
          Components: client
    Affects Versions: 0.94.0, 0.92.1, 0.90.6
            Reporter: Suraj Varma


When a server host with a Region Server holding the .META. table is powered 
down on a live cluster, the HBase cluster itself detects the failure and 
reassigns the .META. table, but connected HBase clients take an excessively 
long time to detect this and re-discover the reassigned .META. location.

Workaround: Decrease ipc.socket.timeout on the HBase client side to a low 
value (the default of 20s led to a ~35 minute recovery time; we got acceptable 
results with 100ms, which brought recovery down to ~3 minutes).

This was found during some hardware failure testing scenarios. 

Test Case:
1) Apply load via client app on HBase cluster for several minutes
2) Power down the host of the Region Server holding the .META. table (i.e. 
power off ... and keep it off)
3) Measure how long it takes for the cluster to reassign the .META. table and 
for client threads to re-lookup and re-orient to the reduced cluster (minus 
the RS and DN on that host).

Observation:
1) Client threads spike up to the maxThreads size ... and take over 35 minutes 
to recover (i.e. for the thread count to return to normal) - no client calls 
are serviced - they just back up behind a synchronized method (see #2 below)

2) All the client app threads queue up behind the 
oahh.ipc.HBaseClient#setupIOStreams method (http://tinyurl.com/7js53dj)

After taking several thread dumps, we found that the thread inside this 
synchronized method was blocked on NetUtils.connect(this.socket, 
remoteId.getAddress(), getSocketTimeout(conf));

The client thread holding the synchronized lock tries to connect to the dead 
RS until the socket times out (20s by default), retries, and only then does 
the next thread acquire the lock and repeat the same attempt - so the waiting 
threads drain serially.
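
To illustrate the shape of the bottleneck, a simplified sketch (this is not 
the actual HBaseClient source; the class and field names here are invented 
for the example):

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Simplified sketch of the serialized-connect bottleneck described above.
class ConnectionBottleneckSketch {
    private Socket socket;
    private final InetSocketAddress remoteAddress;
    private final int connectTimeoutMs; // ipc.socket.timeout, 20000 ms default

    ConnectionBottleneckSketch(InetSocketAddress remote, int connectTimeoutMs) {
        this.remoteAddress = remote;
        this.connectTimeoutMs = connectTimeoutMs;
    }

    // Because this method is synchronized, every caller queues here. When the
    // remote host is powered off, the thread holding the lock blocks inside
    // connect() for the full connectTimeoutMs before the next thread can even
    // try, so N queued threads can pay roughly N * connectTimeoutMs in total.
    synchronized void setupIOStreams() throws IOException {
        if (socket != null && socket.isConnected()) {
            return; // connection already established
        }
        socket = new Socket();
        // Corresponds to the NetUtils.connect(...) call quoted above.
        socket.connect(remoteAddress, connectTimeoutMs);
    }
}
{code}

This is consistent with the recovery times observed: if, say, ~100 queued 
threads each pay a 20s connect timeout serially, the queue takes over half an 
hour to drain.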

Workaround:
-------------------
Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
(1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
the client threads recovered in a couple of minutes by failing fast and 
re-discovering the .META. table on a reassigned RS.
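
As a minimal sketch of applying the same override programmatically (assuming 
an HBase 0.9x-era client; the key can equally be set in the client-side 
hbase-site.xml as described above):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LowConnectTimeoutClient {
    public static void main(String[] args) {
        // Loads hbase-default.xml / hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        // Default is 20000 ms; fail fast so a dead host is detected quickly
        // and the client re-discovers .META. on its new Region Server.
        conf.setInt("ipc.socket.timeout", 100);
        // ... build HTable / HConnection instances from this conf as usual ...
    }
}
{code}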

Assumption: This ipc.socket.timeout is only ever used during the initial 
"HConnection" setup via NetUtils.connect, and so only comes into play when 
connectivity to a region server is lost and needs to be re-established, i.e. 
it does not affect normal "RPC" activity, as it is just the connect timeout. 
During RS GC pauses, any _new_ clients trying to connect will fail and will 
require .META. table re-lookups.
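
To make the connect-vs-RPC distinction concrete, a small sketch using plain 
java.net rather than HBase internals (the host and values are placeholders):

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TimeoutScopes {
    public static void main(String[] args) throws IOException {
        Socket socket = new Socket();
        // Analogous to ipc.socket.timeout: bounds only the TCP handshake,
        // e.g. while connecting to a powered-off host. May throw
        // SocketTimeoutException if the host is unreachable within 100 ms.
        socket.connect(new InetSocketAddress("example.org", 80), 100);
        // Analogous to the RPC-level read timeout: applies to reads on the
        // established connection and is unaffected by the connect timeout.
        socket.setSoTimeout(20000);
        socket.close();
    }
}
{code}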

The above timeout workaround applies only to the HBase client side.
