Wei-Chiu Chuang created HDFS-12737:
--------------------------------------
Summary: Thousands of sockets lingering in TIME_WAIT state due to
frequent file open operations
Key: HDFS-12737
URL: https://issues.apache.org/jira/browse/HDFS-12737
Project: Hadoop HDFS
Issue Type: Bug
Components: ipc
Environment: CDH5.10.2, HBase Multi-WAL=2, 250 replication peers
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
On an HBase cluster, we found that the HBase RegionServers had thousands of sockets in
TIME_WAIT state. This depleted system resources and caused other services to fail.
After months of troubleshooting, we traced the issue to the cluster having hundreds
of replication peers combined with multi-WAL = 2. That creates hundreds of
replication threads in the HBase RS, and each thread opens a WAL file *every second*.
We found that the IPC client closes the socket right away and does not reuse
the socket connection. Since each closed socket stays in TIME_WAIT state for 60
seconds on Linux by default, this generates thousands of TIME_WAIT sockets.
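As a rough sanity check on the numbers above, the steady-state count of TIME_WAIT sockets is approximately the socket close rate multiplied by the time each socket lingers. The sketch below is only a back-of-the-envelope estimate: the class and method names are made up for illustration, and it assumes one replication thread per peer per WAL and exactly one closed socket per file open, neither of which has been verified against the HBase code.

```java
// Back-of-the-envelope estimate of steady-state TIME_WAIT sockets on one RegionServer.
// Hypothetical helper for illustration only; not part of Hadoop or HBase.
final class TimeWaitEstimate {

    // Sockets in TIME_WAIT ~= (sockets closed per second) * (seconds each one lingers).
    static long steadyStateTimeWait(int threads, double opensPerThreadPerSecond,
                                    int timeWaitSeconds) {
        return Math.round(threads * opensPerThreadPerSecond * timeWaitSeconds);
    }

    public static void main(String[] args) {
        int peers = 250;       // from the Environment field of this issue
        int walsPerPeer = 2;   // multi-WAL = 2
        // ASSUMPTION: one replication thread per peer per WAL.
        int threads = peers * walsPerPeer;
        // ASSUMPTION: each WAL open closes exactly one socket; 60 s is the Linux TIME_WAIT period.
        System.out.println(steadyStateTimeWait(threads, 1.0, 60)); // prints 30000
    }
}
```

Under those assumptions the estimate lands in the tens of thousands, which is consistent in order of magnitude with the "thousands of sockets" observed.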
{code:title=ClientDatanodeProtocolTranslatorPB:createClientDatanodeProtocolProxy}
// Since we're creating a new UserGroupInformation here, we know that no
// future RPC proxies will be able to re-use the same connection. And
// usages of this proxy tend to be one-off calls.
//
// This is a temporary fix: callers should really achieve this by using
// RPC.stopProxy() on the resulting object, but this is currently not
// working in trunk. See the discussion on HDFS-1965.
Configuration confWithNoIpcIdle = new Configuration(conf);
confWithNoIpcIdle.setInt(CommonConfigurationKeysPublic
    .IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0);
{code}
Unfortunately, given HBase's usage pattern, this hack is what creates the problem.
Setting aside the fact that having hundreds of HBase replication peers is bad
practice (I'll probably file an HBASE jira to fix that), it does not seem right
that the Hadoop IPC client does not reuse sockets. The relevant code is
historical and deep in the stack, so I'd like to invite comments. I have a
patch, but it's pretty hacky.
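To illustrate why a max idle time of 0 defeats connection reuse for a caller that makes one call per second, here is a toy model. It is not Hadoop's IPC client: the class and its methods are hypothetical, it tracks a single cached connection, and it deliberately ignores the UserGroupInformation issue mentioned in the code comment above (a fresh UGI per proxy would prevent reuse regardless of the idle setting). The 10,000 ms value is what I believe the default for `ipc.client.connection.maxidletime` to be; treat it as an assumption.

```java
// Toy model of a single cached connection that is closed after being idle for maxIdleMs.
// Hypothetical illustration only; this is NOT how Hadoop's ipc.Client is implemented.
final class IdleTimeoutModel {

    // Returns how many new connections (sockets) are opened to serve calls at the given times.
    static int connectionsOpened(long[] callTimesMs, long maxIdleMs) {
        int opened = 0;
        boolean haveConnection = false;
        long lastUsedMs = 0;
        for (long now : callTimesMs) {
            // The cached connection is torn down once it has been idle for at least maxIdleMs.
            // With maxIdleMs == 0 this is always true, so every call needs a fresh socket.
            if (haveConnection && now - lastUsedMs >= maxIdleMs) {
                haveConnection = false;
            }
            if (!haveConnection) {
                opened++;
                haveConnection = true;
            }
            lastUsedMs = now;
        }
        return opened;
    }

    // Call times for n calls spaced one second apart: 0, 1000, 2000, ...
    static long[] everySecond(int n) {
        long[] times = new long[n];
        for (int i = 0; i < n; i++) {
            times[i] = i * 1000L;
        }
        return times;
    }

    public static void main(String[] args) {
        long[] oneMinute = everySecond(60);
        System.out.println(connectionsOpened(oneMinute, 0));      // prints 60
        System.out.println(connectionsOpened(oneMinute, 10_000)); // prints 1
    }
}
```

In this model, one thread calling once per second opens 60 sockets per minute with an idle time of 0, versus a single socket when the connection is allowed to stay idle for longer than the call interval.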