[
https://issues.apache.org/jira/browse/HDFS-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115304#comment-14115304
]
Yongjun Zhang commented on HDFS-6973:
-------------------------------------
Hi [~stevenxu], thanks for finding and reporting the issue. HBASE-9393
reported the same issue as yours; however, that jira is still unresolved. I
saw [~cmccabe] did some analysis and made a suggestion there (Colin Patrick
McCabe added a comment - 11/Oct/13 19:25), which makes sense to me. I wonder
if you could try what he recommended there? Thanks.
> DFSClient does not close a closed socket, resulting in thousands of
> CLOSE_WAIT sockets
> --------------------------------------------------------------------------------------
>
> Key: HDFS-6973
> URL: https://issues.apache.org/jira/browse/HDFS-6973
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 2.4.0
> Environment: RHEL 6.3 -HDP 2.1 -6 RegionServers/Datanode -18T per
> node -3108Regions
> Reporter: steven xu
>
> HBase as an HDFS client does not close dead connections to the datanode.
> This results in over 30K+ CLOSE_WAIT sockets, and at some point HBase cannot
> connect to the datanode because there are too many mapped sockets from one
> host to another on the same port: 50010.
> Even after I restart all the RSs, the count of CLOSE_WAIT keeps growing:
> {noformat}
> $ netstat -an | grep CLOSE_WAIT | wc -l
> 2545
> $ netstat -nap | grep CLOSE_WAIT | grep 6569 | wc -l
> 2545
> $ ps -ef | grep 6569
> hbase 6569 6556 21 Aug25 ? 09:52:33 /opt/jdk1.6.0_25/bin/java
>     -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
>     -XX:+UseConcMarkSweepGC
> {noformat}
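> For context, CLOSE_WAIT means the remote side has already closed the
> connection, but the local process has not yet called close() on its socket.
> A minimal standalone Java sketch (illustration only, not HDFS code;
> hypothetical class name) that reproduces this state:
> {code:title=CloseWaitDemo.java|borderStyle=solid}
> import java.net.ServerSocket;
> import java.net.Socket;
>
> public class CloseWaitDemo {
>   public static void main(String[] args) throws Exception {
>     // Connect a client socket to a local server socket.
>     ServerSocket server = new ServerSocket(0);
>     Socket client = new Socket("localhost", server.getLocalPort());
>     Socket accepted = server.accept();
>
>     // The remote end closes first: the client receives a FIN.
>     accepted.close();
>     Thread.sleep(1000);
>
>     // 'client' now sits in CLOSE_WAIT until client.close() is called --
>     // the same state the leaked DFSClient connections are stuck in.
>     // Verify with: netstat -an | grep CLOSE_WAIT
>     Thread.sleep(60000);
>   }
> }
> {code}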
> I have also reviewed these issues:
> [HDFS-5697]
> [HDFS-5671]
> [HDFS-1836]
> [HBASE-9393]
> I found that the patches from these issues have already been applied in the
> HBase 0.98/Hadoop 2.4.0 source code.
> But I do not understand why HBase 0.98/Hadoop 2.4.0 still has this issue.
> Please check. Thanks a lot.
> The code below has been added to
> BlockReaderFactory.getRemoteBlockReaderFromTcp(). Maybe another bug leads to
> my problem:
> {code:title=BlockReaderFactory.java|borderStyle=solid}
>   // Some comments here
>   private BlockReader getRemoteBlockReaderFromTcp() throws IOException {
>     if (LOG.isTraceEnabled()) {
>       LOG.trace(this + ": trying to create a remote block reader from a " +
>           "TCP socket");
>     }
>     BlockReader blockReader = null;
>     while (true) {
>       BlockReaderPeer curPeer = null;
>       Peer peer = null;
>       try {
>         curPeer = nextTcpPeer();
>         if (curPeer == null) break;
>         if (curPeer.fromCache) remainingCacheTries--;
>         peer = curPeer.peer;
>         blockReader = getRemoteBlockReader(peer);
>         return blockReader;
>       } catch (IOException ioe) {
>         if (isSecurityException(ioe)) {
>           if (LOG.isTraceEnabled()) {
>             LOG.trace(this + ": got security exception while constructing " +
>                 "a remote block reader from " + peer, ioe);
>           }
>           throw ioe;
>         }
>         if ((curPeer != null) && curPeer.fromCache) {
>           // Handle an I/O error we got when using a cached peer. These are
>           // considered less serious, because the underlying socket may be
>           // stale.
>           if (LOG.isDebugEnabled()) {
>             LOG.debug("Closed potentially stale remote peer " + peer, ioe);
>           }
>         } else {
>           // Handle an I/O error we got when using a newly created peer.
>           LOG.warn("I/O error constructing remote block reader.", ioe);
>           throw ioe;
>         }
>       } finally {
>         if (blockReader == null) {
>           IOUtils.cleanup(LOG, peer);
>         }
>       }
>     }
>     return null;
>   }
> {code}
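> As I read this code, once getRemoteBlockReader(peer) succeeds, the finally
> block no longer closes the peer (blockReader != null), so the socket is only
> closed if the returned BlockReader's close() eventually runs. A minimal
> sketch of that ownership handoff (hypothetical types, not the real HDFS
> classes):
> {code:title=PeerOwnershipSketch.java|borderStyle=solid}
> import java.io.Closeable;
> import java.io.IOException;
>
> class Peer implements Closeable {
>   @Override
>   public void close() throws IOException { /* closes the TCP socket */ }
> }
>
> class BlockReader implements Closeable {
>   private final Peer peer;
>   BlockReader(Peer peer) { this.peer = peer; }
>   @Override
>   public void close() throws IOException {
>     // Ownership: closing the reader is the only thing that closes the peer.
>     peer.close();
>   }
> }
>
> class Caller {
>   static void readBlock(Peer peer) throws IOException {
>     // If any code path skips closing the reader, the peer's socket is never
>     // closed and accumulates in CLOSE_WAIT.
>     BlockReader reader = new BlockReader(peer);
>     try {
>       // ... read block data ...
>     } finally {
>       reader.close();
>     }
>   }
> }
> {code}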
--
This message was sent by Atlassian JIRA
(v6.2#6252)