[
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792915#comment-13792915
]
Colin Patrick McCabe commented on HBASE-9393:
---------------------------------------------
I looked into this issue. I found a few things:
The HDFS socket cache is too small by default and its cached sockets expire too
quickly. The default capacity is 16, but HBase seems to open many more
connections to the DN than that. In that situation, sockets inevitably get
opened and then discarded, leaving them in {{CLOSE_WAIT}}.
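For reference, both the capacity and the expiry are client-side settings. Here is a
minimal sketch of raising them programmatically, assuming the Hadoop 2.x property
names {{dfs.client.socketcache.capacity}} and {{dfs.client.socketcache.expiryMsec}}
and made-up values; check hdfs-default.xml for your release before relying on them.
In practice these would normally go in the client's hdfs-site.xml rather than being
set in code.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SocketCacheTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical values; property names are the Hadoop 2.x ones,
    // verify against hdfs-default.xml for your version.
    conf.setInt("dfs.client.socketcache.capacity", 256);      // default 16
    conf.setLong("dfs.client.socketcache.expiryMsec", 30000); // default 3000

    // Any FileSystem created from this Configuration picks up the settings.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Using " + fs.getUri());
  }
}
{code}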
When you use positional read (aka {{pread}}), we grab a socket from the cache,
read from it, and then immediately put it back. When you seek and then call
{{read}}, we don't put the socket back at the end. The assumption behind the
normal {{read}} method is that you are probably going to call {{read}} again,
so it holds on to the socket until something else comes up (such as closing the
stream). In many scenarios, this can lead to {{seek+read}} generating more
sockets in {{CLOSE_WAIT}} than {{pread}}.
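To illustrate the difference, here is a minimal sketch against the public
{{FSDataInputStream}} API (the path and offsets are made up): the positional
{{read(long, byte[], int, int)}} variant is the {{pread}} path, while {{seek}}
followed by the plain {{read}} is the stateful path that keeps the socket checked
out of the cache.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadPatterns {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/hbase/example/hfile");  // hypothetical path
    byte[] buf = new byte[4096];

    try (FSDataInputStream in = fs.open(path)) {
      // pread: reads at an explicit offset and does not move the stream
      // position; the client can return the socket as soon as the call ends.
      int n1 = in.read(1000000L, buf, 0, buf.length);

      // seek + read: moves the stream position and reads "forward"; the
      // client holds on to the block reader (and its socket) in case the
      // next read continues from here.
      in.seek(2000000L);
      int n2 = in.read(buf, 0, buf.length);

      System.out.println("pread bytes=" + n1 + ", seek+read bytes=" + n2);
    }
  }
}
{code}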
I don't think we want to alter this HDFS behavior, since it's helpful when you're
reading through the entire file from start to finish, which many HDFS clients do.
It lets us make certain optimizations, such as reading ahead a few kilobytes even
when the user asks for only a few bytes at a time. These optimizations are
unavailable with {{pread}} because it creates a new {{BlockReader}} on every call.
So as far as recommendations for HBase go:
* use short-circuit reads whenever possible, since in many cases you can avoid
needing a socket at all and just reuse the same file descriptor (see the config
sketch after this list)
* set the socket cache to a bigger size and make the timeouts longer (I may
explore changing the defaults in HDFS...)
* if you are going to keep files open for a while and do random reads, use
{{pread}}, never {{seek+read}}.
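On the first bullet, a minimal sketch of the client-side short-circuit settings,
assuming the standard Hadoop 2.x property names {{dfs.client.read.shortcircuit}}
and {{dfs.domain.socket.path}} and a hypothetical socket path; the same domain
socket path has to be configured on the DataNodes, and libhadoop must be available.
In practice these normally live in hbase-site.xml / hdfs-site.xml on the
RegionServers rather than being set in code.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enable short-circuit local reads so the client gets a file descriptor
    // from the DataNode instead of streaming the block over a TCP socket.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Domain socket used to pass the descriptor; hypothetical path, must
    // match the DataNodes' dfs.domain.socket.path setting.
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");

    System.out.println("short-circuit = "
        + conf.getBoolean("dfs.client.read.shortcircuit", false));
  }
}
{code}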
> HBase does not close a closed socket, resulting in many CLOSE_WAIT
> -------------------------------------------------------------------
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.94.2
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node,
> 7279 regions
> Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K CLOSE_WAIT sockets, and at some point HBase can not
> connect to the datanode because there are too many mapped sockets from one
> host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart
> HBase to solve the problem; over time it will increase to 60-100K sockets in
> CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root 17255 17219 0 12:26 pts/0 00:00:00 grep 21592
> hbase 21592 1 17 Aug29 ? 03:29:06
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
> -Dhbase.log.dir=/var/log/hbase
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...