[ https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792915#comment-13792915 ]

Colin Patrick McCabe commented on HBASE-9393:
---------------------------------------------

I looked into this issue.  I found a few things:

The HDFS socket cache is too small by default and times out too quickly.  Its 
default size is 16, but HBase seems to be opening many more connections to the 
DN than that.  In this situation, sockets must inevitably be opened and then 
discarded, leading to sockets in {{CLOSE_WAIT}}.
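
For what it's worth, both the cache size and the expiry are client-side 
settings.  Here is a minimal sketch of bumping them programmatically, assuming 
the Hadoop 2.x key names {{dfs.client.socketcache.capacity}} and 
{{dfs.client.socketcache.expiryMsec}} (the values are just examples; normally 
you would set these in hdfs-site.xml rather than in code):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SocketCacheTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default capacity is 16 sockets; the expiry defaults to a few seconds.
    conf.setInt("dfs.client.socketcache.capacity", 256);
    conf.setLong("dfs.client.socketcache.expiryMsec", 60000L);
    // Any DFSClient created from this Configuration picks the settings up.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to " + fs.getUri());
  }
}
{code}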

When you use positional read (aka {{pread}}), we grab a socket from the cache, 
read from it, and then immediately put it back.  When you seek and then call 
{{read}}, we don't put the socket back at the end.  The assumption behind the 
normal {{read}} method is that you are probably going to call {{read}} again, 
so it holds on to the socket until something else comes up (such as closing the 
stream).  In many scenarios, this can lead to {{seek+read}} generating more 
sockets in {{CLOSE_WAIT}} than {{pread}}.
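
To make the difference concrete, here is a rough sketch of the two access 
patterns against an {{FSDataInputStream}} (the path and offsets are made up):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadVsSeekRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/hbase/some-hfile"));  // hypothetical path
    byte[] buf = new byte[4096];

    // pread: the client borrows a socket from the cache and returns it as
    // soon as this call completes.
    in.read(12345L, buf, 0, buf.length);

    // seek + read: after the read, the stream keeps its BlockReader (and the
    // underlying socket) open, expecting further sequential reads.
    in.seek(12345L);
    in.read(buf, 0, buf.length);

    in.close();  // only now is the socket held by seek+read released
    fs.close();
  }
}
{code}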

I don't think we want to alter this HDFS behavior, since it's helpful when 
you're reading through the entire file from start to finish, which many HDFS 
clients do.  It allows us to make certain optimizations such as reading a few 
kilobytes at a time, even if the user only asks for a few bytes at a time.  
These optimizations are unavailable with {{pread}} because it creates a new 
{{BlockReader}} each time.

So as far as recommendations for HBase go:
* use short-circuit reads whenever possible, since in many cases you can avoid 
needing a socket at all and just reuse the same file descriptor (see the 
sketch after this list)
* set the socket cache to a bigger size and adjust the timeouts to be longer (I 
may explore changing the defaults in HDFS...)
* if you are going to keep files open for a while and do random reads, use 
{{pread}}, never {{seek+read}}.
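
On the first point, here is a minimal sketch of the client-side settings for 
short-circuit reads, assuming the HDFS-347 style where the DataNode shares a 
domain socket and native libhadoop is available (the socket path is just an 
example):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Reads of local blocks go through a shared file descriptor instead of a
    // DataNode socket, so no TCP connection is needed at all.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Must match the path configured on the DataNode side.
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Short-circuit-capable client: " + fs.getUri());
  }
}
{code}

In practice the same keys would go into hbase-site.xml on the RegionServers 
(and hdfs-site.xml on the DataNodes) rather than being set in code.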

> HBase does not close a closed socket, resulting in many CLOSE_WAIT 
> --------------------------------------------------------------------
>
>                 Key: HBASE-9393
>                 URL: https://issues.apache.org/jira/browse/HBASE-9393
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.2
>         Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>            Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K CLOSE_WAIT sockets, and at some point HBase cannot 
> connect to the datanode because there are too many mapped sockets from one 
> host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> HBase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)
