[ https://issues.apache.org/jira/browse/HADOOP-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637162#action_12637162 ]
Raghu Angadi commented on HADOOP-4346:
--------------------------------------

The following shows relevant info from jmap for a datanode that had a lot of fds open.

{noformat}
# jmap without full-GC. Includes stale objects:
# num of fds for the process : 5358

# Java internal selectors
 117:      1780     42720  sun.nio.ch.Util$SelectorWrapper
 118:      1762     42288  sun.nio.ch.Util$SelectorWrapper$Closer

# Hadoop selectors
  93:      3026    121040  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$SelectorInfo
 844:         1        40  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$ProviderInfo

# Datanode threads
  99:      2229    106992  org.apache.hadoop.dfs.DataNode$DataXceiver
{noformat}

{noformat}
# jmap -histo:live immediately after the previous. This does a full-GC before counting.
# num of fds : 5187
  64:      1759     42216  sun.nio.ch.Util$SelectorWrapper
  65:      1759     42216  sun.nio.ch.Util$SelectorWrapper$Closer
 465:         4       160  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$SelectorInfo
 772:         1        40  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$ProviderInfo
 422:         4       192  org.apache.hadoop.dfs.DataNode$DataXceiver
{noformat}

This shows that there is no fd leak in Hadoop's selector cache: the DN has 4 threads doing I/O and there are 4 selectors. But a lot of Java internal selectors are still open.

{noformat}
# 'jmap -histo:live' about 1 minute after the previous full-GC
# num of fds : 57
# There are no SelectorWrapper objects. All of these must have been closed.
 768:         1        40  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$SelectorInfo
 730:         1        48  org.apache.hadoop.dfs.DataNode$DataXceiver
{noformat}

I will try to reproduce this myself and try out a patch for connect().

> Hadoop triggers a "soft" fd leak.
> ----------------------------------
>
>                 Key: HADOOP-4346
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4346
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Raghu Angadi
>
> Starting with Hadoop-0.17, most of the network I/O uses non-blocking NIO
> channels.
> Normal blocking reads and writes are handled by Hadoop and use our
> own cache of selectors. This cache suits Hadoop well, where I/O often
> occurs on many short-lived threads. The number of fds consumed is
> proportional to the number of threads currently blocked.
>
> If blocking I/O is done using java.*, Sun's implementation uses internal
> per-thread selectors. These selectors are closed using {{sun.misc.Cleaner}}.
> This cleaning appears to work like finalizers and is tied to GC, which is
> pretty ill-suited when we have many short-lived threads. Until a GC
> happens, the number of these selectors keeps growing. Each selector
> consumes 3 fds.
>
> Though blocking read and write are handled by Hadoop, {{connect()}} still
> uses the default implementation with a per-thread selector.
>
> Koji helped a lot in tracking this down. Some sections from 'jmap' output
> and other info Koji collected led to this suspicion; I will include that
> in the next comment.
>
> One solution might be to handle connect() also in Hadoop using our selectors.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
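The proposed direction (doing connect() on our own selector instead of Sun's GC-cleaned per-thread one) could be sketched roughly as below. This is only an illustration, not the actual patch: the class and method names are made up, and a real Hadoop fix would take the selector from the {{SocketIOWithTimeout$SelectorPool}} cache rather than open one per call.

```java
import java.io.IOException;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NonBlockingConnect {

    // Connect without triggering Sun's internal per-thread selector:
    // put the channel in non-blocking mode, initiate the connect, and
    // wait for OP_CONNECT on an explicitly managed Selector. The
    // selector is closed in finally, so its 3 fds are released
    // immediately instead of at some future GC.
    public static void connect(SocketChannel channel,
                               SocketAddress address,
                               int timeoutMs) throws IOException {
        channel.configureBlocking(false);
        if (channel.connect(address)) {
            return; // connected immediately
        }
        Selector selector = Selector.open();
        try {
            channel.register(selector, SelectionKey.OP_CONNECT);
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (true) {
                long left = deadline - System.currentTimeMillis();
                if (left <= 0) {
                    throw new SocketTimeoutException("connect timed out");
                }
                if (selector.select(left) > 0 && channel.finishConnect()) {
                    return; // connection established
                }
            }
        } finally {
            selector.close(); // fds released now, not tied to GC
        }
    }
}
```

With a pooled selector (as Hadoop's read/write path already does), even thousands of short-lived DataXceiver threads would keep the fd count proportional to the number of threads actually blocked in connect(), matching the behavior seen above for reads and writes.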