[ 
https://issues.apache.org/jira/browse/HDFS-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702188#comment-14702188
 ] 

Bob Hansen commented on HDFS-8855:
----------------------------------

Jitendra and [~daryn] both hypothesized that there is a missed equality check 
for the UserGroupInformation in the RPC lookup. I believe the theory is that 
the DataNodeUGIProvider is creating a new UGI for each request, and something 
is checking newUgi == oldUgi rather than newUgi.equals(oldUgi).  That might be 
a good place to look, [~wheat9].
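To illustrate the suspected failure mode: a cache keyed on reference equality never hits when a fresh-but-equal UGI arrives per request, while value equality does. This is a minimal sketch with a hypothetical UgiKey stand-in (not the real UserGroupInformation class or the actual RPC lookup code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a UGI: two instances are equal if the user matches.
final class UgiKey {
    final String user;
    UgiKey(String user) { this.user = user; }
    @Override public boolean equals(Object o) {
        return o instanceof UgiKey && ((UgiKey) o).user.equals(this.user);
    }
    @Override public int hashCode() { return user.hashCode(); }
}

public class UgiCacheDemo {
    public static void main(String[] args) {
        Map<UgiKey, String> cache = new HashMap<>();
        UgiKey first = new UgiKey("webhdfs");
        cache.put(first, "client-1");

        // A fresh UGI for the same user, as DataNodeUGIProvider is suspected
        // of creating on every request.
        UgiKey second = new UgiKey("webhdfs");

        // Reference equality misses the cached entry; a connection/client
        // cache doing this check would open a new NameNode connection each time.
        boolean hitByIdentity = (first == second);

        // Value equality (what HashMap itself uses) finds the entry and
        // would allow reuse.
        boolean hitByEquals = cache.containsKey(second);

        System.out.println(hitByIdentity + " " + hitByEquals);  // prints "false true"
    }
}
```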

[~xiaobingo]: rather than having a static cache of clients, perhaps we should 
match the client lifecycle to the HTTP session.  We can store the client 
reference in the ChannelHandlerContext attributes, and catch the 
channelInactive event in the WebHdfsHandler to close the client.  Of course, 
we need to check that the UGIs match, and make sure that the operations don't 
close the client before the session ends.

I have a prototype of that, but haven't been able to test it yet.
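For concreteness, the session-scoped approach above could look roughly like this in Netty 4. SessionClient is a hypothetical placeholder for the DFSClient held by the handler, and the attribute key name is made up; this is a sketch of the lifecycle idea, not the actual prototype:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.AttributeKey;

// Hypothetical stand-in for the per-session client (a DFSClient in practice).
class SessionClient {
    volatile boolean closed = false;
    void close() { closed = true; }  // would release the NameNode connection
}

// Sketch: bind the client to the channel and close it when the channel goes
// inactive, so the client's lifetime matches the HTTP session's.
public class SessionClientHandler extends ChannelInboundHandlerAdapter {
    static final AttributeKey<SessionClient> CLIENT_KEY =
            AttributeKey.valueOf("webhdfs.sessionClient");

    // Request-handling code would call this to get (or lazily create) the
    // client for the current session instead of consulting a static cache.
    static SessionClient clientFor(ChannelHandlerContext ctx) {
        SessionClient client = ctx.channel().attr(CLIENT_KEY).get();
        if (client == null) {
            client = new SessionClient();
            ctx.channel().attr(CLIENT_KEY).set(client);
        }
        return client;
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        // Session is over: close the client so the NameNode connection is
        // released promptly rather than waiting for reference reaping.
        SessionClient client = ctx.channel().attr(CLIENT_KEY).get();
        if (client != null) {
            client.close();
        }
        super.channelInactive(ctx);
    }
}
```

As noted above, a real version would also have to verify the stored client's UGI matches the request's, and guard against an individual operation closing the client mid-session.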

> Webhdfs client leaks active NameNode connections
> ------------------------------------------------
>
>                 Key: HDFS-8855
>                 URL: https://issues.apache.org/jira/browse/HDFS-8855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: webhdfs
>         Environment: HDP 2.2
>            Reporter: Bob Hansen
>            Assignee: Xiaobing Zhou
>         Attachments: HDFS-8855.1.patch
>
>
> The attached script simulates a process opening ~50 files via webhdfs and 
> performing random reads.  Note that there are at most 50 concurrent reads, 
> and all webhdfs sessions are kept open.  Each read is ~64k at a random 
> position.  
> The script periodically (once per second) shells into the NameNode and 
> produces a summary of the socket states.  For my test cluster with 5 nodes, 
> it took ~30 seconds for the NameNode to reach ~25000 active connections and 
> fail.
> It appears that each request to the webhdfs client is opening a new 
> connection to the NameNode and keeping it open after the request is complete. 
>  If the process continues to run, eventually (~30-60 seconds), all of the 
> open connections are closed and the NameNode recovers.  
> This smells like SoftReference reaping.  Are we using SoftReferences in the 
> webhdfs client to cache NameNode connections but never re-using them?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
