[ https://issues.apache.org/jira/browse/HDFS-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694104#comment-14694104 ]

Xiaobing Zhou commented on HDFS-8855:
-------------------------------------

[~bobhansen]
I was using the sequential workload below for the results above.
{noformat}
#!/bin/bash
# Sequential load generator: issue $count single reads at random offsets
# against a file served via webhdfs. op=OPEN, offset and length are
# appended to url_base.
count=${count:-1000000}
url_base=${url_base:-"http://c6401.ambari.apache.org:50070/webhdfs/v1/tmp/bigfile"}
read_size=${read_size:-1}
file_size=${file_size:-$(( 1024 * 1024 * 1024 ))}
#namenode=${namenode:-`echo $url_base | grep -Po "(?<=http://)[^:/]*"`}
namenode="c6401.ambari.apache.org"

for i in $(seq 1 "$count"); do
  # Pick a random, read_size-aligned offset within the file.
  rand=$(od -N 4 -t u4 -An /dev/urandom | tr -d " ")
  offset=$(( rand % (file_size / read_size) * read_size ))
  url="$url_base?op=OPEN&offset=$offset&length=$read_size"

  curl -L "$url" > url.blah 2>/dev/null

  if (( i % 100 == 0 )); then
    # Display the iteration count and the time
    echo -n "$i   "
    date +%H:%M:%S.%N

    # Count the connections on the NameNode by socket state
    ssh vagrant@$namenode "file=/tmp/netstat.out ; netstat -a > \$file ; \
      echo -n 'ESTABLISHED: ' ; echo -n \`grep -c ESTABLISHED \$file\` ; \
      echo -n '  TIME_WAIT: ' ; echo -n \`grep -c TIME_WAIT \$file\` ; \
      echo -n '  CLOSE_WAIT: ' ; grep -c CLOSE_WAIT \$file" &
  fi
  #sleep $delay
done
{noformat}

Running one load generator, I saw up to 2,200 connections on the NameNode; running two, up to 3,200.
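
For reference, here is roughly how I run multiple generators in parallel (the script name load_gen.sh is hypothetical; the script above just needs to be saved under that name):
{noformat}
# Hypothetical: launch N copies of the load generator above in parallel.
N=${N:-2}
for g in $(seq 1 "$N"); do
  count=1000000 ./load_gen.sh > "gen_$g.log" 2>&1 &
done
wait
{noformat}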

I also ran the concurrent workload and saw up to 7,000 connections. Setting file_size=1 and read_size=1 does not necessarily exacerbate the problem. I suspect that is an artifact of my cluster: the three nodes are local VMs, with NN/DN/SNN deployed evenly across them.
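
For clarity, a minimal sketch of what I mean by the concurrent workload, keeping many reads in flight at once (the parallelism of 50 mirrors the attached script; the throttling via wait -n needs bash 4.3+):
{noformat}
#!/bin/bash
# Sketch of the concurrent workload: keep up to max_inflight reads in flight.
max_inflight=${max_inflight:-50}
count=${count:-1000000}
url_base=${url_base:-"http://c6401.ambari.apache.org:50070/webhdfs/v1/tmp/bigfile"}
read_size=${read_size:-65536}
file_size=${file_size:-$(( 1024 * 1024 * 1024 ))}

for i in $(seq 1 "$count"); do
  rand=$(od -N 4 -t u4 -An /dev/urandom | tr -d " ")
  offset=$(( rand % (file_size / read_size) * read_size ))
  curl -L "$url_base?op=OPEN&offset=$offset&length=$read_size" > /dev/null 2>&1 &
  # Throttle: block until one of the in-flight reads finishes.
  while (( $(jobs -rp | wc -l) >= max_inflight )); do wait -n; done
done
wait
{noformat}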

Let's first try to work on the connection cache in org.apache.hadoop.ipc and then run the test again.
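
While re-testing, a quick way to watch the NameNode's socket states (a sketch reusing the host from my setup):
{noformat}
# Sample the NameNode's per-state socket counts once per second.
while true; do
  ssh vagrant@c6401.ambari.apache.org \
    "netstat -an | awk '{print \$6}' | sort | uniq -c | grep -E 'ESTABLISHED|TIME_WAIT|CLOSE_WAIT'"
  sleep 1
done
{noformat}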



> Webhdfs client leaks active NameNode connections
> ------------------------------------------------
>
>                 Key: HDFS-8855
>                 URL: https://issues.apache.org/jira/browse/HDFS-8855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: webhdfs
>         Environment: HDP 2.2
>            Reporter: Bob Hansen
>            Assignee: Xiaobing Zhou
>
> The attached script simulates a process opening ~50 files via webhdfs and 
> performing random reads.  Note that there are at most 50 concurrent reads, 
> and all webhdfs sessions are kept open.  Each read is ~64k at a random 
> position.  
> The script periodically (once per second) shells into the NameNode and 
> produces a summary of the socket states.  For my test cluster with 5 nodes, 
> it took ~30 seconds for the NameNode to accumulate ~25,000 active connections
> and fail.
> It appears that each request to the webhdfs client is opening a new 
> connection to the NameNode and keeping it open after the request is complete. 
>  If the process continues to run, eventually (~30-60 seconds), all of the 
> open connections are closed and the NameNode recovers.  
> This smells like SoftReference reaping.  Are we using SoftReferences in the 
> webhdfs client to cache NameNode connections but never re-using them?


