[jira] [Comment Edited] (HDDS-3600) ManagedChannels leaked on ratis pipeline when there are many connection retries

Attila Doroszlai (Jira) Mon, 15 May 2023 12:33:18 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722914#comment-17722914
 ]


Attila Doroszlai edited comment on HDDS-3600 at 5/15/23 7:32 PM:
-----------------------------------------------------------------

{{ManagedChannelImpl}} no longer seems to be a problem:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:         47687     2407791208  [B
   2:         15469      435022096  [I
   3:        215271       16059032  [C
   4:        112218        5386464  org.apache.ratis.util.TimeoutTimer$Task
   5:        168665        4047960  java.util.concurrent.ConcurrentSkipListMap$Node
   6:        144808        3475392  java.lang.String
   7:         46044        3075656  [Ljava.lang.Object;
   8:        112218        2693232  org.apache.ratis.util.LogUtils$1
   9:        112218        2693232  org.apache.ratis.util.TimeoutExecutor$$Lambda$356/123612721
  10:        112218        2693232  org.apache.ratis.util.TimeoutTimer$$Lambda$358/1179087374
  11:        101312        2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$353/838735040
  12:        101312        2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$355/1787381748
  13:         96163        2307912  java.lang.Long
  14:        123498        1975968  java.lang.Object
  15:        112218        1795488  org.apache.ratis.util.TimeoutTimer$Task$$Lambda$360/1055953892
  16:         56623        1358952  org.apache.hadoop.hdds.protocol.DatanodeDetails$Port
  17:            37        1213008  [Ljava.util.concurrent.ForkJoinTask;
  18:          9886        1028144  org.apache.hadoop.hdds.protocol.DatanodeDetails
...
 664:             6           1584  org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl
{code}
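
A histogram like the one above can also be checked mechanically rather than by eye. The following is only a minimal sketch, assuming the JDK 8 {{jmap -histo}} row layout shown here (rank, instance count, byte count, class name); {{HistoCheck}} and {{parse}} are hypothetical names, not part of Ozone or Ratis. It also accepts rows whose class name has wrapped onto the following line, as sometimes happens when the output is pasted into mail.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads `jmap -histo` text and reports instance counts per class. */
class HistoCheck {
    // A histogram row: rank, instance count, byte count, then (optionally) the class name.
    private static final Pattern ROW =
        Pattern.compile("^\\s*(\\d+):\\s+(\\d+)\\s+(\\d+)\\s*(\\S*)\\s*$");

    /** Returns class name -> instance count, in the order the rows appear. */
    static Map<String, Long> parse(String histo) {
        Map<String, Long> counts = new LinkedHashMap<>();
        Long pending = null; // count from a row whose class name wrapped to the next line
        for (String line : histo.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                long instances = Long.parseLong(m.group(2));
                if (m.group(4).isEmpty()) {
                    pending = instances;          // class name expected on the next line
                } else {
                    counts.put(m.group(4), instances);
                    pending = null;
                }
            } else if (pending != null && !line.trim().isEmpty()) {
                counts.put(line.trim(), pending); // the wrapped class name
                pending = null;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "   4:        112218        5386464  org.apache.ratis.util.TimeoutTimer$Task",
            " 664:             6           1584  ",
            "org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl");
        parse(sample).forEach((name, n) -> System.out.println(n + "  " + name));
    }
}
```

Looking up {{org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl}} in the returned map gives the instance count to compare between runs.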


was (Author: adoroszlai):
{{ManagedChannelImpl}} no longer seems to be a problem:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:         47687     2407791208  [B
   2:         15469      435022096  [I
   3:        215271       16059032  [C
   4:        112218        5386464  org.apache.ratis.util.TimeoutTimer$Task
   5:        168665        4047960  java.util.concurrent.ConcurrentSkipListMap$Node
   6:        144808        3475392  java.lang.String
   7:         46044        3075656  [Ljava.lang.Object;
   8:        112218        2693232  org.apache.ratis.util.LogUtils$1
   9:        112218        2693232  org.apache.ratis.util.TimeoutExecutor$$Lambda$356/123612721
  10:        112218        2693232  org.apache.ratis.util.TimeoutTimer$$Lambda$358/1179087374
  11:        101312        2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$353/838735040
  12:        101312        2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$355/1787381748
  13:         96163        2307912  java.lang.Long
  14:        123498        1975968  java.lang.Object
  15:        112218        1795488  org.apache.ratis.util.TimeoutTimer$Task$$Lambda$360/1055953892
  16:         56623        1358952  org.apache.hadoop.hdds.protocol.DatanodeDetails$Port
  17:            37        1213008  [Ljava.util.concurrent.ForkJoinTask;
  18:          9886        1028144  org.apache.hadoop.hdds.protocol.DatanodeDetails
  19:         62976        1007616  java.lang.Integer
  20:          1284         914208  org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OMRequest$Builder
  21:          7441         837000  java.lang.Class
  22:          8719         767272  java.lang.reflect.Method
  23:         20512         656384  java.util.concurrent.ConcurrentHashMap$Node
  24:         26634         639216  java.util.ArrayList
  25:         16439         526048  java.util.Hashtable$Entry
...
 664:             6           1584  org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl
{code}

> ManagedChannels leaked on ratis pipeline when there are many connection 
> retries
> -------------------------------------------------------------------------------
>
>                 Key: HDDS-3600
>                 URL: https://issues.apache.org/jira/browse/HDDS-3600
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Client
>    Affects Versions: 1.0.0
>            Reporter: Rakesh Radhakrishnan
>            Assignee: Attila Doroszlai
>            Priority: Critical
>              Labels: TriagePending
>         Attachments: HeapHistogram-Snapshot-ManagedChannel-Leaked-001.png, 
> jmap.histo, outloggenerator-ozonefs-003.log
>
>
> ManagedChannels leaked on ratis pipeline when there are many connection 
> retries
> Observed that too many ManagedChannels were opened while running the
> Synthetic Hadoop load generator.
>  Ran the benchmark with only one pipeline in the cluster, and again with
> only two pipelines in the cluster.
>  Both runs failed with "too many open files". Many TCP connections stayed
> open for a long time, so a channel leak is suspected.
> More details below:
>  *1)* Execute NNloadGenerator
> {code:java}
> [rakeshr@ve1320 loadOutput]$ ps -ef | grep load
> hdfs     362822      1 19 05:24 pts/0    00:03:16 
> /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_jar -Xmx825955249 
> -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true 
> -Dyarn.log.dir=/var/log/hadoop-yarn -Dyarn.log.file=hadoop.log 
> -Dyarn.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/libexec/../../hadoop-yarn
>  -Dyarn.root.logger=INFO,console 
> -Djava.library.path=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/lib/native
>  -Dhadoop.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop.log 
> -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop
>  -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console 
> -Dhadoop.policy.file=hadoop-policy.xml 
> -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar 
> /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.2.0.0-141-tests.jar
>  NNloadGenerator -root o3fs://bucket2.vol2/
> rakeshr  368739 354174  0 05:41 pts/0    00:00:00 grep --color=auto load
> {code}
> *2)* Many TCP connections to port 9858, the default Ratis pipeline port,
> were active during the run.
> {code:java}
> [rakeshr@ve1320 loadOutput]$ sudo lsof -a -p 362822 | grep "9858" | wc
>    3229   32290  494080
> [rakeshr@ve1320 loadOutput]$ vi tcp_log
> ............
> java    440633 hdfs 4090u     IPv4          271141987       0t0        TCP ve1320.halxg.cloudera.com:35190->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4091u     IPv4          271127918       0t0        TCP ve1320.halxg.cloudera.com:35192->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4092u     IPv4          271038583       0t0        TCP ve1320.halxg.cloudera.com:59116->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4093u     IPv4          271038584       0t0        TCP ve1320.halxg.cloudera.com:59118->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4095u     IPv4          271127920       0t0        TCP ve1320.halxg.cloudera.com:35196->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> [rakeshr@ve1320 loadOutput]$ ^C
>  {code}
> *3)* The heap dump shows 9571 ManagedChannel objects. The heap dump is quite
> large, so a snapshot is attached to this Jira.
> *4)* Attached the output and thread dump of the SyntheticLoadGenerator
> benchmark client process to show the exceptions printed to the console. FYI,
> this file was quite large, so a few repeated exception traces were trimmed.
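
The description above suspects that channels are leaked when connections are retried. As a general illustration of how that kind of leak can arise, and not a reproduction of the actual Ratis or Ozone code, the sketch below replaces a client's channel on every retry. All names ({{ChannelLeakSketch}}, {{FakeChannel}}, {{RetryingClient}}) are hypothetical, and {{FakeChannel}} is only a stand-in that counts open instances instead of opening sockets.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration only; none of these types exist in Ratis or Ozone. */
class ChannelLeakSketch {

    /** Stand-in for a channel: tracks how many instances are currently open. */
    static class FakeChannel implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        private boolean closed;

        FakeChannel() { OPEN.incrementAndGet(); }

        @Override
        public synchronized void close() {
            if (!closed) {            // closing twice must not decrement twice
                closed = true;
                OPEN.decrementAndGet();
            }
        }
    }

    /** Client that swaps in a fresh channel whenever it reconnects. */
    static class RetryingClient {
        private final AtomicReference<FakeChannel> current =
            new AtomicReference<>(new FakeChannel());

        /** Leaky variant: the previous channel is dropped without being closed. */
        void reconnectLeaky() {
            current.set(new FakeChannel());
        }

        /** Safer variant: install the new channel, then close the one it replaced. */
        void reconnectClosingOld() {
            FakeChannel old = current.getAndSet(new FakeChannel());
            if (old != null) {
                old.close();
            }
        }

        void shutdown() {
            FakeChannel c = current.getAndSet(null);
            if (c != null) {
                c.close();
            }
        }
    }

    public static void main(String[] args) {
        RetryingClient leaky = new RetryingClient();
        for (int i = 0; i < 100; i++) {
            leaky.reconnectLeaky();
        }
        System.out.println("open channels after leaky retries: " + FakeChannel.OPEN.get());
    }
}
```

In the leaky variant the open count grows with every retry, which is the kind of growth a heap histogram or {{lsof}} would surface. Whether this is the mechanism behind the leak reported here is not established by this message.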



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
