[
https://issues.apache.org/jira/browse/HDDS-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722914#comment-17722914
]
Attila Doroszlai edited comment on HDDS-3600 at 5/15/23 7:32 PM:
-----------------------------------------------------------------
{{ManagedChannelImpl}} no longer seems to be a problem:
{code}
 num  #instances      #bytes  class name
----------------------------------------------
   1:      47687  2407791208  [B
   2:      15469   435022096  [I
   3:     215271    16059032  [C
   4:     112218     5386464  org.apache.ratis.util.TimeoutTimer$Task
   5:     168665     4047960  java.util.concurrent.ConcurrentSkipListMap$Node
   6:     144808     3475392  java.lang.String
   7:      46044     3075656  [Ljava.lang.Object;
   8:     112218     2693232  org.apache.ratis.util.LogUtils$1
   9:     112218     2693232  org.apache.ratis.util.TimeoutExecutor$$Lambda$356/123612721
  10:     112218     2693232  org.apache.ratis.util.TimeoutTimer$$Lambda$358/1179087374
  11:     101312     2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$353/838735040
  12:     101312     2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$355/1787381748
  13:      96163     2307912  java.lang.Long
  14:     123498     1975968  java.lang.Object
  15:     112218     1795488  org.apache.ratis.util.TimeoutTimer$Task$$Lambda$360/1055953892
  16:      56623     1358952  org.apache.hadoop.hdds.protocol.DatanodeDetails$Port
  17:         37     1213008  [Ljava.util.concurrent.ForkJoinTask;
  18:       9886     1028144  org.apache.hadoop.hdds.protocol.DatanodeDetails
...
 664:          6        1584  org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl
{code}
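To repeat this check against a fresh histogram, the instance count for a class can be pulled out with a small script. This is only a rough sketch: the helper names {{parse_histogram}} and {{count_instances}} are made up for illustration, and it assumes each histogram row sits on a single line in the order rank, instances, bytes, class name.

```python
import re

# One histogram row: "<rank>: <instances> <bytes> <class name>".
# Header and separator lines do not match and are skipped.
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Return {class_name: (instances, bytes)} for every row found in text."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m.group(4)] = (int(m.group(2)), int(m.group(3)))
    return rows


def count_instances(text, class_suffix):
    """Sum instance counts over classes whose name ends with class_suffix."""
    return sum(
        n for name, (n, _) in parse_histogram(text).items()
        if name.endswith(class_suffix)
    )


# Two rows copied from the histogram in this comment.
SAMPLE = """\
   4: 112218 5386464 org.apache.ratis.util.TimeoutTimer$Task
 664: 6 1584 org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl
"""

print(count_instances(SAMPLE, "ManagedChannelImpl"))  # prints 6
```

Comparing this count across snapshots taken during a run would show whether the number of channels keeps growing.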
was (Author: adoroszlai):
{{ManagedChannelImpl}} no longer seems to be a problem:
{code}
 num  #instances      #bytes  class name
----------------------------------------------
   1:      47687  2407791208  [B
   2:      15469   435022096  [I
   3:     215271    16059032  [C
   4:     112218     5386464  org.apache.ratis.util.TimeoutTimer$Task
   5:     168665     4047960  java.util.concurrent.ConcurrentSkipListMap$Node
   6:     144808     3475392  java.lang.String
   7:      46044     3075656  [Ljava.lang.Object;
   8:     112218     2693232  org.apache.ratis.util.LogUtils$1
   9:     112218     2693232  org.apache.ratis.util.TimeoutExecutor$$Lambda$356/123612721
  10:     112218     2693232  org.apache.ratis.util.TimeoutTimer$$Lambda$358/1179087374
  11:     101312     2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$353/838735040
  12:     101312     2431488  org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$$Lambda$355/1787381748
  13:      96163     2307912  java.lang.Long
  14:     123498     1975968  java.lang.Object
  15:     112218     1795488  org.apache.ratis.util.TimeoutTimer$Task$$Lambda$360/1055953892
  16:      56623     1358952  org.apache.hadoop.hdds.protocol.DatanodeDetails$Port
  17:         37     1213008  [Ljava.util.concurrent.ForkJoinTask;
  18:       9886     1028144  org.apache.hadoop.hdds.protocol.DatanodeDetails
  19:      62976     1007616  java.lang.Integer
  20:       1284      914208  org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OMRequest$Builder
  21:       7441      837000  java.lang.Class
  22:       8719      767272  java.lang.reflect.Method
  23:      20512      656384  java.util.concurrent.ConcurrentHashMap$Node
  24:      26634      639216  java.util.ArrayList
  25:      16439      526048  java.util.Hashtable$Entry
...
 664:          6        1584  org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl
{code}
> ManagedChannels leaked on ratis pipeline when there are many connection
> retries
> -------------------------------------------------------------------------------
>
> Key: HDDS-3600
> URL: https://issues.apache.org/jira/browse/HDDS-3600
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Client
> Affects Versions: 1.0.0
> Reporter: Rakesh Radhakrishnan
> Assignee: Attila Doroszlai
> Priority: Critical
> Labels: TriagePending
> Attachments: HeapHistogram-Snapshot-ManagedChannel-Leaked-001.png,
> jmap.histo, outloggenerator-ozonefs-003.log
>
>
> ManagedChannels leaked on ratis pipeline when there are many connection
> retries
> Observed that too many ManagedChannels were opened while running the Synthetic
> Hadoop load generator.
> Ran the benchmark with only one pipeline in the cluster, and again with only
> two pipelines in the cluster.
> Both runs failed with "too many open files". Many TCP connections stayed open
> for a long time, which suggests channel leaks.
> More details below:
> *1)* Execute NNloadGenerator
> {code:java}
> [rakeshr@ve1320 loadOutput]$ ps -ef | grep load
> hdfs 362822 1 19 05:24 pts/0 00:03:16
> /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_jar -Xmx825955249
> -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true
> -Dyarn.log.dir=/var/log/hadoop-yarn -Dyarn.log.file=hadoop.log
> -Dyarn.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/libexec/../../hadoop-yarn
> -Dyarn.root.logger=INFO,console
> -Djava.library.path=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/lib/native
> -Dhadoop.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop.log
> -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop
> -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console
> -Dhadoop.policy.file=hadoop-policy.xml
> -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
> /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.2.0.0-141-tests.jar
> NNloadGenerator -root o3fs://bucket2.vol2/
> rakeshr 368739 354174 0 05:41 pts/0 00:00:00 grep --color=auto load
> {code}
> *2)* Active TCP connections on port 9858, the default ratis pipeline port,
> during the run.
> {code:java}
> [rakeshr@ve1320 loadOutput]$ sudo lsof -a -p 362822 | grep "9858" | wc
> 3229 32290 494080
> [rakeshr@ve1320 loadOutput]$ vi tcp_log
> ............
> java 440633 hdfs 4090u IPv4 271141987 0t0 TCP
> ve1320.halxg.cloudera.com:35190->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java 440633 hdfs 4091u IPv4 271127918 0t0 TCP
> ve1320.halxg.cloudera.com:35192->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java 440633 hdfs 4092u IPv4 271038583 0t0 TCP
> ve1320.halxg.cloudera.com:59116->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java 440633 hdfs 4093u IPv4 271038584 0t0 TCP
> ve1320.halxg.cloudera.com:59118->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java 440633 hdfs 4095u IPv4 271127920 0t0 TCP
> ve1320.halxg.cloudera.com:35196->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> [rakeshr@ve1320 loadOutput]$ ^C
> {code}
> *3)* The heap dump shows 9571 ManagedChannel objects. The heap dump is quite
> large, so only a snapshot is attached to this jira.
> *4)* Attached the output and thread dump of the SyntheticLoadGenerator
> benchmark client process to show the exceptions printed to the console. FYI,
> this file was quite large, so a few repeated exception traces were trimmed.