[ 
https://issues.apache.org/jira/browse/KAFKA-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594320#comment-16594320
 ] 

Yu Yang commented on KAFKA-7304:
--------------------------------

After more experiments, we currently think that the issue is caused by too many 
idle SSL connections that are not closed in time. 

I set up a test cluster of 24 brokers on d2.8xlarge instances, allocated 
32 GB of heap space to the Kafka process, and had ~40k clients write to a test 
topic on this cluster. The following graph shows the JVM heap usage and GC 
activity over the past 24 hours or so. The cluster ran fine with low heap usage 
and low CPU usage. However, the heap usage and CPU usage of the brokers 
increased sharply when we added or terminated brokers in this cluster (for 
broker termination, no topic partitions were allocated on the terminated 
nodes). 

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMjcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yMi01NC01OA==

The cluster recovered after we turned off the SSL write traffic to the 
cluster, let the brokers garbage-collect the objects in the old gen, and 
then resumed the SSL write traffic. 
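
To illustrate the idle-connection hypothesis, here is a minimal sketch of how 
a server can track last activity per connection and expire the idle ones in 
LRU order. It is not Kafka's actual Selector code; the class name 
IdleConnectionTracker and its methods are hypothetical and exist only for this 
example. 

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Kafka): tracks last activity per connection. */
class IdleConnectionTracker {
    private final long maxIdleMs;
    // accessOrder=true keeps the least recently touched connection at the head.
    private final Map<String, Long> lastActiveMs = new LinkedHashMap<>(16, 0.75f, true);

    IdleConnectionTracker(long maxIdleMs) {
        this.maxIdleMs = maxIdleMs;
    }

    /** Record activity on a connection; moves it to the tail of the LRU order. */
    void touch(String connectionId, long nowMs) {
        lastActiveMs.put(connectionId, nowMs);
    }

    /** Forget a connection that was closed for some other reason. */
    void remove(String connectionId) {
        lastActiveMs.remove(connectionId);
    }

    /**
     * Remove and return the ids of connections idle for longer than maxIdleMs.
     * Assumes nowMs never goes backwards, so the scan can stop at the first
     * connection that is still fresh.
     */
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastActiveMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() <= maxIdleMs) {
                break; // everything after this entry was touched more recently
            }
            expired.add(entry.getKey());
            it.remove();
        }
        return expired;
    }

    int size() {
        return lastActiveMs.size();
    }
}
```

The caller would close the channel for each returned id and release its 
buffers. If expiry runs too rarely, or a connection is never registered with 
the tracker, its channel object stays reachable and ends up in the old gen, 
which is the kind of build-up described above. On a real broker the idle 
timeout is governed by the connections.max.idle.ms setting. 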

> memory leakage in org.apache.kafka.common.network.Selector
> ----------------------------------------------------------
>
>                 Key: KAFKA-7304
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7304
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.1.0, 1.1.1
>            Reporter: Yu Yang
>            Priority: Critical
>             Fix For: 1.1.2, 2.0.1, 2.1.0
>
>         Attachments: 7304.v4.txt, 7304.v7.txt, Screen Shot 2018-08-16 at 
> 11.04.16 PM.png, Screen Shot 2018-08-16 at 11.06.38 PM.png, Screen Shot 
> 2018-08-16 at 12.41.26 PM.png, Screen Shot 2018-08-16 at 4.26.19 PM.png, 
> Screen Shot 2018-08-17 at 1.03.35 AM.png, Screen Shot 2018-08-17 at 1.04.32 
> AM.png, Screen Shot 2018-08-17 at 1.05.30 AM.png
>
>
> We are testing secured writing to Kafka through SSL. At small scale, 
> SSL writing to Kafka was fine. However, when we enabled SSL writing at a 
> larger scale (>40k clients writing concurrently), the Kafka brokers soon hit 
> OutOfMemory errors with a 4 GB memory setting. We tried increasing the 
> heap size to 10 GB, but encountered the same issue. 
> We took a few heap dumps and found that most of the heap memory is 
> referenced through org.apache.kafka.common.network.Selector objects. There 
> are two channel map fields in Selector. It seems that somehow the objects are 
> not removed from these maps in a timely manner. 
> One observation is that the memory leak seems related to Kafka partition 
> leader changes. If a broker restart or similar event in the cluster causes 
> partition leadership changes, the brokers may hit the OOM issue faster. 
> {code}
>     private final Map<String, KafkaChannel> channels;
>     private final Map<String, KafkaChannel> closingChannels;
> {code}
> Please see the attached images and the following link for a sample GC 
> analysis. 
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0
> The command line for running Kafka: 
> {code}
> java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m 
> -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 
> -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 
> -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M 
> -Djava.awt.headless=true 
> -Dlog4j.configuration=file:/etc/kafka/log4j.properties 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port=9999 
> -Dcom.sun.management.jmxremote.rmi.port=9999 -cp /usr/local/libs/*  
> kafka.Kafka /etc/kafka/server.properties
> {code}
> We use Java 1.8.0_102, and have applied a TLS patch that reduces the 
> X509Factory.certCache map size from 750 to 20. 
> {code}
> java -version
> java version "1.8.0_102"
> Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
