Hi, I'm looking for some input on debugging a high memory usage issue (and the process subsequently being killed) in one of the applications I deal with. From what I have looked into so far, the issue appears to have something to do with the CMS collector, so I hope this is the right place for this question.
A bit of background: the application I'm dealing with is an ElasticSearch server, version 1.7.5, running on Java 8:

    java version "1.8.0_172"
    Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.172-b11, mixed mode)

To add to the complexity of debugging this, it runs as a Docker container on Docker 18.03.0-ce, on a CentOS 7 host VM with kernel 3.10.0-693.5.2.el7.x86_64.

We have been noticing that this container/process keeps getting killed by the oom-killer every few days. The dmesg logs suggest that the process has hit the memory "limits" set at the Docker cgroup level. After debugging this over the past day or so, I've reached a point where I can't make much sense of the data I'm looking at.

The JVM process is started with the following params (of relevance):

    java -Xms2G -Xmx6G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
         -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
         -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC ....

As you can see, it uses the CMS collector, initiating a GC at 75% occupancy of the tenured/old gen. After a few hours/days of running, I notice that even though the CMS collector does run almost every hour or so, there is a huge number of objects _with no GC roots_ that never get collected. These objects internally seem to hold on to ByteBuffer(s), which as a result never get released (from what I can see), so the non-heap memory keeps building up until the process gets killed.

To give an example, here's the jmap -histo output (only the relevant parts):

     num     #instances         #bytes  class name
    ----------------------------------------------
       1:        861642      196271400  [B
       2:        198776       28623744  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
       3:        676722       21655104  org.apache.lucene.store.ByteArrayDataInput
       4:        202398       19430208  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
       5:        261819       18850968  org.apache.lucene.util.fst.FST$Arc
       6:        178661       17018376  [C
       7:         31452       16856024  [I
       8:        203911        8049352  [J
       9:         85700        5484800  java.nio.DirectByteBufferR
      10:        168935        5405920  java.util.concurrent.ConcurrentHashMap$Node
      11:         89948        5105328  [Ljava.lang.Object;
      12:        148514        4752448  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
    ....
    Total       5061244      418712248

The above output is without the "live" option. Running jmap -histo:live returns something like this (again, only the relevant parts):

      13:         31753        1016096  org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference
    ...
      44:           887         127728  org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
    ...
      50:          3054          97728  org.apache.lucene.store.ByteArrayDataInput
    ...
      59:           888          85248  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
    Total       1177783      138938920

Notice the vast difference between the live and non-live instance counts of the same classes. This isn't just in one "snapshot": I have been monitoring this for more than a day and the pattern continues. Taking heap dumps and inspecting them with tools like VisualVM also shows that these instances have "no GC root", and I have checked the GC log files to confirm that the CMS collector does occasionally run. However, these objects never seem to get collected.

I realize this data may not be enough to narrow down the issue, but what I am looking for is some help/input/hints/suggestions on what I should try next to figure out why these instances aren't being GCed. Is this something that's expected in certain situations?

-Jaikiran
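P.S. In case it helps to correlate the java.nio.DirectByteBufferR instances with the non-heap growth, here is a minimal sketch of how the JDK's "direct" and "mapped" buffer pools could be polled via the standard BufferPoolMXBean API. This assumes remote JMX is enabled on the Elasticsearch JVM and the port is published from the container; the host/port below are placeholders, not my actual setup.

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.List;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DirectBufferPoolWatcher {
        public static void main(String[] args) throws Exception {
            // Placeholder JMX endpoint; assumes remote JMX is enabled on the
            // Elasticsearch JVM (com.sun.management.jmxremote.* flags) and the
            // port is published from the container.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // The JDK exposes one BufferPoolMXBean per pool: "direct" and "mapped".
                List<BufferPoolMXBean> pools =
                        ManagementFactory.getPlatformMXBeans(conn, BufferPoolMXBean.class);
                for (BufferPoolMXBean pool : pools) {
                    System.out.printf("%s pool: count=%d used=%d bytes capacity=%d bytes%n",
                            pool.getName(), pool.getCount(),
                            pool.getMemoryUsed(), pool.getTotalCapacity());
                }
            }
        }
    }

If the "direct" pool's used/capacity figures keep climbing between CMS cycles while the histogram shows those buffers as unreachable, that would at least confirm the native-side growth matches the heap-side picture.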