Hi,

We've repeatedly seen G1GC go OOM on production clusters with a 16GB heap under intense workloads, and given you're running on m4.2xl I wouldn't go over 16GB for the heap.
I'd suggest reverting to CMS, with a 16GB heap and up to 6GB of new gen. You can start with a MaxTenuringThreshold of 5 and enable GC logging so you can fine-tune the settings afterwards.
FYI, CMS tends to perform better than G1, even though it's a little harder to tune.

Cheers,

On Mon, Apr 3, 2017 at 10:54 PM Gopal, Dhruva <dhruva.go...@aspect.com> wrote:

> 16 Gig heap, with G1. Pertinent info from jvm.options below (we’re using
> m2.2xlarge instances in AWS):
>
> #################
> # HEAP SETTINGS #
> #################
>
> # Heap size is automatically calculated by cassandra-env based on this
> # formula: max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
> # That is:
> # - calculate 1/2 ram and cap to 1024MB
> # - calculate 1/4 ram and cap to 8192MB
> # - pick the max
> #
> # For production use you may wish to adjust this for your environment.
> # If that's the case, uncomment the -Xmx and Xms options below to override the
> # automatic calculation of JVM heap memory.
> #
> # It is recommended to set min (-Xms) and max (-Xmx) heap sizes to
> # the same value to avoid stop-the-world GC pauses during resize, and
> # so that we can lock the heap in memory on startup to prevent any
> # of it from being swapped out.
> -Xms16G
> -Xmx16G
>
> # Young generation size is automatically calculated by cassandra-env
> # based on this formula: min(100 * num_cores, 1/4 * heap size)
> #
> # The main trade-off for the young generation is that the larger it
> # is, the longer GC pause times will be. The shorter it is, the more
> # expensive GC will be (usually).
> #
> # It is not recommended to set the young generation size if using the
> # G1 GC, since that will override the target pause-time goal.
> # More info: http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html
> #
> # The example below assumes a modern 8-core+ machine for decent
> # times. If in doubt, and if you do not particularly want to tweak, go
> # 100 MB per physical CPU core.
> #-Xmn800M
>
> #################
> # GC SETTINGS #
> #################
>
> ### CMS Settings
>
> #-XX:+UseParNewGC
> #-XX:+UseConcMarkSweepGC
> #-XX:+CMSParallelRemarkEnabled
> #-XX:SurvivorRatio=8
> #-XX:MaxTenuringThreshold=1
> #-XX:CMSInitiatingOccupancyFraction=75
> #-XX:+UseCMSInitiatingOccupancyOnly
> #-XX:CMSWaitDuration=10000
> #-XX:+CMSParallelInitialMarkEnabled
> #-XX:+CMSEdenChunksRecordAlways
> # some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541
> #-XX:+CMSClassUnloadingEnabled
>
> ### G1 Settings (experimental, comment previous section and uncomment section below to enable)
>
> ## Use the Hotspot garbage-first collector.
> -XX:+UseG1GC
> #
> ## Have the JVM do less remembered set work during STW, instead
> ## preferring concurrent GC. Reduces p99.9 latency.
> -XX:G1RSetUpdatingPauseTimePercent=5
> #
> ## Main G1GC tunable: lowering the pause target will lower throughput and vise versa.
> ## 200ms is the JVM default and lowest viable setting
> ## 1000ms increases throughput. Keep it smaller than the timeouts in cassandra.yaml.
> -XX:MaxGCPauseMillis=500
>
> ## Optional G1 Settings
>
> # Save CPU time on large (>= 16GB) heaps by delaying region scanning
> # until the heap is 70% full. The default in Hotspot 8u40 is 40%.
> -XX:InitiatingHeapOccupancyPercent=70
>
> # For systems with > 8 cores, the default ParallelGCThreads is 5/8 the number of logical cores.
> # Otherwise equal to the number of cores when 8 or less.
> # Machines with > 10 cores should try setting these to <= full cores.
> #-XX:ParallelGCThreads=16
> # By default, ConcGCThreads is 1/4 of ParallelGCThreads.
> # Setting both to the same value can reduce STW durations.
> #-XX:ConcGCThreads=16
>
> ### GC logging options -- uncomment to enable
>
> #-XX:+PrintGCDetails
> #-XX:+PrintGCDateStamps
> #-XX:+PrintHeapAtGC
> #-XX:+PrintTenuringDistribution
> #-XX:+PrintGCApplicationStoppedTime
> #-XX:+PrintPromotionFailure
> #-XX:PrintFLSStatistics=1
> #-Xloggc:/var/log/cassandra/gc.log
> #-XX:+UseGCLogFileRotation
> #-XX:NumberOfGCLogFiles=10
> #-XX:GCLogFileSize=10M
>
> From: Alexander Dejanovski <a...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, April 3, 2017 at 8:00 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: cassandra OOM
>
> Hi,
>
> could you share your GC settings ? G1 or CMS ? Heap size, etc...
>
> Thanks,
>
> On Sun, Apr 2, 2017 at 10:30 PM Gopal, Dhruva <dhruva.go...@aspect.com> wrote:
>
> Hi –
>
> We’ve had what looks like an OOM situation with Cassandra (we have a
> dump file that got generated) in our staging (performance/load testing
> environment) and I wanted to reach out to this user group to see if you had
> any recommendations on how we should approach our investigation as to the
> cause of this issue. The logs don’t seem to point to any obvious issues,
> and we’re no experts in analyzing this by any means, so was looking for
> guidance on how to proceed. Should we enter a Jira as well? We’re on
> Cassandra 3.9, and are running a six node cluster. This happened in a
> controlled load testing environment. Feedback will be much appreciated!
>
> Regards,
> Dhruva
>
> This email (including any attachments) is proprietary to Aspect Software,
> Inc. and may contain information that is confidential. If you have received
> this message in error, please do not read, copy or forward this message.
> Please notify the sender immediately, delete it from your system and
> destroy any copies. You may not further disclose or distribute this email
> or its attachments.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
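
P.S. For reference, here is a rough sketch of how the CMS suggestion at the top of this message might look in jvm.options. Only the 16GB heap, the 6GB new gen (given above as an upper bound) and MaxTenuringThreshold=5 come from that suggestion. The other CMS flags are simply the commented-out defaults from the quoted file, uncommented as an assumption, and the logging flags and log path are likewise taken from that file. Treat all of it as a starting point to refine from the GC logs, not as tuned values.

```
# Assumes the G1 section (-XX:+UseG1GC, -XX:G1RSetUpdatingPauseTimePercent,
# -XX:MaxGCPauseMillis, -XX:InitiatingHeapOccupancyPercent) has been commented out.

# Heap: fixed at 16GB, with up to 6GB of new gen (6G is the upper bound suggested)
-Xms16G
-Xmx16G
-Xmn6G

# CMS collector; flags other than MaxTenuringThreshold are the file's defaults (assumption)
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

# Initial value suggested above; adjust after reading the tenuring distribution
-XX:MaxTenuringThreshold=5

# GC logging, needed for the fine-tuning step
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/cassandra/gc.log
```

Comments sit on their own lines here because it is safer not to rely on trailing comments after an option being ignored.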