[
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050677#comment-14050677
]
Rick Branson commented on CASSANDRA-7486:
-----------------------------------------
Mad anecdotes:
We ran with G1 enabled for around 4 days in a 33-node cluster running 1.2.17 on
JDK7u45 that has around a 1:5 read:write ratio. We tried a few different
configurations with short durations, but most of the time we ran it with the
out-of-the-box G1 configuration on a 20G heap and 32 parallel GC threads (16
cores, 32 hyperthreads). There were some somewhat scary bugs that were not fixed
until 7u60, and these ultimately made me roll back to the CMS collector after
the experiment.
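For anyone wanting to reproduce the setup, the configuration described above maps onto roughly the following JVM options. This is a minimal sketch, not the exact command line used: the helper name `g1_jvm_opts` is hypothetical, and pinning min and max heap to the same value is an assumption (the comment only says "20G heap"). The flags themselves are standard HotSpot options.

```python
def g1_jvm_opts(heap_gb=20, parallel_gc_threads=32, pause_target_ms=None):
    """Build HotSpot options for a G1 run like the one described above.

    Assumption: -Xms and -Xmx are set to the same value; the original
    comment does not say whether the heap was fixed-size.
    """
    opts = [
        "-XX:+UseG1GC",
        f"-Xms{heap_gb}G",
        f"-Xmx{heap_gb}G",
        f"-XX:ParallelGCThreads={parallel_gc_threads}",
    ]
    # G1's default pause target is 200 ms, so the flag is only
    # emitted when a different target is requested.
    if pause_target_ms is not None:
        opts.append(f"-XX:MaxGCPauseMillis={pause_target_ms}")
    return opts


if __name__ == "__main__":
    print(" ".join(g1_jvm_opts()))
```

Passing `pause_target_ms=100` or `150` would give the shorter pause targets that were also tried.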
* The experiment pointed out that our young gen was basically too small and was
pulling latency up significantly. We had moved to new hardware and hadn't taken
the time to sit down and play with GC settings. When we went back to CMS, I
doubled the new size from 800M to 1600M. This cut our mean latency as perceived
from the client dramatically, ~50% for writes and ~30% for reads, similar to
what we saw with G1. I was quite thrilled with this result.
* I tried both 100ms and 150ms pause time targets with 12G, 16G, and 20G
heaps, and while these resulted in slightly lower mean latency (~5-10%), Mixed
GC activity caused P99s to suffer greatly. There's compelling evidence that the
200ms default is nearly ideal for the way the G1 algorithm works in its current
incarnation.
* We basically needed a 20G heap to make G1 work well for us, since by default
G1 will use up to half of the max heap for eden space and Cassandra needs quite
a large old gen to stay happy. G1 appears to need a much larger eden space to
work efficiently, sizes that would make ParNew die in a fire. GCs of the eden
space were impressively fast, with a ~10G eden space taking ~120ms on average
to collect.
* G1's huge eden space was helpful in working around some issues with
compaction on the hints CF, which had dozens of very wide partitions with
hundreds of thousands of cells each.
* Overall, at the default 200ms pause time target, we didn't see much of an
increase in CPU usage over CMS.
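To make the heap-size point in the list above concrete, here is a rough back-of-the-envelope sketch of how much room is left for old gen at each heap size tried. It assumes the "up to half of the max heap for eden" figure quoted above holds exactly; that fraction comes from this comment, not from a check against the JVM, and the function name is hypothetical.

```python
def g1_heap_budget(heap_gb, max_eden_fraction=0.5):
    """Split a heap into worst-case eden and whatever remains.

    max_eden_fraction=0.5 mirrors the "up to half of the max heap"
    figure from the comment above; treat it as an assumption.
    """
    max_eden_gb = heap_gb * max_eden_fraction
    return {
        "heap_gb": heap_gb,
        "max_eden_gb": max_eden_gb,
        "remaining_gb": heap_gb - max_eden_gb,
    }


if __name__ == "__main__":
    # The three heap sizes mentioned in the list above.
    for heap in (12, 16, 20):
        b = g1_heap_budget(heap)
        print(
            f"{b['heap_gb']}G heap: eden up to {b['max_eden_gb']:.0f}G, "
            f"{b['remaining_gb']:.0f}G left for old gen and survivors"
        )
```

Under that assumption, only the 20G heap allows the ~10G eden mentioned above while still leaving around 10G for everything else.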
In the end, my tests basically told us that G1 requires a larger heap to get
the same results with *far* less tuning. If there are GC issues, it seems like
in the vast majority of cases G1 can either eliminate them or make it easy to
just work around them by cranking up the heap size. Someone should probably
test G1 with a variable-sized heap since it's designed to give back RAM when it
thinks it doesn't need it. That might or might not actually work. While we
didn't test this, a configuration of G1 + heap size min of 1/8 RAM and max of
1/2 RAM might make a really nice default for Cassandra at some point.
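The untested default floated above (G1, min heap of 1/8 RAM, max heap of 1/2 RAM) could be computed along these lines. A minimal sketch only: the function name is hypothetical, total RAM is passed in by the caller rather than detected, and nothing here has been validated against a running cluster.

```python
def variable_heap_opts(total_ram_mb):
    """Derive G1 heap bounds from total system RAM, in megabytes.

    Follows the untested suggestion above: min heap = 1/8 of RAM,
    max heap = 1/2 of RAM. Integer division rounds down.
    """
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    min_heap_mb = total_ram_mb // 8
    max_heap_mb = total_ram_mb // 2
    return [
        "-XX:+UseG1GC",
        f"-Xms{min_heap_mb}M",
        f"-Xmx{max_heap_mb}M",
    ]


if __name__ == "__main__":
    # Example: a machine with 64 GiB of RAM.
    print(" ".join(variable_heap_opts(64 * 1024)))
```

Whether G1 actually hands memory back to the OS between those bounds is exactly the open question raised above.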
> Compare CMS and G1 pause times
> ------------------------------
>
> Key: CASSANDRA-7486
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
> Project: Cassandra
> Issue Type: Test
> Components: Config
> Reporter: Jonathan Ellis
> Assignee: Ryan McGuire
> Fix For: 2.1.0
>
>
> See
> http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning
> and https://twitter.com/rbranson/status/482113561431265281
> May want to default 2.1 to G1.
> 2.1 is a different animal from 2.0 after moving most of memtables off heap.
> Suspect this will help G1 even more than CMS. (NB this is off by default but
> needs to be part of the test.)