I should note that the young gen size is just a tuning suggestion, not directly related to your problem at hand.
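If you do decide to pin the new gen size explicitly instead of letting the JVM pick it, a minimal sketch of what that could look like in cassandra-env.sh (stock 1.2/2.0-era script variables; the 8G/512M values are purely illustrative, not a recommendation for your hardware):

    # cassandra-env.sh: set heap and new gen explicitly instead of letting
    # the script derive them from system memory; the stock script expects
    # these two variables to be set (or unset) together
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="512M"   # roughly the 0.5 GB range suggested in the mail below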
You might want to make sure you don't have issues with key/row cache. Also, I'm assuming that your extra load isn't hitting tables that you wouldn't normally be hitting.

> On Nov 28, 2014, at 6:54 PM, graham sanderson <gra...@vast.com> wrote:
>
> Your GC settings would be helpful, though you can guesstimate by eyeballing
> (assuming settings are the same across all 4 images).
>
> Bursty load can be a big cause of old gen fragmentation (as small working-set
> objects tend to get spilled (promoted) along with memtable slabs which
> aren't flushed quickly enough). That said, empty fragmentation holes wouldn't
> show up as "used" in your graph, and it clearly looks like you are above
> your CMSInitiatingOccupancyFraction and CMS is running continuously, so they
> probably aren't the issue here.
>
> Other than trying a slightly larger heap to give you more headroom, I'd also
> suggest from eyeballing that you have probably let the JVM pick its own new
> gen size, and I suspect it is too small. What to set it to really depends
> on your workload, but you could try something in the 0.5 GB range unless that
> makes your young gen pauses too long. In that case (or indeed anyway) make
> sure you also have the latest GC settings (e.g.
> -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways) on newer
> JVMs, to help the young gen pauses.
>
>> On Nov 28, 2014, at 2:55 PM, Paulo Ricardo Motta Gomes
>> <paulo.mo...@chaordicsystems.com> wrote:
>>
>> Hello,
>>
>> This is a recurrent behavior of JVM GC in Cassandra that I never completely
>> understood: when a node is up for many days (or even months), or receives a
>> very high load spike (3x-5x normal load), CMS GC pauses start becoming very
>> frequent and slow, causing periodic timeouts in Cassandra. Triggering GC
>> manually doesn't free up memory either. The only solution when a node
>> reaches this state is to restart it.
>>
>> We restart the whole cluster every 1 or 2 months to avoid machines getting
>> into this state. We have tried tuning GC sizes and parameters and different
>> Cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More
>> recently, during Black Friday, we received about 5x our normal load, and
>> some machines started presenting this behavior. Once again, we restarted the
>> nodes and GC behaved normally again.
>>
>> I'm attaching a few pictures comparing the heap of "healthy" and "sick"
>> nodes: http://imgur.com/a/Tcr3w
>>
>> You can clearly see that some memory is actually reclaimed during GC on
>> healthy nodes, while on sick machines very little memory is reclaimed. Also,
>> since GC runs more frequently on sick machines, they use about 2x more CPU
>> than healthy nodes.
>>
>> Have you ever observed this behavior in your cluster? Could it be related
>> to heap fragmentation? Would using the G1 collector help in this case? Any
>> GC tuning or monitoring advice to troubleshoot this issue?
>>
>> Any advice or pointers will be kindly appreciated.
>>
>> Cheers,
>>
>> --
>> Paulo Motta
>>
>> Chaordic | Platform
>> www.chaordic.com.br
>> +55 48 3232.3200
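For anyone else hitting this, a quick way to confirm that old gen genuinely isn't being reclaimed (independently of the dashboard graphs) is plain JDK tooling plus GC logging. A rough sketch, assuming a stock cassandra-env.sh and the default CassandraDaemon process name:

    # watch old gen occupancy (O, %) and full GC count (FGC) every 5s;
    # on a "sick" node O stays pinned near 100 and FGC keeps climbing
    jstat -gcutil $(pgrep -f CassandraDaemon) 5s

    # in cassandra-env.sh: the newer-JVM CMS flags mentioned above, plus GC
    # logging so pauses can be correlated with the Cassandra timeouts
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"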