Your GC settings would be helpful, though you can make a guesstimate by eyeballing 
(assuming settings are the same across all 4 images)

Bursty load can be a big cause of old gen fragmentation (as small working set 
objects tend to get spilled (promoted) along with memtable slabs which aren’t 
flushed quickly enough). That said, empty fragmentation holes wouldn’t show up 
as “used” in your graph, and it clearly looks like you are above your 
CMSInitiatingOccupancyFraction and CMS is running continuously, so fragmentation 
holes probably aren’t the issue here.
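
If you do want to rule fragmentation out, one way (a sketch only; these are stock 
HotSpot flags, and the log path is just an example) is to enable GC logging with 
free-list statistics and watch for promotion failures:

    -Xloggc:/var/log/cassandra/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintPromotionFailure
    -XX:PrintFLSStatistics=1

A “promotion failed” / “concurrent mode failure” entry where the old gen still has 
plenty of free space but the max chunk size is small is the classic fragmentation 
signature; if you never see that, fragmentation isn’t what’s hurting you.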

Other than trying a slightly larger heap to give you more head room, I’d also 
note from eyeballing that you have probably let the JVM pick its own new gen 
size, and I’d suggest it is too small. What to set it to really depends on your 
workload, but you could try something in the 0.5 GB range unless that makes 
your young gen pauses too long. In that case (or indeed anyway) make sure you 
also have the latest GC settings (e.g. -XX:+CMSParallelInitialMarkEnabled 
-XX:+CMSEdenChunksRecordAlways) on newer JVMs, to help the young GC pauses.
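
For concreteness, that would look something like this in cassandra-env.sh (a 
sketch only; HEAP_NEWSIZE and MAX_HEAP_SIZE are the standard variables there, 
but the actual values are guesses you would need to tune against your workload):

    MAX_HEAP_SIZE="8G"       # illustrative; a bit more head room than now
    HEAP_NEWSIZE="512M"      # explicit new gen in the ~0.5 GB range above

    # help young GC pauses on newer JVMs
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"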

> On Nov 28, 2014, at 2:55 PM, Paulo Ricardo Motta Gomes 
> <paulo.mo...@chaordicsystems.com> wrote:
> 
> Hello,
> 
> This is a recurrent behavior of JVM GC in Cassandra that I never completely 
> understood: when a node is UP for many days (or even months), or receives a 
> very high load spike (3x-5x normal load), CMS GC pauses start becoming very 
> frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC 
> manually doesn't free up memory. The only solution when a node reaches this 
> state is to restart the node.
> 
> We restart the whole cluster every 1 or 2 months, to avoid machines getting 
> into this crazy state. We tried tuning GC size and parameters, different 
> cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More 
> recently, during black friday, we received about 5x our normal load, and some 
> machines started presenting this behavior. Once again, we restarted the nodes 
> and the GC behaved normally again.
> 
> I'm attaching a few pictures comparing the heap of "healthy" and "sick" 
> nodes: http://imgur.com/a/Tcr3w
> 
> You can clearly notice some memory is actually reclaimed during GC in healthy 
> nodes, while in sick machines very little memory is reclaimed. Also, since GC 
> is executed more frequently in sick machines, it uses about 2x more CPU than 
> non-sick nodes.
> 
> Have you ever observed this behavior in your cluster? Could this be related 
> to heap fragmentation? Would using the G1 collector help in this case? Any GC 
> tuning or monitoring advice to troubleshoot this issue?
> 
> Any advice or pointers will be kindly appreciated.
> 
> Cheers,
> 
> -- 
> Paulo Motta
> 
> Chaordic | Platform
> www.chaordic.com.br
> +55 48 3232.3200
