Hello,

This is a recurrent behavior of JVM GC in Cassandra that I never completely
understood: when a node is UP for many days (or even months), or receives a
very high load spike (3x-5x normal load), CMS GC pauses start becoming very
frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC
manually doesn't free up memory. The only solution when a node reaches this
state is to restart the node.

We restart the whole cluster every 1 or 2 months to avoid machines getting
into this state. We tried tuning heap sizes and GC parameters, and different
Cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More
recently, during Black Friday, we received about 5x our normal load, and
some machines started presenting this behavior. Once again, we restarted
the nodes and the GC behaved normally again.

I'm attaching a few pictures comparing the heap of "healthy" and "sick"
nodes: http://imgur.com/a/Tcr3w

You can clearly see that some memory is actually reclaimed during GC on
healthy nodes, while on sick machines very little memory is reclaimed.
Also, since GC runs more frequently on sick machines, they use about 2x
the CPU of healthy nodes.
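In case it helps with diagnosis, this is roughly the kind of GC logging we could enable in cassandra-env.sh to capture more detail on the next occurrence (just a sketch; the log path is illustrative, and -XX:+PrintPromotionFailure is included since fragmentation-driven promotion failures are one suspect):

```shell
# Sketch of GC logging flags for cassandra-env.sh (HotSpot/CMS).
# Log path below is only an example.
JVM_OPTS="$JVM_OPTS -verbose:gc"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
# Logs old-gen promotion failures, a common symptom of CMS heap fragmentation
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```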

Have you ever observed this behavior in your cluster? Could this be related
to heap fragmentation? Would using the G1 collector help in this case? Any
GC tuning or monitoring advice to troubleshoot this issue?

Any advice or pointers would be greatly appreciated.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200
