Hello,

This is a recurrent behavior of the JVM GC in Cassandra that I have never completely understood: when a node has been up for many days (or even months), or receives a very high load spike (3x-5x normal load), CMS GC pauses become very frequent and slow, causing periodic timeouts in Cassandra. Triggering a GC manually doesn't free up memory. The only fix once a node reaches this state is to restart it.

We restart the whole cluster every 1 or 2 months to keep machines from getting into this state. We have tried tuning the heap size and GC parameters, and different Cassandra versions (1.1, 1.2, 2.0), but the behavior keeps happening. More recently, during Black Friday, we received about 5x our normal load, and some machines started showing this behavior. Once again, we restarted the nodes and GC behaved normally again.

I'm attaching a few pictures comparing the heap of "healthy" and "sick" nodes: http://imgur.com/a/Tcr3w

You can clearly see that some memory is actually reclaimed during GC on healthy nodes, while on sick nodes very little memory is reclaimed. Also, since GC runs more frequently on sick nodes, they use about 2x more CPU than healthy nodes.

Have you ever observed this behavior in your cluster? Could this be related to heap fragmentation? Would using the G1 collector help in this case? Do you have any GC tuning or monitoring advice for troubleshooting this issue?

Any advice or pointers would be greatly appreciated.

Cheers,

--
Paulo Motta
Chaordic | Platform
http://www.chaordic.com.br/
+55 48 3232.3200
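
The difference between healthy and sick nodes described above (how much heap is still occupied right after a collection) can also be read as numbers rather than from graphs, using the standard `java.lang.management` API. The following is only a minimal sketch: the class name `GcReclaimCheck` and the helper `occupancyAfterGc` are made up for illustration, and as written it inspects the JVM it runs in. Pointing it at a Cassandra node would mean reading the same MXBeans over a remote JMX connection, which is not shown here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, for illustration only.
public class GcReclaimCheck {

    // Fraction of a pool still occupied right after its last collection.
    // Returns -1.0 when the pool's maximum size is undefined (max <= 0).
    public static double occupancyAfterGc(long usedAfterGc, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return (double) usedAfterGc / max;
    }

    public static void main(String[] args) {
        // Per heap pool: usage as measured after the most recent collection.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
            if (afterGc == null) {
                continue;
            }
            double occupancy = occupancyAfterGc(afterGc.getUsed(), afterGc.getMax());
            System.out.printf("%s: %d bytes used after last GC (occupancy %.2f)%n",
                    pool.getName(), afterGc.getUsed(), occupancy);
        }

        // Per collector: cumulative collection count and time since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

If the post-GC occupancy of the old generation pool stays close to 1.0 while the collection count keeps climbing, that would match the "sick" pattern in the graphs. It would not by itself distinguish fragmentation from a heap that is simply full of live objects.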