Hello, I've been trying to add a new node to my cluster (4 nodes) for a
few days now.

I started by adding a node similar to my current configuration, 4 GB of RAM
+ 2 cores on DigitalOcean. However, every time I would end up getting OOM
errors after many log entries of this type:

INFO  [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856
- Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap

leading to:

ka-120-Data.db (39291 bytes) for commitlog position
ReplayPosition(segmentId=1414243978538, position=23699418)
WARN  [SharedPool-Worker-13] 2014-10-25 13:48:18,032
AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread
Thread[SharedPool-Worker-13,5,main]: {}
java.lang.OutOfMemoryError: Java heap space

Thinking it had to do with either compaction or streaming, two activities
I've had tremendous issues with in the past, I tried slowing the stream
throughput (nodetool setstreamthroughput) down to extremely low values, all
the way to 5. I also tried setting the compaction throughput (nodetool
setcompactionthroughput) to 0 (unthrottled), and then, after reading that in
some cases unthrottled can be too fast, down to 8. Neither helped; it merely
shifted the mean time to OOM slightly, not in a way indicating either was
anywhere near a solution.
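
For reference, the tuning commands were along these lines (the values shown
are the ones I ended up at; the defaults noted in the comments are the
Cassandra 2.1 defaults):

# throttle streaming to 5 Mb/s (default is 200)
nodetool setstreamthroughput 5
# compaction: 0 means unthrottled...
nodetool setcompactionthroughput 0
# ...then throttled down to 8 MB/s (default is 16)
nodetool setcompactionthroughput 8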

The nodes were initially configured with 2 GB of heap; I tried cranking it
up to 3 GB, stressing the host memory to its limit.
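
Concretely, the heap was set in cassandra-env.sh, roughly like this (the
HEAP_NEWSIZE value is only illustrative, following the stock file's
suggestion of about 100 MB per core):

# in cassandra-env.sh
MAX_HEAP_SIZE="3G"
HEAP_NEWSIZE="200M"   # illustrative: ~100 MB per core on the 2-core box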

After doing some exploration (I am considering writing some Cassandra ops
documentation with lessons learned, since there seems to be little of it in
an organized fashion), I read that some people had strange issues on
lower-end boxes like that, so I bit the bullet and upgraded my new node to
an 8 GB + 4 core instance, which was anecdotally supposed to fare better.

To my complete shock, the exact same issues are present, even after raising
the heap to 6 GB. I figure this can't be a "normal" situation anymore; it
must be a bug of some kind.

My cluster is 4 nodes, RF of 2, with about 160 GB of data across all nodes,
spread over roughly 10 CFs of varying sizes. Runtime writes are between 300
and 900 per second. Cassandra 2.1.0, nothing too wild.
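
(The write rate above is a rough estimate. One way to sample it, using
mykeyspace as a placeholder for the actual keyspace name, is to diff the
write counts nodetool reports over an interval:)

# per-node writes/second, sampled over 60 seconds
w1=$(nodetool cfstats mykeyspace | awk '/Local write count/ {sum += $4} END {print sum}')
sleep 60
w2=$(nodetool cfstats mykeyspace | awk '/Local write count/ {sum += $4} END {print sum}')
echo "writes/sec on this node: $(( (w2 - w1) / 60 ))"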

Has anyone encountered these kinds of issues before? I would really enjoy
hearing about the experiences of people trying to run small-sized clusters
like mine. From everything I read, Cassandra operations go very well on
large (16 GB + 8 core) machines, but I'm sad to report I've had nothing but
trouble trying to run on smaller machines. Perhaps I can learn from others'
experience?

Full logs can be provided to anyone interested.

Cheers
