Vincent,

Only the 2.68GB partition is out of bounds here; all the others (<256MB)
shouldn't be much of a problem.
It could put pressure on your heap if it is often read and/or compacted.
But to answer your question about the 1% harming the cluster: a few big
partitions can definitely be a big problem, depending on your access
patterns.
Which compaction strategy are you using on this table?
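
If it helps, you can check it quickly from cqlsh (the keyspace and table
names below are placeholders, replace them with yours):

    # The "compaction" entry in the table options shows the strategy in use.
    cqlsh -e "DESCRIBE TABLE my_keyspace.my_table;" | grep compaction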

Could you provide/check the following things on a node that crashed
recently (see the commands sketched below the list):

   - Hardware specifications (how many cores? how much RAM? Bare metal or
   VMs?)
   - Java version
   - GC pauses throughout a day (grep GCInspector
   /var/log/cassandra/system.log): check if you have many pauses that take
   more than 1 second
   - GC logs at the time of a crash (if you don't produce any, you should
   activate them in cassandra-env.sh)
   - Tombstone warnings in the logs and a high number of tombstones read in
   cfstats
   - Make sure swap is disabled
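
To make that easier, here is a rough sketch of what I would run, assuming a
default package install (adjust the paths to your layout):

    # Java version and basic hardware info
    java -version
    nproc
    free -h

    # GC pauses: GCInspector lines reporting 4+ digit millisecond pauses,
    # i.e. roughly anything over one second
    grep GCInspector /var/log/cassandra/system.log | egrep '[0-9]{4,}ms'

    # Tombstone warnings
    grep -i tombstone /var/log/cassandra/system.log

    # Swap should be empty / disabled
    swapon -s
    free -h | grep -i swap

To activate GC logging, uncomment (or add) the standard HotSpot options in
cassandra-env.sh and restart the node, for example:

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"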


Cheers,


On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann <m...@vrischmann.me> wrote:

@Vladimir

We tried with 12GB and 16GB; the problem eventually appeared too.
In this particular cluster we have 143 tables across 2 keyspaces.

@Alexander

We have one table with a max partition of 2.68GB, one of 256MB, a bunch
with sizes varying between roughly 10MB and 100MB, and then the rest with a
max lower than 10MB.

On the biggest one, the 99th percentile is around 60MB, the 98th around
25MB and the 95th around 5.5MB. On the one with a max of 256MB, the 99th
percentile is around 4.6MB and the 98th around 2MB.

Could the 1% here really have that much impact? We do write a lot to the
biggest table and read quite often too, but I have no way to know if that
big partition is ever read.


On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:

Hi Vincent,

one of the usual causes of OOMs is very large partitions.
Could you check your nodetool cfstats output in search of large partitions?
If you find one (or more), run nodetool cfhistograms on those tables to get
a view of the partition size distribution.
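
Something like this should surface them (keyspace and table below are
placeholders, and the exact cfstats labels can vary a bit by version):

    # Spot tables with very large compacted partitions
    nodetool cfstats | egrep 'Table:|Compacted partition maximum bytes'

    # Then look at the partition size distribution of a suspect table
    nodetool cfhistograms my_keyspace my_table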

Thanks

On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com>
wrote:


Did you try any value in the range 8-20 GB (e.g. 60-70% of physical
memory)? Also, how many tables do you have across all keyspaces? Each table
can consume a minimum of about 1MB of Java heap.
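
If you want to pin the heap by hand, it's the MAX_HEAP_SIZE / HEAP_NEWSIZE
pair in cassandra-env.sh (the values below are only examples; both must be
set together to override the automatic calculation):

    MAX_HEAP_SIZE="16G"
    HEAP_NEWSIZE="1600M"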

Best regards, Vladimir Yudovin,

*Winguzone <https://winguzone.com?from=list> - Hosted Cloud Cassandra.
Launch your cluster in minutes.*


---- On Mon, 21 Nov 2016 05:13:12 -0500, *Vincent Rischmann
<m...@vrischmann.me>* wrote ----

Hello,

we have an 8-node Cassandra 2.1.15 cluster at work which has been giving us
a lot of trouble lately.

The problem is simple: nodes regularly die, either because of an
out-of-memory exception or because the Linux OOM killer decides to kill the
process.
For a couple of weeks now we have been running with the heap increased to
20GB, hoping it would solve the out-of-memory errors, but in fact it
didn't; instead of getting an out-of-memory exception, the OOM killer
killed the JVM.

We reduced the heap on some nodes to 8GB to see if it would work better,
but some nodes crashed again with an out-of-memory exception.
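
For what it's worth, one quick way to tell whether it was the kernel OOM
killer rather than a Java-level OutOfMemoryError is the kernel log (the log
file location varies by distribution):

    # Kernel OOM killer traces
    dmesg | egrep -i 'out of memory|killed process'
    grep -iE 'out of memory|oom-killer' /var/log/syslog   # or /var/log/messages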

I suspect some of our tables are badly modelled, which would cause
Cassandra to allocate a lot of memory, but I don't know how to prove that,
or how to find which table is bad and which query is responsible.

I tried looking at metrics in JMX, and tried profiling using Mission
Control, but it didn't really help; it's possible I missed something
because I have no idea what to look for exactly.

Anyone have some advice for troubleshooting this?

Thanks.

-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
