I recently upgraded our Cassandra cluster to 2.2.7, and after putting it
back under production load the instances started to show high load (as
reported by uptime) for no apparent reason. I'm not quite sure what could
be causing it.

We are running on i2.4xlarge instances, so we have 16 cores, 120 GB of RAM
and four 800 GB SSDs (striped with LVM into one big logical volume). The
kernel is 3.13.0-87-generic on HVM virtualisation. The cluster stores
26 TiB of data in two tables.

Symptoms:
 - High load, sometimes spiking up to 30 for a few minutes, then dropping
back to the cluster average of 3-4.
 - An affected instance might have one compaction running, or none at all.
 - Each node serves around 250-300 reads per second and around 200 writes
per second.
 - Restarting a node fixes the problem for around 18-24 hours.
 - Little to no IO-wait.
 - top shows around 3-10 threads running at high CPU, but that alone
shouldn't produce a load of 20-30. (I've put a small script for mapping
those thread ids to Java thread names right after this list.)
 - It doesn't seem to be GC load: a node can start showing the symptoms
after having run only one CMS sweep, so it's not doing constant
stop-the-world GCs. (The gc.log scan sketch further down is how I plan to
verify this.)
 - top shows the C* process using 100 GB of RSS. I assume this is because
Cassandra opens all SSTables with mmap(), so the mapped pages show up in
the RSS count. (The second script after this list checks that assumption
against /proc/<pid>/smaps.)
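
For the high-CPU threads, my plan is to map the Linux thread ids that
top -H reports to Java thread names: jstack prints each thread's native id
as nid=0x..., which is just the hex form of the Linux tid. Below is a rough
sketch of that (the 5-second sampling window is arbitrary, and it assumes a
JDK jstack on the PATH that can attach to the Cassandra process):

#!/usr/bin/env python3
# Rough sketch: attribute per-thread CPU to Java thread names.
# Assumes Linux /proc and a JDK jstack that can attach to the target pid.
import os
import re
import subprocess
import sys
import time

WINDOW = 5.0  # seconds to sample CPU usage over

def thread_cpu_ticks(pid):
    """Return {tid: utime+stime clock ticks} for every thread of pid."""
    ticks = {}
    for tid in os.listdir('/proc/%d/task' % pid):
        try:
            with open('/proc/%d/task/%s/stat' % (pid, tid)) as f:
                # split after the (comm) field; utime and stime are then
                # at indexes 11 and 12
                fields = f.read().rsplit(')', 1)[1].split()
            ticks[int(tid)] = int(fields[11]) + int(fields[12])
        except (IOError, IndexError, ValueError):
            pass  # thread exited while we were reading
    return ticks

def jstack_names(pid):
    """Return {tid: thread name}; jstack shows the Linux tid as nid=0x..."""
    out = subprocess.check_output(['jstack', str(pid)]).decode('utf-8', 'replace')
    return {int(nid, 16): name for name, nid in
            re.findall(r'^"([^"]+)".*?nid=(0x[0-9a-f]+)', out, re.M)}

def main():
    pid = int(sys.argv[1])
    before = thread_cpu_ticks(pid)
    time.sleep(WINDOW)
    after = thread_cpu_ticks(pid)
    names = jstack_names(pid)
    hz = os.sysconf('SC_CLK_TCK')
    deltas = sorted(((after.get(tid, 0) - t, tid) for tid, t in before.items()),
                    reverse=True)
    for delta, tid in deltas[:15]:
        pct = 100.0 * delta / (hz * WINDOW)
        print('%5.1f%% cpu  tid=%-7d %s' % (pct, tid,
                                            names.get(tid, '<not in jstack>')))

if __name__ == '__main__':
    main()

Run it with the Cassandra pid as its only argument; if the hot threads turn
out to be SharedPool workers, the stacks in the same jstack dump should
show what they are actually doing.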
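
To check the mmap() theory about the 100 GB RSS, a second sketch that
splits the resident set into file-backed vs anonymous pages using
/proc/<pid>/smaps (needs to run as root or the cassandra user; if most of
the RSS is file-backed, it's mmap()ed SSTables rather than heap):

#!/usr/bin/env python3
# Rough sketch: split a process's RSS into file-backed vs anonymous pages.
# If most of it is file-backed, it is mmap()ed SSTables, not the Java heap.
import sys

pid = sys.argv[1]
file_kb = anon_kb = 0
backing = ''
with open('/proc/%s/smaps' % pid) as f:
    for line in f:
        first = line.split(None, 1)[0]
        if not first.endswith(':'):
            # mapping header line; the 6th field (if present) is the
            # backing path, empty for anonymous mappings; pseudo-names
            # like [heap] and [stack] are treated as anonymous here
            parts = line.split()
            backing = (parts[5] if len(parts) > 5
                       and not parts[5].startswith('[') else '')
        elif first == 'Rss:':
            kb = int(line.split()[1])
            if backing:
                file_kb += kb
            else:
                anon_kb += kb
print('file-backed rss: %6.1f GB' % (file_kb / 1048576.0))
print('anonymous rss:   %6.1f GB' % (anon_kb / 1048576.0))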

What I've done so far:
 - Rolling restart. Helped for about one day.
 - Tried triggering a manual GC on the cluster.
 - Increased heap from 8 GiB with CMS to 16 GiB with G1GC.
 - sjk-plus shows a bunch of SharedPool workers; I'm not sure what to make
of this.
 - Browsed through
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html but
didn't find any apparent cause.
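
To back the "doesn't seem to be GC" observation with numbers, I'm also
going to scan the GC log for long stop-the-world pauses. The stopped-time
line is printed the same way under CMS and G1, so it works across the heap
change; this sketch assumes -XX:+PrintGCApplicationStoppedTime is enabled,
and both the log path and the 0.2 s threshold are guesses to adjust:

#!/usr/bin/env python3
# Rough sketch: count long stop-the-world pauses in a JVM GC log.
# Assumes -XX:+PrintGCApplicationStoppedTime is set; path and threshold
# below are guesses -- adjust for your cassandra-env.sh settings.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else '/var/log/cassandra/gc.log'
threshold = 0.2  # seconds; pauses above this start to matter at this load

pause_re = re.compile(
    r'Total time for which application threads were stopped: '
    r'([0-9.]+) seconds')
count = 0
with open(log_path) as f:
    for line in f:
        m = pause_re.search(line)
        if m and float(m.group(1)) >= threshold:
            count += 1
            print(line.rstrip())
print('%d pauses >= %.1f s' % (count, threshold))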

I know that "the system shows high load" is not a very good or informative
symptom, but I don't know how to describe what's going on any better. I'd
appreciate any ideas on what to try and how to debug this further.

 - Garo
