Hey Garo, 

I see you are using 2.2.x. Do you have compression enabled on commit logs by 
any chance? If so, try to disable it and see if that helps.
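
In case it's useful, this is roughly what the relevant block looks like in 
cassandra.yaml on 2.2 (just a sketch; your file may differ):

    # commit log compression is on when this block is present
    commitlog_compression:
        - class_name: LZ4Compressor
    # commenting the block out (and restarting the node) turns it off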

Right after migrating from 2.1.x to 2.2.x, I remember seeing the behavior you 
described on 10k SAS disks when commit log compression was enabled with LZ4. 
After disabling compression on the commit logs, the issue was gone on my side.

   J. 

--
Julien Anguenot (@anguenot)

> On Jul 22, 2016, at 2:10 PM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
> 
> After a few days I've also tried disabling Linux kernel huge page 
> defragmentation (echo never > /sys/kernel/mm/transparent_hugepage/defrag) and 
> turning coalescing off (otc_coalescing_strategy: DISABLED), but neither did 
> any good. I'm using LCS, there are no big GC pauses, and I have set 
> "concurrent_compactors: 5" (the machines have 16 CPUs), but there are usually 
> no compactions running when the load spike comes. "nodetool tpstats" shows no 
> active thread pools except Native-Transport-Requests (usually 0-4) and 
> perhaps ReadStage (usually 0-1).
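> 
> For reference, this is roughly how I double-check those settings on a node (a 
> sketch; the cassandra.yaml path depends on the install):
> 
>     # transparent hugepage defrag should report [never]
>     cat /sys/kernel/mm/transparent_hugepage/defrag
>     # coalescing and compactor settings actually in the config
>     grep -E 'otc_coalescing_strategy|concurrent_compactors' /etc/cassandra/cassandra.yaml
>     # thread pool activity snapshot
>     nodetool tpstats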
> 
> The symptoms are the same: after about 12-24 hours an increasing number of 
> nodes start to show short CPU load spikes, and this affects the median read 
> latencies. I ran dstat while a load spike was already under way (see 
> screenshot http://i.imgur.com/B0S5Zki.png), but no column other than the load 
> itself shows any major change, except the system/kernel CPU usage.
> 
> All further ideas on how to debug this are greatly appreciated.
> 
> 
> On Wed, Jul 20, 2016 at 7:13 PM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
> I recently upgraded our cluster to 2.2.7, and after putting the cluster under 
> production load the instances started to show high load (as reported by 
> uptime) without any apparent reason. I'm not quite sure what could be 
> causing it.
> 
> We are running on i2.4xlarge instances, so we have 16 cores, 120 GB of RAM 
> and four 800 GB SSDs (set up as an LVM stripe into one big lvol). The kernel 
> is 3.13.0-87-generic on HVM virtualisation. The cluster has 26 TiB of data 
> stored in two tables.
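> 
> For completeness, the striped lvol was built roughly like this (device and 
> volume names here are hypothetical):
> 
>     pvcreate /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>     vgcreate data_vg /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>     # 4-way stripe across the SSDs, one big logical volume
>     lvcreate -i 4 -I 256 -l 100%FREE -n data_lv data_vg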
> 
> Symptoms:
>  - High load, sometimes up to 30 for a short duration of a few minutes, then 
> the load drops back to the cluster average of 3-4.
>  - Instances might have one compaction running, but sometimes have none at 
> all.
>  - Each node is serving around 250-300 reads per second and around 200 writes 
> per second.
>  - Restarting a node fixes the problem for around 18-24 hours.
>  - No or very little IO-wait.
>  - top shows around 3-10 threads running at high CPU, but that alone should 
> not cause a load of 20-30.
>  - Doesn't seem to be GC load: a system starts to show symptoms after having 
> run only one CMS sweep, so it's not doing constant stop-the-world GCs.
>  - top shows that the C* process uses 100 GB of RSS memory. I assume this is 
> because Cassandra opens all SSTables with mmap(), so they show up in the RSS 
> count.
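> 
> One way I could sanity-check that assumption (a rough sketch; the PID lookup 
> by main class name is just an example):
> 
>     PID=$(pgrep -f CassandraDaemon)
>     # resident vs. anonymous pages across all mappings; the difference is
>     # mostly file-backed mmap()'d SSTable data rather than heap
>     awk '/^Rss:/ {r+=$2} /^Anonymous:/ {a+=$2} END {print r/1024 " MB resident, " a/1024 " MB anonymous"}' /proc/$PID/smaps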
> 
> What I've done so far:
>  - Rolling restart. Helped for about one day.
>  - Tried triggering a manual GC on the cluster.
>  - Increased heap from 8 GiB with CMS to 16 GiB with G1GC (roughly as in the 
> snippet after this list).
>  - sjk-plus shows a bunch of SharedPool workers. Not sure what to make of this.
>  - Browsed over 
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html but didn't 
> find anything apparent.
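> 
> For the heap/GC change mentioned above, the edit in cassandra-env.sh looked 
> roughly like this (a sketch of the change, not the whole file; exact G1 flags 
> may vary):
> 
>     MAX_HEAP_SIZE="16G"
>     # commented out the default CMS flags (-XX:+UseConcMarkSweepGC etc.) and added:
>     JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>     JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"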
> 
> I know that "system shows high load" is not a very good or informative 
> symptom, but I don't know how to better describe what's going on. I 
> appreciate any ideas on what to try and how to debug this further.
> 
>  - Garo
> 
> 
