You aren't using counters by chance?
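
A rough way to check from a shell (a sketch, assuming cqlsh access and
the pre-3.0 system schema tables that 2.x uses) would be something like:

    # counter columns carry the CounterColumnType validator in the
    # 2.x system schema tables, so grep for it across all columns
    cqlsh -e "SELECT keyspace_name, columnfamily_name, validator FROM system.schema_columns;" \
        | grep -i CounterColumnType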

Regards,
Ryan Svihla

On Jul 22, 2016, 2:00 PM -0500, Mark Rose <markr...@markrose.ca> wrote:

> Hi Garo,
>
> Are you using XFS or Ext4 for data? XFS is much better at deleting
> large files, such as may happen after a compaction. If you have 26 TB
> in just two tables, I bet you have some massive sstables, which may
> take a while for Ext4 to delete and may be causing the stalls. The
> underlying block layers will not show high IO-wait. See if the stall
> times line up with large compactions in system.log.
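>
> A rough way to line those up (a sketch, assuming the default package
> paths for logs and data) would be something like:
>
>     # compaction completions are logged with their size and duration;
>     # compare these timestamps against the stall times
>     grep -i 'compacted' /var/log/cassandra/system.log | tail -n 20
>
>     # and look for unusually large sstables still on disk
>     find /var/lib/cassandra/data -name '*-Data.db' -size +50G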
>
> If you must use Ext4, another way to avoid issues with massive
> sstables is to run more, smaller instances.
>
> As an aside, for the amount of reads/writes you're doing, I've found
> using c3/m3 instances with the commit log on the ephemeral storage and
> data on st1 EBS volumes to be much more cost-effective. It's something
> to look into if you haven't already.
>
> -Mark
>
> On Fri, Jul 22, 2016 at 8:10 AM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
>
> > After a few days I've also tried disabling Linux kernel transparent
> > huge page defragmentation (echo never >
> > /sys/kernel/mm/transparent_hugepage/defrag) and turning coalescing
> > off (otc_coalescing_strategy: DISABLED), but neither did any good.
> > I'm using LCS, there are no big GC pauses, and I have set
> > "concurrent_compactors: 5" (the machines have 16 CPUs), but there
> > are usually no compactions running when the load spike comes.
> > "nodetool tpstats" shows no active threads in any pool except
> > Native-Transport-Requests (usually 0-4) and perhaps ReadStage
> > (usually 0-1).
> >
> > The symptoms are the same: after about 12-24 hours an increasing
> > number of nodes start to show short CPU load spikes, and this
> > affects the median read latencies. I ran dstat while a load spike
> > was already under way (see screenshot
> > http://i.imgur.com/B0S5Zki.png), but no column other than the load
> > itself shows any major change, except the system/kernel CPU usage.
> >
> > All further ideas on how to debug this are greatly appreciated.
> >
> > On Wed, Jul 20, 2016 at 7:13 PM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
> >
> > > I recently upgraded our cluster to 2.2.7, and after putting the
> > > cluster under production load the instances started to show high
> > > load (as shown by uptime) without any apparent reason, and I'm
> > > not quite sure what could be causing it.
> > >
> > > We are running on i2.4xlarge, so we have 16 cores, 120 GB of RAM,
> > > and four 800 GB SSDs (set up as an LVM stripe into one big
> > > logical volume), running kernel 3.13.0-87-generic on HVM
> > > virtualisation. The cluster has 26 TiB of data stored in two
> > > tables.
> > >
> > > Symptoms:
> > > - High load, sometimes up to 30 for a short duration of a few
> > > minutes; then the load drops back to the cluster average of 3-4.
> > > - Instances might have one compaction running, but might not
> > > have any compactions at all.
> > > - Each node is serving around 250-300 reads per second and
> > > around 200 writes per second.
> > > - Restarting a node fixes the problem for around 18-24 hours.
> > > - No or very little IO-wait.
> > > - top shows that around 3-10 threads are running at high CPU,
> > > but that alone should not cause a load of 20-30.
> > > - It doesn't seem to be GC load: a system starts to show
> > > symptoms after it has run only one CMS sweep, so it's not doing
> > > constant stop-the-world GCs.
> > > - top shows that the C* process uses 100 GB of RSS memory. I
> > > assume this is because Cassandra opens all SSTables with
> > > mmap(), so they show up in the RSS count.
> > >
> > > What I've done so far:
> > > - Rolling restart. Helped for about one day.
> > > - Tried triggering a manual GC across the cluster.
> > > - Increased the heap from 8 GiB with CMS to 16 GiB with G1GC.
> > > - sjk-plus shows a bunch of SharedPool workers. Not sure what to
> > > make of this.
> > > - Browsed through
> > > https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
> > > but didn't find anything apparent.
> > >
> > > I know that the general symptom of "the system shows high load"
> > > is not very good or informative, but I don't know how to better
> > > describe what's going on. I appreciate all ideas on what to try
> > > and how to debug this further.
> > >
> > > - Garo