You aren't using counters, by chance?

regards,

Ryan Svihla

On Jul 22, 2016, 2:00 PM -0500, Mark Rose <markr...@markrose.ca>, wrote:
> Hi Garo,
>
> Are you using XFS or Ext4 for data? XFS is much better at deleting
> large files, such as the obsolete sstables left behind after a
> compaction. If you have 26 TB in just two tables, I bet you have some
> massive sstables that take Ext4 a while to delete, and that could be
> causing the stalls. The underlying block layer will not show high
> IO-wait for this. See if the stall times line up with large
> compactions in system.log.
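>
> A quick way to check the correlation (a minimal sketch, assuming the
> default packaged paths /var/log/cassandra and /var/lib/cassandra;
> adjust for your install):
>
>     # Recent compaction completions, with timestamps and sizes, to
>     # line up against the stall times you observed
>     grep 'Compacted' /var/log/cassandra/system.log | tail -20
>
>     # Largest sstable data files on disk; multi-GB files are the ones
>     # Ext4 struggles to unlink quickly
>     find /var/lib/cassandra/data -name '*-Data.db' -printf '%s %p\n' \
>         | sort -rn | head -10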
>
> If you must use Ext4, another way to avoid issues with massive
> sstables is to run more, smaller instances.
>
> As an aside, for the amount of reads/writes you're doing, I've found
> using c3/m3 instances with the commit log on the ephemeral storage and
> data on st1 EBS volumes to be much more cost effective. It's something
> to look into if you haven't already.
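>
> For reference, that layout in cassandra.yaml looks roughly like this
> (mount points are illustrative, assuming the ephemeral disk is mounted
> at /mnt/ephemeral and the st1 volume at /mnt/ebs):
>
>     # Commit log on the low-latency ephemeral SSD
>     commitlog_directory: /mnt/ephemeral/cassandra/commitlog
>
>     # Data files on the cheaper throughput-optimized st1 EBS volume
>     data_file_directories:
>         - /mnt/ebs/cassandra/data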
>
> -Mark
>
> On Fri, Jul 22, 2016 at 8:10 AM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
> > After a few days I've also tried disabling Linux kernel huge page
> > defragmentation (echo never > /sys/kernel/mm/transparent_hugepage/defrag)
> > and turning coalescing off (otc_coalescing_strategy: DISABLED), but
> > neither did any good. I'm using LCS, there are no big GC pauses, and I
> > have set "concurrent_compactors: 5" (machines have 16 CPUs), but there
> > are usually no compactions running when the load spike comes. "nodetool
> > tpstats" shows no active thread pools except Native-Transport-Requests
> > (usually 0-4) and perhaps ReadStage (usually 0-1).
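> >
> > In case it helps anyone else reproduce this, here's roughly how I
> > verified those settings (standard sysfs paths):
> >
> >     # Current THP modes; the bracketed value is the active one
> >     cat /sys/kernel/mm/transparent_hugepage/enabled \
> >         /sys/kernel/mm/transparent_hugepage/defrag
> >
> >     # THP and memory-compaction counters; compact_stall climbing
> >     # during a spike would point at the kernel compacting memory
> >     grep -E 'thp_|compact_' /proc/vmstat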
> >
> > The symptoms are the same: after about 12-24 hours an increasing number
> > of nodes start to show short CPU load spikes, and this affects the median
> > read latencies. I ran dstat while a load spike was already under way (see
> > screenshot http://i.imgur.com/B0S5Zki.png), but no column other than the
> > load itself shows any major change, except the system/kernel CPU usage.
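> >
> > Since the growth is in system/kernel CPU, something along these lines
> > should show which threads and kernel paths are hot while a spike is
> > under way (a rough sketch; perf comes from the linux-tools package):
> >
> >     # Per-thread %usr vs %system breakdown for the Cassandra process
> >     pidstat -u -t -p $(pgrep -f CassandraDaemon) 5 3
> >
> >     # Sample all CPUs with call graphs for 30s, then look at which
> >     # kernel symbols the time goes to
> >     perf record -a -g -- sleep 30 && perf report --sort symbol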
> >
> > All further ideas on how to debug this are greatly appreciated.
> >
> >
> > On Wed, Jul 20, 2016 at 7:13 PM, Juho Mäkinen <juho.maki...@gmail.com>
> > wrote:
> > >
> > > I just recently upgraded our cluster to 2.2.7, and after putting the
> > > cluster under production load the instances started to show high load
> > > (as shown by uptime) without any apparent reason. I'm not quite sure
> > > what could be causing it.
> > >
> > > We are running on i2.4xlarge, so we have 16 cores, 120 GB of RAM, and
> > > four 800 GB SSDs (striped with LVM into one big logical volume). We run
> > > kernel 3.13.0-87-generic on HVM virtualisation. The cluster has 26 TiB
> > > of data stored in two tables.
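> > >
> > > For the curious, the stripe was created with something along these
> > > lines (stripe size and volume names are illustrative):
> > >
> > >     # 4-way LVM stripe across the four ephemeral SSDs
> > >     lvcreate -i 4 -I 128 -l 100%FREE -n data vg_cassandra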
> > >
> > > Symptoms:
> > > - High load, sometimes up to 30 for a short duration of a few minutes,
> > > then the load drops back to the cluster average of 3-4.
> > > - Instances might have one compaction running, but often have none.
> > > - Each node is serving around 250-300 reads per second and around 200
> > > writes per second.
> > > - Restarting a node fixes the problem for around 18-24 hours.
> > > - No or very little IO-wait.
> > > - top shows that around 3-10 threads are running at high CPU, but that
> > > alone should not cause a load of 20-30 (see the thread-mapping sketch
> > > after this list).
> > > - Doesn't seem to be GC load: a system starts to show symptoms after
> > > having run only one CMS sweep, so it's not doing constant
> > > stop-the-world GCs.
> > > - top shows that the C* process uses 100 GB of RSS memory. I assume
> > > this is because Cassandra opens all SSTables with mmap(), so the
> > > mapped pages show up in the RSS count.
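> > >
> > > Here's roughly how I've been mapping the hot threads from top to Java
> > > threads (<tid> is the decimal thread id from top; jstack prints it as
> > > a hex nid):
> > >
> > >     # Thread-level CPU view of the Cassandra process
> > >     top -H -p $(pgrep -f CassandraDaemon)
> > >
> > >     # Convert a hot thread id to hex and find it in a stack dump
> > >     printf '%x\n' <tid>
> > >     jstack $(pgrep -f CassandraDaemon) | grep -A 20 'nid=0x<hex-tid>'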
> > >
> > > What I've done so far:
> > > - Rolling restart. Helped for about one day.
> > > - Tried triggering a manual GC on the cluster nodes.
> > > - Increased heap from 8 GiB with CMS to 16 GiB with G1GC.
> > > - sjk-plus shows a bunch of SharedPool workers. Not sure what to make
> > > of this (the command I used is noted after this list).
> > > - Browsed over
> > > https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html but
> > > didn't find any apparent cause.
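> > >
> > > (The sjk-plus listing came from something like
> > >     java -jar sjk.jar ttop -p $(pgrep -f CassandraDaemon) -o CPU -n 20
> > > which sorts the Java threads by CPU usage; the jar path may differ.)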
> > >
> > > I know that the general symptom of "system shows high load" is not very
> > > descriptive or informative, but I don't know how to better describe
> > > what's going on. I'd appreciate any ideas on what to try and how to
> > > debug this further.
> > >
> > > - Garo
> > >
> >
