Here are a couple of iostat snapshots showing the spikes in disk queue size
(in these cases correlating with spikes in w/s and %util):

Device:         rrqm/s   wrqm/s     r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda               0.00     5.63    0.00      2.33     0.00      63.73     27.31      0.00   0.57   0.41   0.10
sdb               0.00     0.00   48.03  17990.63  3679.73  143925.07      8.18     23.39   1.30   0.01  22.57
dm-0              0.00     0.00    0.00      0.30     0.00       2.40      8.00      0.00   2.00   0.67   0.02
dm-2              0.00     0.00   48.03  17990.63  3679.73  143925.07      8.18     23.56   1.30   0.01  22.83
dm-3              0.00     0.00    0.00      7.67     0.00      61.33      8.00      0.00   0.44   0.10   0.08



Device:         rrqm/s    wrqm/s     r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda               0.00      2.10    0.00      1.33     0.00      27.47     20.60      0.00   0.25   0.18   0.02
sdb               0.00  16309.00  109.43   2714.23  2609.87  152186.40     54.82     11.44   4.05   0.08  23.54
dm-0              0.00      0.00    0.00      0.10     0.00       0.80      8.00      0.00   0.00   0.00   0.00
dm-2              0.00      0.00  109.43  19023.30  2609.87  152186.40      8.09    273.89  14.30   0.01  23.64
dm-3              0.00      0.00    0.00      3.33     0.00      26.67      8.00      0.00   0.25   0.07   0.02
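
For reference, these were captured with plain iostat; the awk is just a
convenience to pull out the header and the sdb lines (a sketch):

    iostat -xz 1 | awk '/^Device/ || /^sdb/'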


-- Eric

On Wed, Jun 14, 2017 at 11:17 PM, Eric Pederson <eric...@gmail.com> wrote:

> Using cassandra-stress with the out of the box schema I am seeing around
> 140k rows/second throughput using 1 client on each of 3 client machines.
> On the servers:
>
>    - CPU utilization: 43% usr/20% sys, 55%/28%, 70%/10% (the last pair is
>    for the older box)
>    - Inbound network traffic: 174 Mbps, 190 Mbps, 178 Mbps
>    - Disk writes/sec: ~10k each server
>    - Disk utilization is in the low single digits but spikes up to 50%
>    - Disk queue size is in the low single digits but spikes up into the
>    mid hundreds; I even saw spikes in the thousands.  I had not noticed
>    this before.
>
> The disk stats come from iostat -xz 1.   Given the low reported
> utilization %s I would not expect to see any disk queue buildup, even low
> single digits.
>
> Going to 2 cassandra-stress clients per machine the throughput dropped to
> 133k rows/sec.
>
>    - CPU utilization: 13% usr/5% sys, 15%/25%, 40%/22% on the older box
>    - Inbound network RX: 100Mbps, 125Mbps, 120Mbps
>    - Disk utilization is a little lower, but with the same spiky behavior
>
> Going to 3 cassandra-stress clients per machine the throughput dropped to
> 110k rows/sec
>
>    - CPU utilization: 15% usr/20% sys,  15%/20%, 40%/20% on the older box
>    - Inbound network RX dropped to 130 Mbps
>    - Disk utilization stayed roughly the same
>
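> For reference, "N clients per machine" above just means N cassandra-stress
> processes per box, each writing a disjoint partition range - roughly (a
> sketch; node names and the thread count are placeholders):
>
>     cassandra-stress write n=10000000 cl=ONE -rate threads=96 \
>         -node node1,node2,node3 -pop seq=1..10000000
>     cassandra-stress write n=10000000 cl=ONE -rate threads=96 \
>         -node node1,node2,node3 -pop seq=10000001..20000000
>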
> I noticed that with the standard cassandra-stress schema GC is not an
> issue.   But with my application-specific schema there is a lot of GC on
> the slower box.  Also with the application-specific schema I can't seem to
> get past 36k rows/sec.   The application schema has 64 columns (mostly
> ints) and the key is (date,sequence#).   The standard stress schema has a
> lot fewer columns and no clustering column.
>
> Thanks,
>
>
>
> -- Eric
>
> On Wed, Jun 14, 2017 at 1:47 AM, Eric Pederson <eric...@gmail.com> wrote:
>
>> Shoot - I didn't see that one.  I subscribe to the digest but was
>> focusing on the direct replies and accidentally missed Patrick's and Jeff
>> Jirsa's messages.  Sorry about that...
>>
>> I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
>> custom C++ application for my ingestion testing.   My default setting for
>> my custom client application is 96 threads, and then by default I run one
>> client application process on each of 3 machines.  I tried
>> doubling/quadrupling the number of client threads (and doubling/tripling
>> the number of client processes but keeping the threads per process the
>> same) but didn't see any change.  If I recall correctly, I started getting
>> timeouts once I went much beyond concurrent_writes, which is 384 (for a
>> 48-CPU box) - at around 500 threads per client machine.  I'll try again to
>> be sure.
>>
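>> To be sure, I'll probably sweep the thread count with something like this
>> (a sketch - node names are placeholders; 384 matches concurrent_writes on
>> the 48-CPU boxes):
>>
>>     for t in 96 192 384 500; do
>>         cassandra-stress write n=10000000 cl=ONE -rate threads=$t \
>>             -node node1,node2,node3
>>     done
>>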
>> For the purposes of this conversation I will try to always use
>> cassandra-stress to keep the number of unknowns limited.  I'll run
>> more cassandra-stress clients tomorrow in line with Patrick's 3-5 per
>> server recommendation.
>>
>> Thanks!
>>
>>
>> -- Eric
>>
>> On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>
>>> Did you try adding more client stress nodes as Patrick recommended?
>>>
>>> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the two
>>>> newer machines' overall processing, compared to 18% on the slow machine.
>>>>
>>>> I took that machine out of the cluster completely and recreated the
>>>> keyspaces.  The ingest tests now run slightly faster (!).   I would have
>>>> expected a linear slowdown since the load is fairly balanced across
>>>> partitions.  GC appears to be the bottleneck in the 3-server
>>>> configuration.  But even in the two-server configuration the
>>>> CPU/disk/network is still not fully utilized (the closest is CPU at
>>>> ~45% on one ingest test).  nodetool tpstats shows only blips of
>>>> queueing.
>>>>
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>>>>> we're getting but I'm still curious about the bottleneck.
>>>>>
>>>>> The big thing that sticks out is that one of the nodes is logging frequent
>>>>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in the
>>>>> cluster have identical Cassandra configuration, but the node that is
>>>>> logging frequent GCs is an older machine with a slower CPU and SSD.  It
>>>>> logs frequent GCInspectors both under load and when compacting but
>>>>> otherwise unloaded.
>>>>>
>>>>> My theory is that the other two nodes have similar GC frequency
>>>>> (because they are seeing the same basic load), but because they are faster
>>>>> machines, they don't spend as much time per GC and don't cross the
>>>>> GCInspector threshold.  Does that sound plausible?   nodetool tpstats
>>>>> doesn't show any queueing in the system.
>>>>>
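>>>>> For reference, I'm eyeballing GCInspector frequency and duration per node
>>>>> with something like this (the log path is an assumption - adjust for your
>>>>> install):
>>>>>
>>>>>     grep -c GCInspector /var/log/cassandra/system.log
>>>>>     grep GCInspector /var/log/cassandra/system.log | tail -5
>>>>>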
>>>>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>>
>>>>> The slow node (ultva03) spends a disproportionate amount of time in GC.
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Due to a cut and paste error those flamegraphs were a recording of
>>>>>> the whole system, not just Cassandra.    Throughput is approximately 30k
>>>>>> rows/sec.
>>>>>>
>>>>>> Here are the graphs with just the Cassandra PID:
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>>>>
>>>>>>
>>>>>> And here are graphs during a cqlsh COPY FROM to the same table, using
>>>>>> real data, MAXBATCHSIZE=2.  Throughput is good at approximately
>>>>>> 110k rows/sec.
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>>>
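>>>>>> For reference, the COPY FROM above was run along these lines (a sketch -
>>>>>> keyspace, table, file and host names are placeholders):
>>>>>>
>>>>>>     cqlsh -e "COPY myks.cars FROM 'cars.csv' WITH MAXBATCHSIZE=2" node1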
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Totally understood :)
>>>>>>>
>>>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>>>>> include all of the CPUs.  Actually most of them were already set that
>>>>>>> way (for example, 0000ffff,ffffffff) - it might be because irqbalance
>>>>>>> is running.  But for some reason the interrupts are all being handled
>>>>>>> on CPU 0 anyway.
>>>>>>>
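>>>>>>> For reference, this is how I'm checking where the interrupts land and
>>>>>>> what the masks are set to (a sketch):
>>>>>>>
>>>>>>>     # per-CPU interrupt counts - the CPU0 column dominating is the symptom
>>>>>>>     cat /proc/interrupts
>>>>>>>     # the affinity masks irqbalance (or we) have set
>>>>>>>     grep . /proc/irq/*/smp_affinity | head
>>>>>>>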
>>>>>>> I see this in /var/log/dmesg on the machines:
>>>>>>>
>>>>>>>>
>>>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>>>> Enabled IRQ remapping in xapic mode
>>>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>>>
>>>>>>>
>>>>>>> In a reply to one of the comments on the smp-affinity article, the
>>>>>>> author says:
>>>>>>>
>>>>>>>
>>>>>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>>>>> handle up to eight cores. If you have more than eight cores, kernel
>>>>>>>> will not configure IO-APIC to spread interrupts. Thus the trick I
>>>>>>>> described in the article will not work.
>>>>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>>>
>>>>>>>
>>>>>>> I'm not sure if either of them is relevant to my situation.
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You shouldn't need a kernel recompile.  Check out the section
>>>>>>>> "Simple solution for the problem" in
>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
>>>>>>>> You can balance your requests across up to 8 CPUs.
>>>>>>>>
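>>>>>>>> The "simple solution" there boils down to writing a CPU mask into each
>>>>>>>> IRQ's smp_affinity, e.g. (a sketch - the IRQ number is a placeholder;
>>>>>>>> mask 2 means CPU 1):
>>>>>>>>
>>>>>>>>     echo 2 > /proc/irq/30/smp_affinity   # run as root
>>>>>>>>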
>>>>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>>>>> something and my brain doesn't multitask well :)
>>>>>>>>
>>>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Jonathan -
>>>>>>>>>
>>>>>>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>>>>>>> interrupts.  I don't think I'm going to get the OK to compile a new
>>>>>>>>> kernel for them to balance the interrupts across CPUs, but to mitigate
>>>>>>>>> the problem I taskset the Cassandra process to run on all CPUs except
>>>>>>>>> 0.  It didn't change the performance though.  Let me know if you think
>>>>>>>>> it's crucial that we balance the interrupts across CPUs and I can try
>>>>>>>>> to lobby for a new kernel.
>>>>>>>>>
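>>>>>>>>> The pinning itself was just taskset, something like this (the pgrep
>>>>>>>>> pattern is a sketch - whatever finds the Cassandra PID works):
>>>>>>>>>
>>>>>>>>>     taskset -cp 1-47 $(pgrep -f CassandraDaemon)
>>>>>>>>>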
>>>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>>>>> into a table representative of what we are going to be using.  Rows in
>>>>>>>>> this table are also roughly 200 bytes, across 64 columns, with the
>>>>>>>>> primary key (date, sequence_number).  Cassandra-stress was run on 3
>>>>>>>>> separate client machines.  Using cassandra-stress to write to this
>>>>>>>>> table I see the same thing: neither disk, CPU, nor network is fully
>>>>>>>>> utilized.
>>>>>>>>>
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>>>>
>>>>>>>>> Re: GC: In the stress run with the parameters above, two of the
>>>>>>>>> three nodes log zero or one GCInspectors.  On the other hand, the
>>>>>>>>> 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms
>>>>>>>>> each time.  I found out that the 3rd machine actually has different
>>>>>>>>> specs from the other two.  It's an older box with the same RAM but
>>>>>>>>> fewer CPUs (32 instead of 48), a slower SSD and slower memory.  The
>>>>>>>>> Cassandra configuration is exactly the same.  I tried running
>>>>>>>>> Cassandra with only 32 CPUs on the newer boxes to see if that would
>>>>>>>>> cause them to GC pause more, but it didn't.
>>>>>>>>>
>>>>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>>>>> batch size to 2 in order to keep the logs clean.  That also reduced
>>>>>>>>> the throughput from around 100k rows/second to 32k rows/sec.  I've
>>>>>>>>> been doing ingestion tests using cassandra-stress, cqlsh COPY FROM
>>>>>>>>> and a custom C++ application.  In most of those tests I've been using
>>>>>>>>> a batch size of around 20 (unlogged, with all rows in a batch sharing
>>>>>>>>> the same partition key).  However, that fills the logs with batch
>>>>>>>>> size warnings.  I was going to raise the batch size warning threshold
>>>>>>>>> but the docs scared me away from doing that.  Given that we're using
>>>>>>>>> unlogged, same-partition batches, is it safe to raise the warning
>>>>>>>>> limit?  Actually, cqlsh COPY FROM has very good throughput using a
>>>>>>>>> small batch size, but I can't get that same throughput in
>>>>>>>>> cassandra-stress or my C++ app with a batch size of 2.
>>>>>>>>>
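>>>>>>>>> For reference, the settings in question are batch_size_warn_threshold_in_kb
>>>>>>>>> and batch_size_fail_threshold_in_kb (the 3.0 defaults are 5 KB warn /
>>>>>>>>> 50 KB fail, if I'm reading the yaml right).  Checking them with
>>>>>>>>> something like this - the yaml path is an assumption:
>>>>>>>>>
>>>>>>>>>     grep -E 'batch_size_(warn|fail)_threshold_in_kb' /etc/cassandra/cassandra.yaml
>>>>>>>>>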
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- Eric
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <
>>>>>>>>> j...@jonhaddad.com> wrote:
>>>>>>>>>
>>>>>>>>>> How many CPUs are you using for interrupts?
>>>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>>>>
>>>>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>>>>> spending its time?
>>>>>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>>
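>>>>>>>>>> The rough recipe from that post looks like this (a sketch - it needs
>>>>>>>>>> perf, perf-map-agent for JVM symbols, and -XX:+PreserveFramePointer
>>>>>>>>>> on the Cassandra JVM; <pid> is the Cassandra process):
>>>>>>>>>>
>>>>>>>>>>     perf record -F 99 -p <pid> -g -- sleep 60
>>>>>>>>>>     perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
>>>>>>>>>>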
>>>>>>>>>> Are you tracking GC pauses?
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all:
>>>>>>>>>>>
>>>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing.
>>>>>>>>>>> One of the things that I'm testing is ingestion throughput.  My
>>>>>>>>>>> server setup is:
>>>>>>>>>>>
>>>>>>>>>>>    - 3 node cluster
>>>>>>>>>>>    - SSD data (both commit log and sstables are on the same
>>>>>>>>>>>    disk)
>>>>>>>>>>>    - 64 GB RAM per server
>>>>>>>>>>>    - 48 cores per server
>>>>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>>>>    - 48 GB heap using G1GC
>>>>>>>>>>>    - 1 Gbps NICs
>>>>>>>>>>>
>>>>>>>>>>> Since I'm using SSD I've tried tuning the following (one at a
>>>>>>>>>>> time) but none seemed to make a lot of difference:
>>>>>>>>>>>
>>>>>>>>>>>    - concurrent_writes=384
>>>>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>>>>    - concurrent_compactors=8
>>>>>>>>>>>
>>>>>>>>>>> I am currently doing ingestion tests, sending data from 3 clients on
>>>>>>>>>>> the same subnet, using cassandra-stress.  The tests use CL=ONE and
>>>>>>>>>>> RF=2.
>>>>>>>>>>>
>>>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk
>>>>>>>>>>> using a large enough column size and the standard five column
>>>>>>>>>>> cassandra-stress schema.  For example, -col size=fixed(400)
>>>>>>>>>>> will saturate the disk and compactions will start falling behind.
>>>>>>>>>>>
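>>>>>>>>>>> That saturating run looks roughly like this (a sketch - node names
>>>>>>>>>>> and the thread count are placeholders; cl and replication match the
>>>>>>>>>>> test setup above):
>>>>>>>>>>>
>>>>>>>>>>>     cassandra-stress write n=10000000 cl=ONE -col 'size=FIXED(400)' \
>>>>>>>>>>>         -schema 'replication(factor=2)' -rate threads=96 \
>>>>>>>>>>>         -node node1,node2,node3
>>>>>>>>>>>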
>>>>>>>>>>> One of our main tables has a row size of approximately 200
>>>>>>>>>>> bytes, across 64 columns.  When ingesting this table I don't see any
>>>>>>>>>>> resource saturation.  Disk utilization is around 10-15% per
>>>>>>>>>>> iostat.  Incoming network traffic on the servers is around
>>>>>>>>>>> 100-300 Mbps.  CPU utilization is around 20-70%.  nodetool
>>>>>>>>>>> tpstats shows mostly zeros with occasional spikes around 500 in
>>>>>>>>>>> MutationStage.
>>>>>>>>>>>
>>>>>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes
>>>>>>>>>>> about 4 minutes, with mean Latency 4.5ms, Total GC time of 21 secs,
>>>>>>>>>>> Avg GC time 173 ms.
>>>>>>>>>>>
>>>>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>>>>> But I'm curious to know where the bottleneck is.  There's no
>>>>>>>>>>> resource saturation and nodetool tpstats shows only occasional
>>>>>>>>>>> brief queueing.  Is the rest just expected latency inside of
>>>>>>>>>>> Cassandra?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> -- Eric
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
