Here are a couple of iostat snapshots showing the spikes in disk queue size (in these cases correlating with spikes in w/s and %util):
Device:   rrqm/s    wrqm/s     r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda         0.00      5.63    0.00      2.33      0.00      63.73     27.31      0.00    0.57   0.41   0.10
sdb         0.00      0.00   48.03  17990.63   3679.73  143925.07      8.18     23.39    1.30   0.01  22.57
dm-0        0.00      0.00    0.00      0.30      0.00       2.40      8.00      0.00    2.00   0.67   0.02
dm-2        0.00      0.00   48.03  17990.63   3679.73  143925.07      8.18     23.56    1.30   0.01  22.83
dm-3        0.00      0.00    0.00      7.67      0.00      61.33      8.00      0.00    0.44   0.10   0.08

Device:   rrqm/s    wrqm/s     r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda         0.00      2.10    0.00      1.33      0.00      27.47     20.60      0.00    0.25   0.18   0.02
sdb         0.00  16309.00  109.43   2714.23   2609.87  152186.40     54.82     11.44    4.05   0.08  23.54
dm-0        0.00      0.00    0.00      0.10      0.00       0.80      8.00      0.00    0.00   0.00   0.00
dm-2        0.00      0.00  109.43  19023.30   2609.87  152186.40      8.09    273.89   14.30   0.01  23.64
dm-3        0.00      0.00    0.00      3.33      0.00      26.67      8.00      0.00    0.25   0.07   0.02

-- Eric

On Wed, Jun 14, 2017 at 11:17 PM, Eric Pederson <eric...@gmail.com> wrote:

> Using cassandra-stress with the out-of-the-box schema I am seeing around 140k rows/second throughput using 1 client on each of 3 client machines. On the servers:
>
>    - CPU utilization: 43% usr/20% sys, 55%/28%, 70%/10% (the last number is the older box)
>    - Inbound network traffic: 174 Mbps, 190 Mbps, 178 Mbps
>    - Disk writes/sec: ~10k each server
>    - Disk utilization is in the low single digits but spikes up to 50%
>    - Disk queue size is in the low single digits but spikes up into the mid hundreds. I even saw it in the thousands. I had not noticed this before.
>
> The disk stats come from iostat -xz 1. Given the low reported utilization %s I would not expect to see any disk queue buildup, even in the low single digits.
>
> Going to 2 cassandra-stress clients per machine the throughput dropped to 133k rows/sec.
>
>    - CPU utilization: 13% usr/5% sys, 15%/25%, 40%/22% on the older box
>    - Inbound network RX: 100 Mbps, 125 Mbps, 120 Mbps
>    - Disk utilization is a little lower, but with the same spiky behavior
>
> Going to 3 cassandra-stress clients per machine the throughput dropped to 110k rows/sec.
>
>    - CPU utilization: 15% usr/20% sys, 15%/20%, 40%/20% on the older box
>    - Inbound network RX dropped to 130 Mbps
>    - Disk utilization stayed roughly the same
>
> I noticed that with the standard cassandra-stress schema GC is not an issue. But with my application-specific schema there is a lot of GC on the slower box. Also with the application-specific schema I can't seem to get past 36k rows/sec. The application schema has 64 columns (mostly ints) and the key is (date, sequence#). The standard stress schema has a lot fewer columns and no clustering column.
>
> Thanks,
>
> -- Eric
>
> On Wed, Jun 14, 2017 at 1:47 AM, Eric Pederson <eric...@gmail.com> wrote:
>
>> Shoot - I didn't see that one. I subscribe to the digest but was focusing on the direct replies and accidentally missed Patrick's and Jeff Jirsa's messages. Sorry about that...
>>
>> I've been using a combination of cassandra-stress, cqlsh COPY FROM and a custom C++ application for my ingestion testing. My default setting for my custom client application is 96 threads, and by default I run one client application process on each of 3 machines. I tried doubling/quadrupling the number of client threads (and doubling/tripling the number of client processes while keeping the threads per process the same) but didn't see any change.
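(For concreteness, running several stress client processes per machine with disjoint row ranges - as described in the quoted message above - might look like the sketch below. The thread counts, node names and sequence ranges are illustrative, not the exact commands used in these tests.)

    # Two cassandra-stress processes on one client machine, each writing a
    # disjoint sequence range; hosts, thread counts and ranges are placeholders.
    cassandra-stress write n=5000000 cl=ONE -pop seq=1..5000000 \
        -rate threads=96 -node server1,server2,server3 &
    cassandra-stress write n=5000000 cl=ONE -pop seq=5000001..10000000 \
        -rate threads=96 -node server1,server2,server3 &
    wait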
>> If I recall correctly I started getting timeouts after I went much beyond concurrent_writes, which is 384 (for a 48-CPU box) - meaning at 500 threads per client machine I started seeing timeouts. I'll try again to be sure.
>>
>> For the purposes of this conversation I will try to always use cassandra-stress to keep the number of unknowns limited. I will run more cassandra-stress clients tomorrow in line with Patrick's 3-5 per server recommendation.
>>
>> Thanks!
>>
>> -- Eric
>>
>> On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> Did you try adding more client stress nodes as Patrick recommended?
>>>
>>> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the two newer machines' overall processing, compared to 18% on the slow machine.
>>>>
>>>> I took that machine out of the cluster completely and recreated the keyspaces. The ingest tests now run slightly faster (!). I would have expected a linear slowdown since the load is fairly balanced across partitions. GC appears to be the bottleneck in the 3-server configuration. But even in the two-server configuration the CPU/disk/network is still not fully utilized (the closest is CPU at ~45% on one ingest test). nodetool tpstats shows only blips of queueing.
>>>>
>>>> -- Eric
>>>>
>>>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Hi all - I wanted to follow up on this. I'm happy with the throughput we're getting but I'm still curious about the bottleneck.
>>>>>
>>>>> The big thing that sticks out is that one of the nodes is logging frequent GCInspector messages: 350-500ms every 3-6 seconds. All three nodes in the cluster have identical Cassandra configuration, but the node that is logging frequent GCs is an older machine with a slower CPU and SSD. This node logs frequent GCInspectors both under load and when compacting while otherwise unloaded.
>>>>>
>>>>> My theory is that the other two nodes have similar GC frequency (because they are seeing the same basic load), but because they are faster machines, they don't spend as much time per GC and don't cross the GCInspector threshold. Does that sound plausible? nodetool tpstats doesn't show any queueing in the system.
>>>>>
>>>>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>>
>>>>> The slow node (ultva03) spends disproportionate time in GC.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>>
>>>>>> Due to a cut-and-paste error those flamegraphs were a recording of the whole system, not just Cassandra.
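(For reference, a per-process capture along the lines of the java-flame-graphs article linked later in the thread might look like the sketch below. It assumes perf, perf-map-agent and Brendan Gregg's FlameGraph scripts are available and that Cassandra runs with -XX:+PreserveFramePointer; the paths, sample rate and duration are illustrative.)

    # Sample only the Cassandra JVM rather than the whole system.
    CASS_PID=$(pgrep -f CassandraDaemon)
    perf record -F 99 -g -p "$CASS_PID" -- sleep 60
    perf script | ./FlameGraph/stackcollapse-perf.pl | \
        ./FlameGraph/flamegraph.pl > flamegraph_cassandra.svg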
>>>>>> Throughput is approximately 30k rows/sec.
>>>>>>
>>>>>> Here are the graphs with just the Cassandra PID:
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>>>>
>>>>>> And here are graphs during a cqlsh COPY FROM to the same table, using real data, with MAXBATCHSIZE=2. Throughput is good at approximately 110k rows/sec.
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>>>
>>>>>>> Totally understood :)
>>>>>>>
>>>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include all of the CPUs. Actually most of them were set that way already (for example, 0000ffff,ffffffff) - it might be because irqbalance is running. But for some reason the interrupts are all being handled on CPU 0 anyway.
>>>>>>>
>>>>>>> I see this in /var/log/dmesg on the machines:
>>>>>>>
>>>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>>>> Enabled IRQ remapping in xapic mode
>>>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>>>
>>>>>>> In a reply to one of the comments on that article, the author says:
>>>>>>>
>>>>>>>> When IO-APIC is configured to spread interrupts among all cores, it can handle up to eight cores. If you have more than eight cores, the kernel will not configure IO-APIC to spread interrupts. Thus the trick I described in the article will not work. Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>>>
>>>>>>> I'm not sure if either of these is relevant to my situation.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>
>>>>>>>> You shouldn't need a kernel recompile. Check out the section "Simple solution for the problem" in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux. You can balance your requests across up to 8 CPUs.
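(A minimal sketch of the manual approach from that article - checking which CPU services the NIC's interrupts and re-pinning an IRQ by writing a CPU mask - where the interface name, IRQ number and mask below are illustrative.)

    # Which CPUs are servicing the NIC interrupts?
    grep eth0 /proc/interrupts
    # Pin IRQ 57 (illustrative) to CPU1 by writing a hex CPU bitmask;
    # note that a running irqbalance daemon may overwrite this.
    echo 2 > /proc/irq/57/smp_affinity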
>>>>>>>> I'll check out the flame graphs in a little bit - I'm in the middle of something and my brain doesn't multitask well :)
>>>>>>>>
>>>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jonathan -
>>>>>>>>>
>>>>>>>>> It looks like these machines are configured to use CPU 0 for all I/O interrupts. I don't think I'm going to get the OK to compile a new kernel for them to balance the interrupts across CPUs, but to mitigate the problem I used taskset to run the Cassandra process on every CPU except 0. It didn't change the performance, though. Let me know if you think it's crucial that we balance the interrupts across CPUs and I can try to lobby for a new kernel.
>>>>>>>>>
>>>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest into a table representative of what we are going to be using. Rows in this table are also roughly 200 bytes, with 64 columns and the primary key (date, sequence_number). cassandra-stress was run on 3 separate client machines. Using cassandra-stress to write to this table I see the same thing: neither disk, CPU nor network is fully utilized.
>>>>>>>>>
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>>>>
>>>>>>>>> Re: GC: in the stress run with the parameters above, two of the three nodes log zero or one GCInspector messages. On the other hand, the 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms each time. I found out that the 3rd machine actually has different specs from the other two. It's an older box with the same RAM but fewer CPUs (32 instead of 48), a slower SSD and slower memory. The Cassandra configuration is exactly the same. I tried running Cassandra with only 32 CPUs on the newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>>>>>>
>>>>>>>>> On a separate topic - for this cassandra-stress run I reduced the batch size to 2 in order to keep the logs clean. That also reduced the throughput from around 100k rows/sec to 32k rows/sec. I've been doing ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++ application. In most of those tests I've been using a batch size of around 20 (unlogged, all batch rows having the same partition key). However, that fills the logs with batch size warnings. I was going to raise the batch warning size but the docs scared me away from doing that. Given that we're using unlogged, same-partition batches, is it safe to raise the batch size warning limit? Actually cqlsh COPY FROM has very good throughput using a small batch size, but I can't get that same throughput in cassandra-stress or my C++ app with a batch size of 2.
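(For illustration, the kind of unlogged, single-partition batch being described might look like the sketch below - the keyspace, table and value column are made up; only the (date, sequence_number) key comes from the thread - along with the cassandra.yaml settings that control the warning.)

    # Illustrative only: every row in the batch shares the same partition key (date).
    cqlsh server1 <<'EOF'
    BEGIN UNLOGGED BATCH
      INSERT INTO myks.mytable (date, sequence_number, col1) VALUES ('2017-05-25', 1, 42);
      INSERT INTO myks.mytable (date, sequence_number, col1) VALUES ('2017-05-25', 2, 43);
    APPLY BATCH;
    EOF
    # The warning is driven by cassandra.yaml (Cassandra 3.0 defaults shown):
    #   batch_size_warn_threshold_in_kb: 5
    #   batch_size_fail_threshold_in_kb: 50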
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -- Eric
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>>>
>>>>>>>>>> How many CPUs are you using for interrupts? http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>>>>
>>>>>>>>>> Have you tried making a flame graph to see where Cassandra is spending its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>>
>>>>>>>>>> Are you tracking GC pauses?
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all:
>>>>>>>>>>>
>>>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing. One of the things that I'm testing is ingestion throughput. My server setup is:
>>>>>>>>>>>
>>>>>>>>>>>    - 3-node cluster
>>>>>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>>>>    - 64 GB RAM per server
>>>>>>>>>>>    - 48 cores per server
>>>>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>>>>    - 48 GB heap using G1GC
>>>>>>>>>>>    - 1 Gbps NICs
>>>>>>>>>>>
>>>>>>>>>>> Since I'm using SSDs I've tried tuning the following (one at a time) but none seemed to make much difference:
>>>>>>>>>>>
>>>>>>>>>>>    - concurrent_writes=384
>>>>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>>>>    - concurrent_compactors=8
>>>>>>>>>>>
>>>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients on the same subnet, using cassandra-stress. The tests use CL=ONE and RF=2.
>>>>>>>>>>>
>>>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a large enough column size and the standard five-column cassandra-stress schema. For example, -col size=fixed(400) will saturate the disk and compactions will start falling behind.
>>>>>>>>>>>
>>>>>>>>>>> One of our main tables has a row size of approximately 200 bytes, across 64 columns. When ingesting this table I don't see any resource saturation. Disk utilization is around 10-15% per iostat. Incoming network traffic on the servers is around 100-300 Mbps. CPU utilization is around 20-70%. nodetool tpstats shows mostly zeros with occasional spikes around 500 in MutationStage.
>>>>>>>>>>>
>>>>>>>>>>> The stress run does 10,000,000 inserts per client, each client with a separate range of partition IDs. The run with 200-byte rows takes about 4 minutes, with a mean latency of 4.5ms, total GC time of 21 secs, and average GC time of 173 ms.
>>>>>>>>>>>
>>>>>>>>>>> The overall performance is good - around 120k rows/sec ingested. But I'm curious to know where the bottleneck is. There's no resource saturation and nodetool tpstats shows only occasional brief queueing. Is the rest just expected latency inside of Cassandra?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> -- Eric
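(For reference, a cassandra-stress invocation along the lines described in that original message might look like the sketch below - the column size, replication factor and consistency level come from the message; the node names and thread count are placeholders, not the exact command used.)

    # Default five-column stress schema with 400-byte values, RF=2, CL=ONE;
    # node names and thread count are illustrative.
    cassandra-stress write n=10000000 cl=ONE \
        -col 'size=FIXED(400)' \
        -schema 'replication(factor=2)' \
        -rate threads=96 \
        -node server1,server2,server3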