Did you try adding more client stress nodes as Patrick recommended?

On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
Scratch that theory - the flamegraphs show that GC is only 3-4% of the two newer machines' overall processing, compared to 18% on the slow machine.

I took that machine out of the cluster completely and recreated the keyspaces. The ingest tests now run slightly faster (!). I would have expected a linear slowdown since the load is fairly balanced across partitions. GC appears to be the bottleneck in the 3-server configuration. But even in the two-server configuration the CPU/disk/network is still not fully utilized (the closest is CPU at ~45% on one ingest test). nodetool tpstats shows only blips of queueing.

-- Eric

On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:

Hi all - I wanted to follow up on this. I'm happy with the throughput we're getting but I'm still curious about the bottleneck.

The big thing that sticks out is that one of the nodes is logging frequent GCInspector messages: 350-500ms every 3-6 seconds. All three nodes in the cluster have identical Cassandra configuration, but the node that is logging frequent GCs is an older machine with a slower CPU and SSD. This node logs frequent GCInspectors both under load and when compacting while otherwise unloaded.

My theory is that the other two nodes have a similar GC frequency (because they are seeing the same basic load), but because they are faster machines, they don't spend as much time per GC and don't cross the GCInspector threshold. Does that sound plausible? nodetool tpstats doesn't show any queueing in the system.

Here are flamegraphs from the system when running a cqlsh COPY FROM:

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg

The slow node (ultva03) spends a disproportionate amount of time in GC.

Thanks,

-- Eric

On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:

Due to a cut and paste error those flamegraphs were a recording of the whole system, not just Cassandra. Throughput is approximately 30k rows/sec.

Here are the graphs with just the Cassandra PID:

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg

And here are graphs during a cqlsh COPY FROM to the same table, using real data, MAXBATCHSIZE=2. Throughput is good at approximately 110k rows/sec.

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg

-- Eric
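A per-process recording like the ones above can be captured with something along these lines. This is only a sketch: it assumes perf, Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl), and perf-map-agent are installed on the node, and the sampling rate, duration, and output names are placeholders.

    # Sample just the Cassandra JVM for ~60 seconds at 99 Hz (needs root or a
    # permissive perf_event_paranoid setting).
    pid=$(pgrep -f CassandraDaemon)
    perf record -F 99 -g -p "$pid" -- sleep 60
    # Write /tmp/perf-$pid.map so JIT-compiled frames resolve to Java symbols.
    create-java-perf-map.sh "$pid"
    perf script | stackcollapse-perf.pl | flamegraph.pl > cassandra_flamegraph.svg
    # Java stacks resolve best when the JVM runs with -XX:+PreserveFramePointer.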
On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:

Totally understood :)

I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include all of the CPUs. Actually, most of them were set that way already (for example, 0000ffff,ffffffff) - it might be because the irqbalance daemon is running. But for some reason the interrupts are all being handled on CPU 0 anyway.

I see this in /var/log/dmesg on the machines:

    Your BIOS has requested that x2apic be disabled.
    This will leave your machine vulnerable to irq-injection attacks.
    Use 'intremap=no_x2apic_optout' to override BIOS request.
    Enabled IRQ remapping in xapic mode
    x2apic not enabled, IRQ remapping is in xapic mode

In a reply to one of the comments on the smp-affinity article, its author says:

    When IO-APIC configured to spread interrupts among all cores, it can handle
    up to eight cores. If you have more than eight cores, kernel will not
    configure IO-APIC to spread interrupts. Thus the trick I described in the
    article will not work. Otherwise it may be caused by buggy BIOS or even
    buggy hardware.

I'm not sure if either of them is relevant to my situation.

Thanks!

-- Eric
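The check-and-override being described amounts to something like the sketch below; the IRQ number, device-name patterns, and masks are placeholders, and a running irqbalance daemon may rewrite any manual settings.

    # See which CPU is fielding the NIC/SSD interrupts (device names vary per box).
    grep -E 'eth|en|nvme' /proc/interrupts
    # Inspect and override the affinity mask (a hex CPU bitmap) for one IRQ:
    cat /proc/irq/45/smp_affinity                                  # e.g. 0000ffff,ffffffff
    echo 00000000,000000f0 | sudo tee /proc/irq/45/smp_affinity    # steer IRQ 45 to CPUs 4-7
    # Per the article's comment above, with more than 8 cores the IO-APIC
    # round-robin trick doesn't apply, so per-IRQ pinning like this is the fallback.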
On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

You shouldn't need a kernel recompile. Check out the section "Simple solution for the problem" in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux. You can balance your requests across up to 8 CPUs.

I'll check out the flame graphs in a little bit - in the middle of something and my brain doesn't multitask well :)

On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:

Hi Jonathan -

It looks like these machines are configured to use CPU 0 for all I/O interrupts. I don't think I'm going to get the OK to compile a new kernel for them to balance the interrupts across CPUs, but to mitigate the problem I used taskset to run the Cassandra process on all CPUs except 0. It didn't change the performance though. Let me know if you think it's crucial that we balance the interrupts across CPUs and I can try to lobby for a new kernel.

Here are flamegraphs from each node from a cassandra-stress ingest into a table representative of what we are going to be using. This table also has rows of roughly 200 bytes, with 64 columns and the primary key (date, sequence_number). Cassandra-stress was run on 3 separate client machines. Using cassandra-stress to write to this table I see the same thing: neither disk, CPU, nor network is fully utilized.

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg

Re: GC: in the stress run with the parameters above, two of the three nodes log zero or one GCInspectors. On the other hand, the 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms each time. I found out that the 3rd machine actually has different specs from the other two. It's an older box with the same RAM but fewer CPUs (32 instead of 48), a slower SSD, and slower memory. The Cassandra configuration is exactly the same. I tried running Cassandra with only 32 CPUs on the newer boxes to see if that would cause them to GC pause more, but it didn't.

On a separate topic - for this cassandra-stress run I reduced the batch size to 2 in order to keep the logs clean. That also reduced the throughput from around 100k rows/sec to 32k rows/sec. I've been doing ingestion tests using cassandra-stress, cqlsh COPY FROM, and a custom C++ application. In most of the tests I've been using a batch size of around 20 (unlogged, with all rows in a batch sharing the same partition key). However, that fills the logs with batch-size warnings. I was going to raise the batch warning threshold, but the docs scared me away from doing that. Given that we're using unlogged, same-partition batches, is it safe to raise the batch size warning limit? cqlsh COPY FROM actually gets very good throughput using a small batch size, but I can't get that same throughput in cassandra-stress or my C++ app with a batch size of 2.

Thanks!

-- Eric
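For reference on the warning limit: the thresholds are per-node settings in cassandra.yaml, so raising them means editing the file on each node and restarting. A rough sketch follows; the file path and the 20 KiB value are illustrative only, not a recommendation.

    # Unlogged same-partition batches still count against the size thresholds.
    grep -n 'batch_size_.*_threshold_in_kb' /etc/cassandra/cassandra.yaml
    #   batch_size_warn_threshold_in_kb: 5     <- warning (3.0 default)
    #   batch_size_fail_threshold_in_kb: 50    <- hard rejection (3.0 default)
    # Example: raise the warning threshold to 20 KiB, then restart Cassandra.
    sudo sed -i 's/^batch_size_warn_threshold_in_kb:.*/batch_size_warn_threshold_in_kb: 20/' \
        /etc/cassandra/cassandra.yaml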
On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

How many CPUs are you using for interrupts? http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux

Have you tried making a flame graph to see where Cassandra is spending its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html

Are you tracking GC pauses?

Jon

On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:

Hi all:

I'm new to Cassandra and I'm doing some performance testing. One of the things I'm testing is ingestion throughput. My server setup is:

- 3 node cluster
- SSD data (both commit log and sstables are on the same disk)
- 64 GB RAM per server
- 48 cores per server
- Cassandra 3.0.11
- 48 GB heap using G1GC
- 1 Gbps NICs

Since I'm using SSDs I've tried tuning the following (one at a time), but none seemed to make a lot of difference:

- concurrent_writes=384
- memtable_flush_writers=8
- concurrent_compactors=8

I am currently doing ingestion tests, sending data from 3 clients on the same subnet using cassandra-stress. The tests use CL=ONE and RF=2.

Using cassandra-stress (3.10) I am able to saturate the disk using a large enough column size and the standard five-column cassandra-stress schema. For example, -col size=fixed(400) will saturate the disk and compactions will start falling behind.

One of our main tables has a row size of approximately 200 bytes, across 64 columns. When ingesting this table I don't see any resource saturation. Disk utilization is around 10-15% per iostat. Incoming network traffic on the servers is around 100-300 Mbps. CPU utilization is around 20-70%. nodetool tpstats shows mostly zeros with occasional spikes around 500 in MutationStage.

The stress run does 10,000,000 inserts per client, each with a separate range of partition IDs. The run with 200-byte rows takes about 4 minutes, with a mean latency of 4.5 ms, total GC time of 21 seconds, and average GC time of 173 ms.

The overall performance is good - around 120k rows/sec ingested. But I'm curious to know where the bottleneck is. There's no resource saturation, and nodetool tpstats shows only occasional brief queueing. Is the rest just expected latency inside of Cassandra?

Thanks,

-- Eric
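For concreteness, the saturating run described above corresponds to an invocation roughly like the one below. It is a sketch only: the host names, thread count, and per-client sequence ranges are placeholders, with each client given a disjoint range so the partition IDs don't overlap.

    # One client's run against the standard five-column stress schema
    # (400-byte columns, RF=2, CL=ONE, 10M inserts).
    cassandra-stress write n=10000000 cl=ONE \
        -col 'size=FIXED(400)' \
        -schema 'replication(factor=2)' \
        -pop seq=1..10000000 \
        -rate threads=200 \
        -node ultva01,ultva02,ultva03
    # The other two clients would use -pop seq=10000001..20000000 and
    # -pop seq=20000001..30000000 respectively.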