Totally understood :)  I forgot to mention - I set the /proc/irq/*/smp_affinity
mask to include all of the CPUs. Actually, most of them were set that way
already (for example, 0000ffff,ffffffff) - it might be because irqbalance is
running. But for some reason the interrupts are all still being handled on
CPU 0 anyway.
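
In case it's useful, this is roughly what I've been running to check and set
the masks (the IRQ number and interface name below are just examples from one
of the boxes - substitute whatever /proc/interrupts shows for your NIC):

    # see which CPU is actually servicing the NIC interrupts
    grep -E 'CPU|eth0' /proc/interrupts
    # current affinity mask for one of those IRQs
    cat /proc/irq/97/smp_affinity
    # allow all CPUs for that IRQ (note: irqbalance may rewrite this later)
    echo 0000ffff,ffffffff > /proc/irq/97/smp_affinity
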
I see this in /var/log/dmesg on the machines:

> Your BIOS has requested that x2apic be disabled.
> This will leave your machine vulnerable to irq-injection attacks.
> Use 'intremap=no_x2apic_optout' to override BIOS request.
> Enabled IRQ remapping in xapic mode
> x2apic not enabled, IRQ remapping is in xapic mode

In a reply to one of the comments on that article, the author says:

> When IO-APIC configured to spread interrupts among all cores, it can handle
> up to eight cores. If you have more than eight cores, kernel will not
> configure IO-APIC to spread interrupts. Thus the trick I described in the
> article will not work.
> Otherwise it may be caused by buggy BIOS or even buggy hardware.

I'm not sure if either of these is relevant to my situation.

Thanks!

-- Eric

On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> You shouldn't need a kernel recompile. Check out the section "Simple
> solution for the problem" in
> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
> You can balance your requests across up to 8 CPUs.
>
> I'll check out the flame graphs in a little bit - in the middle of
> something and my brain doesn't multitask well :)
>
> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>
>> Hi Jonathan -
>>
>> It looks like these machines are configured to use CPU 0 for all I/O
>> interrupts. I don't think I'm going to get the OK to compile a new kernel
>> for them to balance the interrupts across CPUs, but to mitigate the
>> problem I taskset the Cassandra process to run on all CPUs except 0. It
>> didn't change the performance, though. Let me know if you think it's
>> crucial that we balance the interrupts across CPUs and I can try to lobby
>> for a new kernel.
>>
>> Here are flame graphs from each node, taken during a cassandra-stress
>> ingest into a table representative of what we are going to be using. This
>> table also has rows of roughly 200 bytes, with 64 columns and a primary
>> key of (date, sequence_number). cassandra-stress was run on 3 separate
>> client machines. Using cassandra-stress to write to this table I see the
>> same thing: none of disk, CPU, or network is fully utilized.
>>
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>
>> Re: GC: In the stress run with the parameters above, two of the three
>> nodes log zero or one GCInspector messages. The third machine, on the
>> other hand, logs a GCInspector every 5 seconds or so, 300-500 ms each
>> time. I found out that the third machine actually has different specs
>> than the other two. It's an older box with the same RAM but fewer CPUs
>> (32 instead of 48), a slower SSD, and slower memory. The Cassandra
>> configuration is exactly the same. I tried running Cassandra with only 32
>> CPUs on the newer boxes to see if that would cause them to GC pause more,
>> but it didn't.
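>>
>> For what it's worth, I've been counting the pauses with something like
>> the following (the log path assumes the default install layout, so adjust
>> as needed):
>>
>>     # number of GC pauses long enough for Cassandra to log
>>     grep -c GCInspector /var/log/cassandra/system.log
>>     # eyeball the most recent pause durations
>>     grep GCInspector /var/log/cassandra/system.log | tail -5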
>>
>> On a separate topic - for this cassandra-stress run I reduced the batch
>> size to 2 in order to keep the logs clean. That also reduced the
>> throughput from around 100k rows/sec to 32k rows/sec. I've been doing
>> ingestion tests using cassandra-stress, cqlsh COPY FROM, and a custom C++
>> application. In most of these tests I've been using a batch size of
>> around 20 (unlogged, with all rows in a batch sharing the same partition
>> key). However, that fills the logs with batch-size warnings. I was going
>> to raise the batch size warning threshold, but the docs scared me away
>> from doing that. Given that we're using unlogged, same-partition batches,
>> is it safe to raise the batch size warning limit? Interestingly, cqlsh
>> COPY FROM gets very good throughput using a small batch size, but I can't
>> get the same throughput from cassandra-stress or my C++ app with a batch
>> size of 2.
>>
>> Thanks!
>>
>> -- Eric
>>
>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>
>>> How many CPUs are you using for interrupts?
>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>
>>> Have you tried making a flame graph to see where Cassandra is spending
>>> its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>
>>> Are you tracking GC pauses?
>>>
>>> Jon
>>>
>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Hi all:
>>>>
>>>> I'm new to Cassandra and I'm doing some performance testing. One of the
>>>> things that I'm testing is ingestion throughput. My server setup is:
>>>>
>>>> - 3 node cluster
>>>> - SSD data (both commit log and sstables are on the same disk)
>>>> - 64 GB RAM per server
>>>> - 48 cores per server
>>>> - Cassandra 3.0.11
>>>> - 48 GB heap using G1GC
>>>> - 1 Gbps NICs
>>>>
>>>> Since I'm using SSDs I've tried tuning the following (one at a time),
>>>> but none seemed to make much difference:
>>>>
>>>> - concurrent_writes=384
>>>> - memtable_flush_writers=8
>>>> - concurrent_compactors=8
>>>>
>>>> I am currently doing ingestion tests with cassandra-stress, sending
>>>> data from 3 clients on the same subnet. The tests use CL=ONE and RF=2.
>>>>
>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>> large enough column size and the standard five-column cassandra-stress
>>>> schema. For example, -col size=fixed(400) will saturate the disk and
>>>> compactions will start falling behind.
>>>>
>>>> One of our main tables has a row size of approximately 200 bytes,
>>>> across 64 columns. When ingesting this table I don't see any resource
>>>> saturation. Disk utilization is around 10-15% per iostat. Incoming
>>>> network traffic on the servers is around 100-300 Mbps. CPU utilization
>>>> is around 20-70%. nodetool tpstats shows mostly zeros, with occasional
>>>> spikes around 500 in MutationStage.
>>>>
>>>> The stress run does 10,000,000 inserts per client, each client with a
>>>> separate range of partition IDs. The run with 200-byte rows takes about
>>>> 4 minutes, with mean latency 4.5 ms, total GC time of 21 secs, and
>>>> average GC time of 173 ms.
>>>>
>>>> The overall performance is good - around 120k rows/sec ingested. But
>>>> I'm curious to know where the bottleneck is. There's no resource
>>>> saturation, and nodetool tpstats shows only occasional brief queueing.
>>>> Is the rest just expected latency inside of Cassandra?
>>>>
>>>> Thanks,
>>>>
>>>> -- Eric
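>>>>
>>>> P.S. For concreteness, the stress command for the fixed-size runs is
>>>> roughly the following - treat it as a sketch, since the node addresses,
>>>> thread count, and population range here are placeholders rather than the
>>>> exact values I used:
>>>>
>>>>     # standard 5-column stress schema, 400-byte values, RF=2, CL=ONE
>>>>     cassandra-stress write n=10000000 cl=one \
>>>>         -col size='FIXED(400)' \
>>>>         -schema 'replication(factor=2)' \
>>>>         -pop seq=1..10000000 \
>>>>         -rate threads=200 \
>>>>         -node 10.0.0.1,10.0.0.2,10.0.0.3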