You shouldn't need a kernel recompile. Check out the section "Simple
solution for the problem" in
http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
You can balance the interrupts across up to 8 CPUs.
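
It's just a procfs write at runtime, no reboot needed. Roughly like this
(the device name and IRQ numbers below are made up - pull the real ones
from /proc/interrupts):

    # See which IRQs belong to the NIC and which CPUs are servicing them
    grep eth0 /proc/interrupts

    # Pin each IRQ to a CPU by writing a hex CPU mask (as root).
    # 1 = CPU0, 2 = CPU1, 4 = CPU2, 8 = CPU3, and so on.
    echo 2 > /proc/irq/45/smp_affinity   # queue 0 -> CPU 1
    echo 4 > /proc/irq/46/smp_affinity   # queue 1 -> CPU 2

One caveat: if irqbalance is running it may rewrite those masks, so stop
it or configure it to leave the NIC's IRQs alone.
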
I'll check out the flame graphs in a little bit - in the middle of
something and my brain doesn't multitask well :)

On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:

> Hi Jonathan -
>
> It looks like these machines are configured to use CPU 0 for all I/O
> interrupts. I don't think I'm going to get the OK to compile a new
> kernel for them to balance the interrupts across CPUs, but to mitigate
> the problem I taskset the Cassandra process to run on all CPUs except 0.
> It didn't change the performance, though. Let me know if you think it's
> crucial that we balance the interrupts across CPUs and I can try to
> lobby for a new kernel.
>
> Here are flame graphs from each node, from a cassandra-stress ingest
> into a table representative of what we are going to be using. This
> table also has roughly 200-byte rows, with 64 columns and the primary
> key (date, sequence_number). cassandra-stress was run on 3 separate
> client machines. Using cassandra-stress to write to this table I see
> the same thing: neither disk, CPU, nor network is fully utilized.
>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>
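> In case it helps to interpret them: this is roughly how the SVGs were
> generated, following Brendan Gregg's Java flame graph instructions (the
> perf-map-agent/FlameGraph paths and the pgrep pattern are specific to my
> setup; the JVM runs with -XX:+PreserveFramePointer so the stacks come
> out complete):
>
>     # Write the JIT symbol map for the Cassandra JVM (perf-map-agent)
>     ./perf-map-agent/bin/create-java-perf-map.sh $(pgrep -f CassandraDaemon)
>
>     # Sample all CPUs at 99 Hz for 30 seconds, then fold and render
>     perf record -F 99 -a -g -- sleep 30
>     perf script > out.perf
>     ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
>     ./FlameGraph/flamegraph.pl out.folded > flamegraph_ultva01_sars.svg
>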
> Re: GC: In the stress run with the parameters above, two of the three
> nodes log zero or one GCInspectors. On the other hand, the 3rd machine
> logs a GCInspector every 5 seconds or so, 300-500 ms each time. I found
> out that the 3rd machine actually has different specs from the other
> two. It's an older box with the same RAM but fewer CPUs (32 instead of
> 48), a slower SSD, and slower memory. The Cassandra configuration is
> exactly the same. I tried running Cassandra with only 32 CPUs on the
> newer boxes to see if that would cause them to GC pause more, but it
> didn't.
>
> On a separate topic: for this cassandra-stress run I reduced the batch
> size to 2 in order to keep the logs clean. That also reduced the
> throughput from around 100k rows/sec to 32k rows/sec. I've been doing
> ingestion tests using cassandra-stress, cqlsh COPY FROM, and a custom
> C++ application. In most of those tests I've been using a batch size of
> around 20 (unlogged, with all rows in a batch sharing the same partition
> key). However, that fills the logs with batch-size warnings. I was
> going to raise the batch warning size, but the docs scared me away from
> doing that. Given that we're using unlogged, same-partition batches, is
> it safe to raise the batch size warning limit? Interestingly, cqlsh
> COPY FROM gets very good throughput using a small batch size, but I
> can't get that same throughput in cassandra-stress or my C++ app with a
> batch size of 2.
>
> Thanks!
>
> -- Eric
>
> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
>> How many CPUs are you using for interrupts?
>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>
>> Have you tried making a flame graph to see where Cassandra is spending
>> its time?
>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>
>> Are you tracking GC pauses?
>>
>> Jon
>>
>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Hi all:
>>>
>>> I'm new to Cassandra and I'm doing some performance testing. One of
>>> the things that I'm testing is ingestion throughput. My server setup
>>> is:
>>>
>>>    - 3-node cluster
>>>    - SSD data (both commit log and sstables are on the same disk)
>>>    - 64 GB RAM per server
>>>    - 48 cores per server
>>>    - Cassandra 3.0.11
>>>    - 48 GB heap using G1GC
>>>    - 1 Gbps NICs
>>>
>>> Since I'm using SSDs I've tried tuning the following (one at a time),
>>> but none seemed to make much of a difference:
>>>
>>>    - concurrent_writes=384
>>>    - memtable_flush_writers=8
>>>    - concurrent_compactors=8
>>>
>>> I am currently doing ingestion tests sending data from 3 clients on
>>> the same subnet, using cassandra-stress. The tests use CL=ONE and
>>> RF=2.
>>>
>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>> large enough column size and the standard five-column cassandra-stress
>>> schema. For example, -col size=fixed(400) will saturate the disk and
>>> compactions will start falling behind.
>>>
>>> One of our main tables has a row size of approximately 200 bytes,
>>> across 64 columns. When ingesting into this table I don't see any
>>> resource saturation. Disk utilization is around 10-15% per iostat.
>>> Incoming network traffic on the servers is around 100-300 Mbps. CPU
>>> utilization is around 20-70%. nodetool tpstats shows mostly zeros,
>>> with occasional spikes around 500 in MutationStage.
>>>
>>> The stress run does 10,000,000 inserts per client, each client using a
>>> separate range of partition IDs. The run with 200-byte rows takes
>>> about 4 minutes, with a mean latency of 4.5 ms, total GC time of 21
>>> sec, and average GC time of 173 ms.
>>>
>>> The overall performance is good - around 120k rows/sec ingested. But
>>> I'm curious to know where the bottleneck is. There's no resource
>>> saturation, and nodetool tpstats shows only occasional brief queueing.
>>> Is the rest just expected latency inside of Cassandra?
>>>
>>> Thanks,
>>>
>>> -- Eric
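
PS - when you watch CPU utilization during these runs, look at it
per-core rather than in aggregate. If all I/O interrupts land on CPU 0,
that one core can be pegged while the overall number still reads 20-70%.
A quick sketch of how to check (mpstat is in the sysstat package):

    mpstat -P ALL 1                  # per-CPU view; watch %irq/%soft on CPU 0
    watch -d cat /proc/interrupts    # which CPUs are fielding the NIC IRQs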