Scratch that theory - the flamegraphs show that GC accounts for only 3-4% of
overall processing on the two newer machines, compared to 18% on the slow machine.

I took that machine out of the cluster completely and recreated the
keyspaces.  The ingest tests now run slightly faster (!).  I would have
expected a roughly linear slowdown, since the load is fairly balanced across
partitions.  GC appears to be the bottleneck in the three-server
configuration.  But even in the two-server configuration, CPU, disk, and
network are still not fully utilized (the closest is CPU at ~45% on one
ingest test).  nodetool tpstats shows only blips of queueing.
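
As a rough cross-check of the flamegraph numbers, something like this can
tally GCInspector pause time per node from system.log (just a sketch - it
assumes the log lines contain "GC in <N>ms", which may vary a bit by
Cassandra version):

    import re
    import sys

    # Sum GCInspector pause times from a Cassandra system.log so the GC share
    # of wall-clock time can be compared across nodes.
    GC_RE = re.compile(r"GCInspector.*GC in (\d+)ms")

    def total_gc_ms(path):
        total, count = 0, 0
        with open(path, errors="replace") as f:
            for line in f:
                m = GC_RE.search(line)
                if m:
                    total += int(m.group(1))
                    count += 1
        return total, count

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            total, count = total_gc_ms(path)
            print("%s: %d pauses, %d ms total GC" % (path, count, total))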




-- Eric

On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:

> Hi all - I wanted to follow up on this.  I'm happy with the throughput
> we're getting but I'm still curious about the bottleneck.
>
> The big thing that sticks out is that one of the nodes is logging frequent
> GCInspector messages: 350-500ms pauses every 3-6 seconds.  All three nodes in
> the cluster have identical Cassandra configuration, but the node that is
> logging frequent GCs is an older machine with a slower CPU and SSD.  This
> node logs frequent GCInspector messages both under load and when compacting
> while otherwise unloaded.
>
> My theory is that the other two nodes have similar GC frequency (because
> they are seeing the same basic load), but because they are faster machines,
> they don't spend as much time per GC and don't cross the GCInspector
> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
> any queueing in the system.
>
> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>
> The slow node (ultva03) spends a disproportionate amount of time in GC.
>
> Thanks,
>
>
> -- Eric
>
> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>
>> Due to a cut-and-paste error, those flamegraphs were a recording of the
>> whole system, not just Cassandra.  Throughput is approximately 30k
>> rows/sec.
>>
>> Here are the graphs with just the Cassandra PID:
>>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>
>>
>> And here are graphs during a cqlsh COPY FROM into the same table, using
>> real data and MAXBATCHSIZE=2.  Throughput is good, at approximately 110k
>> rows/sec.
>>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>
>>
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Totally understood :)
>>>
>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>> include all of the CPUs.  Actually, most of them were set that way already
>>> (for example, 0000ffff,ffffffff) - that might be because irqbalance is
>>> running.  But for some reason the interrupts are all being handled on CPU 0
>>> anyway.
>>>
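>>> For reference, a mask like 0000ffff,ffffffff does cover CPUs 0-47.  A
>>> quick sketch to decode one, in case it's useful:
>>>
>>>     def cpus_in_mask(mask):
>>>         # Decode a /proc/irq/<n>/smp_affinity hex mask (comma-separated
>>>         # 32-bit groups) into the list of CPUs it permits.
>>>         bits = int(mask.replace(",", ""), 16)
>>>         return [cpu for cpu in range(bits.bit_length()) if bits & (1 << cpu)]
>>>
>>>     print(cpus_in_mask("0000ffff,ffffffff"))   # CPUs 0 through 47
>>>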
>>> I see this in /var/log/dmesg on the machines:
>>>
>>>>
>>>> Your BIOS has requested that x2apic be disabled.
>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>> Enabled IRQ remapping in xapic mode
>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>
>>>
>>> In a reply to one of the comments on that article, the author says:
>>>
>>>
>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>> the article will not work.
>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>
>>>
>>> I'm not sure if either of them is relevant to my situation.
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com>
>>> wrote:
>>>
>>>> You shouldn't need a kernel recompile.  Check out the section "Simple
>>>> solution for the problem" in
>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
>>>> You can balance your requests across up to 8 CPUs.
>>>>
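>>>> Roughly, that "simple solution" amounts to writing a single-CPU mask into
>>>> each NIC queue's /proc/irq/<n>/smp_affinity.  A sketch of the idea (the
>>>> IRQ numbers are placeholders - take the real ones from /proc/interrupts,
>>>> and stop irqbalance first or it may overwrite the masks):
>>>>
>>>>     nic_irqs = [64, 65, 66, 67, 68, 69, 70, 71]   # hypothetical IRQ numbers
>>>>
>>>>     for i, irq in enumerate(nic_irqs):
>>>>         mask = 1 << (i % 8)            # pin each queue to one of CPUs 0-7
>>>>         with open("/proc/irq/%d/smp_affinity" % irq, "w") as f:
>>>>             f.write("%x\n" % mask)
>>>>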
>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>> something and my brain doesn't multitask well :)
>>>>
>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jonathan -
>>>>>
>>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>>>>> for them to balance the interrupts across CPUs, but to mitigate the problem
>>>>> I used taskset to run the Cassandra process on every CPU except 0.  It
>>>>> didn't change the performance, though.  Let me know if you think it's
>>>>> crucial that we balance the interrupts across CPUs and I can try to lobby
>>>>> for a new kernel.
>>>>>
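>>>>> The pinning was done with taskset; it was roughly the equivalent of this
>>>>> sketch (the PID is a placeholder):
>>>>>
>>>>>     import os
>>>>>
>>>>>     cassandra_pid = 12345                          # hypothetical PID
>>>>>     all_but_cpu0 = set(range(1, os.cpu_count()))   # every CPU except 0
>>>>>     os.sched_setaffinity(cassandra_pid, all_but_cpu0)
>>>>>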
>>>>> Here are flamegraphs from each node during a cassandra-stress ingest
>>>>> into a table representative of what we are going to be using.  Rows in
>>>>> this table are also roughly 200 bytes, across 64 columns, with the primary
>>>>> key (date, sequence_number).  cassandra-stress was run on 3 separate
>>>>> client machines.  Using cassandra-stress to write to this table I see the
>>>>> same thing: none of disk, CPU, or network is fully utilized.
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>
>>>>> Re: GC: In the stress run with the parameters above, two of the three
>>>>> nodes log zero or one GCInspector messages.  On the other hand, the 3rd
>>>>> machine logs a GCInspector message every 5 seconds or so, 300-500ms each
>>>>> time.  I found out that the 3rd machine actually has different specs from
>>>>> the other two.  It's an older box with the same RAM but fewer CPUs (32
>>>>> instead of 48), a slower SSD, and slower memory.  The Cassandra
>>>>> configuration is exactly the same.  I tried running Cassandra with only 32
>>>>> CPUs on the newer boxes to see if that would cause them to GC pause more,
>>>>> but it didn't.
>>>>>
>>>>> On a separate topic - for this cassandra-stress run I reduced the batch
>>>>> size to 2 in order to keep the logs clean.  That also reduced the
>>>>> throughput from around 100k rows/sec to 32k rows/sec.  I've been doing
>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM, and a custom C++
>>>>> application.  In most of these tests I've been using a batch size of
>>>>> around 20 (unlogged, with all rows in a batch sharing the same partition
>>>>> key).  However, that fills the logs with batch-size warnings.  I was going
>>>>> to raise the batch warning threshold, but the docs scared me away from
>>>>> doing that.  Given that we're using unlogged, same-partition batches, is
>>>>> it safe to raise the batch size warning limit?  Actually, cqlsh COPY FROM
>>>>> gets very good throughput using a small batch size, but I can't get that
>>>>> same throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>>>
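>>>>> To illustrate what I mean by unlogged, same-partition batches, here is a
>>>>> sketch using the DataStax Python driver (the keyspace, table, and column
>>>>> names are made up; the real table is the 64-column one described above):
>>>>>
>>>>>     from cassandra import ConsistencyLevel
>>>>>     from cassandra.cluster import Cluster
>>>>>     from cassandra.query import BatchStatement, BatchType
>>>>>
>>>>>     session = Cluster(["ultva01", "ultva02", "ultva03"]).connect("test_ks")
>>>>>     insert = session.prepare(
>>>>>         "INSERT INTO sars (date, sequence_number, payload) VALUES (?, ?, ?)")
>>>>>
>>>>>     def write_partition(date, rows, batch_size=20):
>>>>>         # rows: list of (sequence_number, payload) tuples for one date, so
>>>>>         # every row in a batch shares the same partition key.
>>>>>         for start in range(0, len(rows), batch_size):
>>>>>             batch = BatchStatement(batch_type=BatchType.UNLOGGED,
>>>>>                                    consistency_level=ConsistencyLevel.ONE)
>>>>>             for seq, payload in rows[start:start + batch_size]:
>>>>>                 batch.add(insert, (date, seq, payload))
>>>>>             session.execute(batch)
>>>>>
>>>>> (The warning threshold in question is batch_size_warn_threshold_in_kb in
>>>>> cassandra.yaml - 5 KB by default, if I remember right.)
>>>>>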
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> How many CPUs are you using for interrupts?
>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>
>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>> spending its time?
>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>
>>>>>> Are you tracking GC pauses?
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all:
>>>>>>>
>>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One of
>>>>>>> the things I'm testing is ingestion throughput.  My server setup is:
>>>>>>>
>>>>>>>    - 3 node cluster
>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>    - 64 GB RAM per server
>>>>>>>    - 48 cores per server
>>>>>>>    - Cassandra 3.0.11
>>>>>>>    - 48 GB heap using G1GC
>>>>>>>    - 1 Gbps NICs
>>>>>>>
>>>>>>> Since I'm using SSDs, I've tried tuning the following (one at a time),
>>>>>>> but none seemed to make a lot of difference:
>>>>>>>
>>>>>>>    - concurrent_writes=384
>>>>>>>    - memtable_flush_writers=8
>>>>>>>    - concurrent_compactors=8
>>>>>>>
>>>>>>> I am currently doing ingestion tests with cassandra-stress, sending
>>>>>>> data from 3 clients on the same subnet.  The tests use CL=ONE and RF=2.
>>>>>>>
>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>>>>> large enough column size and the standard five-column cassandra-stress
>>>>>>> schema.  For example, -col size=fixed(400) will saturate the disk
>>>>>>> and compactions will start falling behind.
>>>>>>>
>>>>>>> One of our main tables has a row size of approximately 200 bytes,
>>>>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>>>>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>>>>>>> network traffic on the servers is around 100-300 Mbps.  CPU utilization
>>>>>>> is around 20-70%.  nodetool tpstats shows mostly zeros, with occasional
>>>>>>> spikes around 500 in MutationStage.
>>>>>>>
>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>> separate range of partition IDs.  The run with 200-byte rows takes about
>>>>>>> 4 minutes, with a mean latency of 4.5 ms, total GC time of 21 seconds,
>>>>>>> and average GC time of 173 ms.
>>>>>>>
>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>> But I'm curious to know where the bottleneck is.  There's no resource
>>>>>>> saturation, and nodetool tpstats shows only occasional brief
>>>>>>> queueing.  Is the rest just expected latency inside Cassandra?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
