Did you try adding more client stress nodes as Patrick recommended?

On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:

> Scratch that theory - the flamegraphs show that GC is only 3-4% of the two
> newer machines' overall processing, compared to 18% on the slow machine.
>
> I took that machine out of the cluster completely and recreated the
> keyspaces.  The ingest tests now run slightly faster (!).   I would have
> expected a linear slowdown since the load is fairly balanced across
> partitions.  GC appears to be the bottleneck in the 3-server
> configuration.  But even in the two-server configuration, CPU, disk, and
> network are still not fully utilized (the closest is CPU at ~45% on one
> ingest test).  nodetool tpstats shows only blips of queueing.
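>
> In case it's useful, here's one rough way to watch for those blips over
> time (a sketch only - it assumes the 3.0 tpstats layout where Pending is
> the third column, and the one-second sampling interval is arbitrary):
>
>     import subprocess, time
>
>     # Poll `nodetool tpstats` and print any thread pool with a non-zero
>     # Pending count.  Column layout assumed from the 3.0 output.
>     for _ in range(60):
>         out = subprocess.check_output(["nodetool", "tpstats"]).decode()
>         for line in out.splitlines():
>             cols = line.split()
>             if len(cols) >= 3 and cols[2].isdigit() and int(cols[2]) > 0:
>                 print(time.strftime("%H:%M:%S"), cols[0], "pending:", cols[2])
>         time.sleep(1)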
>
>
> -- Eric
>
> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:
>
>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>> we're getting but I'm still curious about the bottleneck.
>>
>> The big thing that sticks out is that one of the nodes is logging frequent
>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
>> the cluster have identical Cassandra configuration, but the node that is
>> logging frequent GCs is an older machine with a slower CPU and SSD.  This
>> node logs frequent GCInspectors both under load and when compacting while
>> otherwise unloaded.
>>
>> My theory is that the other two nodes have similar GC frequency (because
>> they are seeing the same basic load), but because they are faster machines,
>> they don't spend as much time per GC and don't cross the GCInspector
>> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
>> any queueing in the system.
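>>
>> One rough way to check that theory would be to total up the pauses that
>> GCInspector does report on each node and compare (a sketch; the log path
>> and the "GC in <N>ms" message format are assumptions, and anything below
>> the GCInspector threshold won't show up here at all - that would need the
>> full JVM GC log):
>>
>>     import re, sys
>>
>>     # Sum the pause times reported by GCInspector in a node's system.log.
>>     path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"
>>     pauses = []
>>     with open(path) as log:
>>         for line in log:
>>             m = re.search(r"GCInspector.*GC in (\d+)ms", line)
>>             if m:
>>                 pauses.append(int(m.group(1)))
>>     if pauses:
>>         print("%d pauses, %d ms total, %d ms average"
>>               % (len(pauses), sum(pauses), sum(pauses) // len(pauses)))
>>     else:
>>         print("no GCInspector pauses logged")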
>>
>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>
>> The slow node (ultva03) spends disproportionate time in GC.
>>
>> Thanks,
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Due to a cut and paste error those flamegraphs were a recording of the
>>> whole system, not just Cassandra.    Throughput is approximately 30k
>>> rows/sec.
>>>
>>> Here's the graphs with just the Cassandra PID:
>>>
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>
>>>
>>> And here are graphs during a cqlsh COPY FROM to the same table, using
>>> real data, MAXBATCHSIZE=2.    Throughput is good at approximately 110k
>>> rows/sec.
>>>
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com>
>>> wrote:
>>>
>>>> Totally understood :)
>>>>
>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>> include all of the CPUs.  Actually most of them were set that way already
>>>> (for example, 0000ffff,ffffffff) - it might be because irqbalance is
>>>> running.  But for some reason the interrupts are all being handled on
>>>> CPU 0 anyway.
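>>>>
>>>> For reference, the idea is roughly this (a sketch, not exactly what was
>>>> run - the mask is the value most IRQs already had, it needs root, some
>>>> IRQs refuse affinity changes, and irqbalance may rewrite the files while
>>>> it's running):
>>>>
>>>>     import glob
>>>>
>>>>     # Allow every IRQ to be delivered to all 48 CPUs.
>>>>     MASK = "0000ffff,ffffffff"
>>>>
>>>>     for path in glob.glob("/proc/irq/*/smp_affinity"):
>>>>         try:
>>>>             with open(path, "w") as f:
>>>>                 f.write(MASK + "\n")
>>>>         except IOError:
>>>>             pass  # timers and some other IRQs don't accept a new mask
>>>>
>>>> Comparing /proc/interrupts before and after shows whether deliveries
>>>> actually spread out.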
>>>>
>>>> I see this in /var/log/dmesg on the machines:
>>>>
>>>>>
>>>>> Your BIOS has requested that x2apic be disabled.
>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>> Enabled IRQ remapping in xapic mode
>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>
>>>>
>>>> In a reply to one of the comments on that article, the author says:
>>>>
>>>>
>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>>> the article will not work.
>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>
>>>>
>>>> I'm not sure if either of them is relevant to my situation.
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com>
>>>> wrote:
>>>>
>>>>> You shouldn't need a kernel recompile.  Check out the section "Simple
>>>>> solution for the problem" in
>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
>>>>> You can balance your requests across up to 8 CPUs.
>>>>>
>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>> something and my brain doesn't multitask well :)
>>>>>
>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jonathan -
>>>>>>
>>>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>>>> interrupts.  I don't think I'm going to get the OK to compile a new
>>>>>> kernel for them to balance the interrupts across CPUs, but to mitigate
>>>>>> the problem I taskset the Cassandra process to run on all CPUs except 0.
>>>>>> It didn't change the performance though.  Let me know if you think it's
>>>>>> crucial that we balance the interrupts across CPUs and I can try to
>>>>>> lobby for a new kernel.
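>>>>>>
>>>>>> The taskset-ing amounts to something like this (a sketch; the pgrep
>>>>>> pattern used to find the Cassandra PID is illustrative, and it has to
>>>>>> run as the same user or as root):
>>>>>>
>>>>>>     import os, subprocess
>>>>>>
>>>>>>     # Keep the Cassandra JVM off CPU 0 so that core is free for IRQs.
>>>>>>     pid = int(subprocess.check_output(
>>>>>>         ["pgrep", "-f", "CassandraDaemon"]).split()[0])
>>>>>>     os.sched_setaffinity(pid, set(range(os.cpu_count())) - {0})
>>>>>>     print("affinity now:", sorted(os.sched_getaffinity(pid)))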
>>>>>>
>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>> into a table representative of what we are going to be using.  Rows in
>>>>>> this table are also roughly 200 bytes, across 64 columns, with the
>>>>>> primary key (date, sequence_number).  Cassandra-stress was run on 3
>>>>>> separate client machines.  Using cassandra-stress to write to this
>>>>>> table I see the same thing: neither disk, CPU nor network is fully
>>>>>> utilized.
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>
>>>>>> Re: GC: In the stress run with the parameters above, two of the three
>>>>>> nodes log zero or one GCInspector messages.  On the other hand, the 3rd
>>>>>> machine logs a GCInspector every 5 seconds or so, 300-500ms each time.
>>>>>> I found out that the 3rd machine actually has different specs than the
>>>>>> other two.  It's an older box with the same RAM but fewer CPUs (32
>>>>>> instead of 48), a slower SSD and slower memory.  The Cassandra
>>>>>> configuration is exactly the same.  I tried running Cassandra with only
>>>>>> 32 CPUs on the newer boxes to see if that would cause them to GC pause
>>>>>> more, but it didn't.
>>>>>>
>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>>>> throughput from around 100k rows/sec to 32k rows/sec.  I've been doing
>>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom
>>>>>> C++ application.  In most of the tests I've been using a batch size of
>>>>>> around 20 (unlogged, all batch rows have the same partition key).
>>>>>> However, it fills the logs with batch size warnings.  I was going to
>>>>>> raise the batch warning threshold but the docs scared me away from
>>>>>> doing that.  Given that we're using unlogged, same-partition batches,
>>>>>> is it safe to raise the batch size warning limit?  Actually cqlsh COPY
>>>>>> FROM has very good throughput using a small batch size, but I can't get
>>>>>> that same throughput in cassandra-stress or my C++ app with a batch
>>>>>> size of 2.
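>>>>>>
>>>>>> For context, the batching pattern in these tests looks roughly like
>>>>>> the sketch below (shown with the Python driver because it's shorter
>>>>>> than the C++; the keyspace, table and column names are made up, and
>>>>>> BATCH_SIZE is the knob varied between 2 and 20):
>>>>>>
>>>>>>     from cassandra.cluster import Cluster
>>>>>>     from cassandra.query import BatchStatement, BatchType
>>>>>>
>>>>>>     BATCH_SIZE = 20
>>>>>>     session = Cluster(["ultva01"]).connect("test_ks")  # names made up
>>>>>>     insert = session.prepare(
>>>>>>         "INSERT INTO sars (date, sequence_number, col1) VALUES (?, ?, ?)")
>>>>>>
>>>>>>     def write_partition(date, rows):
>>>>>>         # Every row in the batch shares the partition key (date), so
>>>>>>         # the unlogged batch is routed to a single replica set.
>>>>>>         batch = BatchStatement(batch_type=BatchType.UNLOGGED)
>>>>>>         count = 0
>>>>>>         for seq, col1 in rows:
>>>>>>             batch.add(insert, (date, seq, col1))
>>>>>>             count += 1
>>>>>>             if count == BATCH_SIZE:
>>>>>>                 session.execute(batch)
>>>>>>                 batch = BatchStatement(batch_type=BatchType.UNLOGGED)
>>>>>>                 count = 0
>>>>>>         if count:
>>>>>>             session.execute(batch)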
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com>
>>>>>> wrote:
>>>>>>
>>>>>>> How many CPUs are you using for interrupts?
>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>
>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>> spending its time?
>>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>
>>>>>>> Are you tracking GC pauses?
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all:
>>>>>>>>
>>>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One
>>>>>>>> of the things that I'm testing is ingestion throughput.  My server
>>>>>>>> setup is:
>>>>>>>>
>>>>>>>>    - 3 node cluster
>>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>    - 64 GB RAM per server
>>>>>>>>    - 48 cores per server
>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>    - 48 GB heap using G1GC
>>>>>>>>    - 1 Gbps NICs
>>>>>>>>
>>>>>>>> Since I'm using SSD I've tried tuning the following (one at a time)
>>>>>>>> but none seemed to make a lot of difference:
>>>>>>>>
>>>>>>>>    - concurrent_writes=384
>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>    - concurrent_compactors=8
>>>>>>>>
>>>>>>>> I am currently doing ingestion tests, sending data from 3 clients on
>>>>>>>> the same subnet using cassandra-stress.  The tests use CL=ONE and
>>>>>>>> RF=2.
>>>>>>>>
>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using
>>>>>>>> a large enough column size and the standard five column 
>>>>>>>> cassandra-stress
>>>>>>>> schema.  For example, -col size=fixed(400) will saturate the disk
>>>>>>>> and compactions will start falling behind.
>>>>>>>>
>>>>>>>> One of our main tables has a row size of approximately 200 bytes,
>>>>>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>>>>>> saturation.  Disk utilization is around 10-15% per iostat.
>>>>>>>> Incoming network traffic on the servers is around 100-300 Mbps.  CPU
>>>>>>>> utilization is around 20-70%.  nodetool tpstats shows mostly zeros
>>>>>>>> with occasional spikes around 500 in MutationStage.
>>>>>>>>
>>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes 
>>>>>>>> about 4
>>>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC 
>>>>>>>> time 173
>>>>>>>> ms.
>>>>>>>>
>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>> But I'm curious to know where the bottleneck is.  There's no resource
>>>>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -- Eric
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
