[
https://issues.apache.org/jira/browse/CASSANDRA-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827395#comment-16827395
]
Joseph Lynch edited comment on CASSANDRA-15066 at 4/27/19 12:50 AM:
--------------------------------------------------------------------
Note: this is not a comparative analysis, nor do we have root causes for all
findings. [~vinaykumarcse]'s and my goal today was to kick the tires of this
patch and see if there were any serious issues. We threw
[{{9b0b814add}}|https://github.com/apache/cassandra/commit/9b0b814add8da5a66c12a87a7bfebb015667f293]
on a smallish cluster and punished it with read and write load.
*Test setup*
* Two datacenters, approximately 70ms apart
* 6 {{i3.2xlarge}} nodes per datacenter (4 physical cores with 8 threads,
60GiB of memory, 1.8 TB NVMe dedicated drive, 2 Gbps network)
* 3-node NDBench cluster generating {{LOCAL_ONE}} random writes and full
partition reads of ~4KB partitions consisting of 2 rows of 10 columns each.
Total dataset per node was ~180GiB and reads were uniformly distributed
across the partition space. The table was mostly defaults (RF=3) except it used
Leveled Compaction Strategy and no compression (since the data is random).
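For anyone who wants to reproduce the workload shape, here is a minimal sketch
of the write/read pattern using the DataStax Java driver. The keyspace, table,
and column names are illustrative stand-ins (not the actual NDBench
configuration), and the schema in the comment reflects the settings described
above:
{code:java}
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;

public class WorkloadSketch {
    // Illustrative schema matching the description above (RF=3, LCS, no compression):
    //   CREATE TABLE perf.test (key bigint, row int, c0 blob, /* ... c9 */
    //     PRIMARY KEY (key, row))
    //   WITH compaction = {'class': 'LeveledCompactionStrategy'}
    //   AND compression = {'enabled': 'false'};
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement write = session.prepare(
                    "INSERT INTO perf.test (key, row, c0) VALUES (?, ?, ?)");
            PreparedStatement read = session.prepare(
                    "SELECT * FROM perf.test WHERE key = ?"); // full-partition read

            // ~4KB partitions: 2 rows x 10 columns of ~200B random payloads
            byte[] payload = new byte[200];
            ThreadLocalRandom.current().nextBytes(payload);
            long key = ThreadLocalRandom.current().nextLong(); // uniform key space

            // All traffic at LOCAL_ONE, as in the test
            session.execute(write.bind(key, 0, ByteBuffer.wrap(payload))
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));
            session.execute(read.bind(key)
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));
        }
    }
}
{code}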
*First test, bootstrap (aka punish with writes)*
In this test we used NDBench's backfill feature to punish the cluster with
writes.
* Backfilling the dataset easily achieved a sustained write throughput of 20k
coordinator-level WPS, with average latencies staying below 1ms
* The limiting factor appeared to be compaction throughput
* Flamegraphs are attached
There were no observed hints or dropped messages, and data sizes in both
datacenters looked reasonably consistent. I think this went very well.
!20k_backfill.png|thumbnail!
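For anyone re-running this, one way to verify the "no dropped messages"
observation programmatically is to read the {{DroppedMessage}} meters over
JMX; a minimal sketch, assuming the stock JMX port (7199) and the standard
Cassandra metrics naming:
{code:java}
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DroppedMessageCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // One DroppedMessage meter per verb (MUTATION, READ, ...)
            Set<ObjectName> meters = mbs.queryNames(new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,scope=*,name=Dropped"),
                    null);
            for (ObjectName meter : meters) {
                Object count = mbs.getAttribute(meter, "Count");
                System.out.printf("%s dropped: %s%n",
                        meter.getKeyProperty("scope"), count);
            }
        }
    }
}
{code}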
*Second test, establish baseline*
Next, we sent a reasonably modest 1200 coordinator RPS and 600 WPS, which is
very light load, and compared this patch to our production 30x branch.
* Writes are ~20% faster, as we saw previously comparing netty trunk to 30x
* Reads are *~500%* slower. This is new since our last tests; based on the
flamegraph, [~benedict] suspects (and I agree) that it is likely related to
some of the TR cleanup
* Checked the virtual table metrics, which seem reasonable, and also
spot-checked some of the new JMX per-channel metrics (a sketch of the virtual
table check follows below)
Summary: The read latency is concerning, but I think Benedict may already have
the fix.
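For context, the virtual-table check mentioned in the list above is just a CQL
read against the new {{system_views}} keyspace; a minimal sketch with the Java
driver, assuming the internode messaging tables land under names like
{{internode_outbound}} (the exact names may shift before the patch is
committed):
{code:java}
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class VirtualTableCheck {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Virtual tables are queried like ordinary tables; no special API
            for (Row row : session.execute(
                    "SELECT * FROM system_views.internode_outbound")) {
                System.out.println(row.getFormattedContents());
            }
        }
    }
}
{code}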
*Third test, punish with reads*
Due to the poor baseline read performance, we attempted to push the reads as
far as they would go while capturing a flamegraph to debug where we are
spending time.
* We were able to push the cluster to 60,000 coordinator RPS before we started
seeing CPU queuing.
* The limiting factor appeared to be CPU time (~80% saturated) and random
4KB IOPS (although we were only ~30% saturated there)
* Flamegraphs are attached
{{tpstats}} showed relatively little queueing and no QoS issues, and local
read latencies remained fast, so we believe a different issue is at play in
the read path; the attached flamegraphs should help with debugging.
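For completeness, a client-side read pusher for this kind of test typically
looks like the fixed-concurrency async pattern below; this is a hedged sketch
(the table, statement, and in-flight bound are illustrative, not the NDBench
internals):
{code:java}
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;

public class ReadPusher {
    public static void main(String[] args) throws InterruptedException {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement read = session.prepare(
                    "SELECT * FROM perf.test WHERE key = ?");
            // Bound in-flight requests so the client does not outrun the cluster
            Semaphore inFlight = new Semaphore(1024);
            while (true) {
                inFlight.acquire();
                long key = ThreadLocalRandom.current().nextLong(); // uniform reads
                session.executeAsync(read.bind(key)
                                .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
                        .whenComplete((rs, err) -> inFlight.release());
            }
        }
    }
}
{code}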
*Fourth test, punish with reads and writes*
We're currently attempting a mixed-mode test where we do many reads and
writes and see how they interact. Results will be posted shortly. I think
we'll need to bump our branch to pick up the latest changes.
*Summary*
So far this patch looks to be doing a great job. We have some issues to
figure out with the reads and many more tests to run, but it didn't explode,
so that is good heh.
> Improvements to Internode Messaging
> -----------------------------------
>
> Key: CASSANDRA-15066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15066
> Project: Cassandra
> Issue Type: Improvement
> Components: Messaging/Internode
> Reporter: Benedict
> Assignee: Benedict
> Priority: High
> Fix For: 4.0
>
> Attachments: 20k_backfill.png, 60k_RPS.png,
> 60k_RPS_CPU_bottleneck.png, backfill_cass_perf_ft_msg_tst.svg,
> baseline_patch_vs_30x.png, increasing_reads_latency.png,
> many_reads_cass_perf_ft_msg_tst.svg
>
>
> CASSANDRA-8457 introduced asynchronous networking to internode messaging, but
> there have been several follow-up endeavours to improve some semantic issues.
> CASSANDRA-14503 and CASSANDRA-13630 are the latest such efforts, and were
> combined some months ago into a single overarching refactor of the original
> work, to address some of the issues that have been discovered. Given the
> criticality of this work to the project, we wanted to bring some more eyes to
> bear to ensure the release goes ahead smoothly. In doing so, we uncovered a
> number of issues with messaging, some of them long-standing, that we felt
> needed to be addressed. This patch widens the scope of CASSANDRA-14503 and
> CASSANDRA-13630 in an effort to close the book on the messaging service, at
> least for the foreseeable future.
> The patch includes a number of clarifying refactors that touch outside of the
> {{net.async}} package, and a number of semantic changes to the {{net.async}}
> package itself. We believe it clarifies the intent and behaviour of the
> code while improving system stability, which we will outline in comments
> below.
> https://github.com/belliottsmith/cassandra/tree/messaging-improvements