[
https://issues.apache.org/jira/browse/CASSANDRA-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827395#comment-16827395
]
Joseph Lynch edited comment on CASSANDRA-15066 at 4/27/19 12:50 AM:
--------------------------------------------------------------------
Note: this is not a comparative analysis, nor do we have root causes for all
findings. [~vinaykumarcse]'s and my goal today was to kick the tires of this
patch and see if there were any serious issues. We threw
[{{9b0b814add}}|https://github.com/apache/cassandra/commit/9b0b814add8da5a66c12a87a7bfebb015667f293]
on a smallish cluster and punished it with read and write load.
*Test setup*
* Two datacenters, approximately 70ms apart
* 6 {{i3.2xlarge}} nodes per datacenter (4 physical cores with 8 threads,
60GiB of memory, 1.8 TB NVMe dedicated drive, 2 Gbps network)
* 3-node NDBench cluster generating {{LOCAL_ONE}} random writes and full
partition reads of ~4KB partitions consisting of 2 rows of 10 columns each.
Total dataset per node was ~180GiB and reads were uniformly distributed
across the partition space. The table was mostly defaults (RF=3) except it used
Leveled Compaction Strategy and no compression (since the data is random).
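For anyone who wants to reproduce the workload shape, here is a minimal sketch
of the write/read pattern using the DataStax Java driver. The keyspace, table,
and column names are illustrative stand-ins (not the actual NDBench
configuration), and the schema in the comment reflects the settings described
above:
{code:java}
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;

public class WorkloadSketch {
    // Illustrative schema matching the description above (RF=3, LCS, no compression):
    //   CREATE TABLE perf.test (key bigint, row int, c0 blob, /* ... c9 */
    //     PRIMARY KEY (key, row))
    //   WITH compaction = {'class': 'LeveledCompactionStrategy'}
    //   AND compression = {'enabled': 'false'};
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement write = session.prepare(
                    "INSERT INTO perf.test (key, row, c0) VALUES (?, ?, ?)");
            PreparedStatement read = session.prepare(
                    "SELECT * FROM perf.test WHERE key = ?"); // full-partition read

            // ~4KB partitions: 2 rows x 10 columns of ~200B random payloads
            byte[] payload = new byte[200];
            ThreadLocalRandom.current().nextBytes(payload);
            long key = ThreadLocalRandom.current().nextLong(); // uniform key space

            // All traffic at LOCAL_ONE, as in the test
            session.execute(write.bind(key, 0, ByteBuffer.wrap(payload))
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));
            session.execute(read.bind(key)
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));
        }
    }
}
{code}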
*First test, bootstrap (aka punish with writes)*
In this test we used NDBench's backfill feature to punish the cluster with
writes.
* Backfilling the dataset easily achieved a sustained write throughput of 20k
coordinator-level WPS, with average latencies staying below 1ms
* The limiting factor appeared to be compaction throughput
* Flamegraphs are attached
There were no observed hints or dropped messages, and data sizes in both
datacenters looked reasonably consistent. I think this went very well.
!20k_backfill.png|thumbnail!
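For anyone re-running this, one way to verify the "no dropped messages"
observation programmatically is to read the {{DroppedMessage}} meters over
JMX; a minimal sketch, assuming the stock JMX port (7199) and the standard
Cassandra metrics naming:
{code:java}
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DroppedMessageCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // One DroppedMessage meter per verb (MUTATION, READ, ...)
            Set<ObjectName> meters = mbs.queryNames(new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,scope=*,name=Dropped"),
                    null);
            for (ObjectName meter : meters) {
                Object count = mbs.getAttribute(meter, "Count");
                System.out.printf("%s dropped: %s%n",
                        meter.getKeyProperty("scope"), count);
            }
        }
    }
}
{code}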
*Second test, establish baseline*
Next, we sent a reasonably modest 1200 coordinator RPS and 600 WPS, which is
very light load, and compared this patch to our production 30x branch.
* Writes are ~20% faster, as we saw previously comparing netty trunk to 30x
* Reads are *~500%* slower. This is new since our last tests; based on the
flamegraph, [~benedict] suspects (and I agree) that it is likely related to
some of the TR cleanup
* Checked the virtual table metrics, which seem reasonable, and also
spot-checked some of the new JMX per-channel metrics (a sketch of the virtual
table check follows below)
Summary: The read latency is concerning, but I think Benedict may already have
the fix.
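For context, the virtual-table check mentioned in the list above is just a CQL
read against the new {{system_views}} keyspace; a minimal sketch with the Java
driver, assuming the internode messaging tables land under names like
{{internode_outbound}} (the exact names may shift before the patch is
committed):
{code:java}
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class VirtualTableCheck {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Virtual tables are queried like ordinary tables; no special API
            for (Row row : session.execute(
                    "SELECT * FROM system_views.internode_outbound")) {
                System.out.println(row.getFormattedContents());
            }
        }
    }
}
{code}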
*Third test, punish with reads*
Due to the poor baseline read performance, we attempted to push the reads as
far as they would go while capturing a flamegraph to debug where we are
spending time.
* We were able to push the cluster to 60,000 coordinator RPS before we started
seeing CPU queuing.
* The limiting factor appeared to be CPU time (~80% saturated) and random
4KB IOPS (although we were only ~30% saturated there)
* Flamegraphs are attached
{{tpstats}} showed relatively little queueing and no QoS issues, and local
read latencies remained fast, so we believe a different issue is at play in
the read path; the attached flamegraphs should help with debugging.
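For completeness, a client-side read pusher for this kind of test typically
looks like the fixed-concurrency async pattern below; this is a hedged sketch
(the table, statement, and in-flight bound are illustrative, not the NDBench
internals):
{code:java}
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;

public class ReadPusher {
    public static void main(String[] args) throws InterruptedException {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement read = session.prepare(
                    "SELECT * FROM perf.test WHERE key = ?");
            // Bound in-flight requests so the client does not outrun the cluster
            Semaphore inFlight = new Semaphore(1024);
            while (true) {
                inFlight.acquire();
                long key = ThreadLocalRandom.current().nextLong(); // uniform reads
                session.executeAsync(read.bind(key)
                                .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
                        .whenComplete((rs, err) -> inFlight.release());
            }
        }
    }
}
{code}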
*Fourth test, punish with reads and writes*
We're currently attempting a mixed-mode test where we do many reads and
writes and see how they interact. Results will be posted shortly. I think
we'll need to bump our branch to pick up the latest changes.
*Summary*
So far this patch looks to be doing a great job. We have some issues to
figure out with the reads and many more tests to run, but it didn't explode,
so that is good heh.
> Improvements to Internode Messaging
> -----------------------------------
>
> Key: CASSANDRA-15066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15066
> Project: Cassandra
> Issue Type: Improvement
> Components: Messaging/Internode
> Reporter: Benedict
> Assignee: Benedict
> Priority: High
> Fix For: 4.0
>
> Attachments: 20k_backfill.png, 60k_RPS.png,
> 60k_RPS_CPU_bottleneck.png, backfill_cass_perf_ft_msg_tst.svg,
> baseline_patch_vs_30x.png, increasing_reads_latency.png,
> many_reads_cass_perf_ft_msg_tst.svg
>
>
> CASSANDRA-8457 introduced asynchronous networking to internode messaging, but
> there have been several follow-up endeavours to improve some semantic issues.
> CASSANDRA-14503 and CASSANDRA-13630 are the latest such efforts, and were
> combined some months ago into a single overarching refactor of the original
> work, to address some of the issues that have been discovered. Given the
> criticality of this work to the project, we wanted to bring some more eyes to
> bear to ensure the release goes ahead smoothly. In doing so, we uncovered a
> number of issues with messaging, some of them long-standing, that we felt
> needed to be addressed. This patch widens the scope of CASSANDRA-14503 and
> CASSANDRA-13630 in an effort to close the book on the messaging service, at
> least for the foreseeable future.
> The patch includes a number of clarifying refactors that touch outside of the
> {{net.async}} package, and a number of semantic changes to the {{net.async}}
> package itself. We believe it clarifies the intent and behaviour of the
> code while improving system stability, which we will outline in comments
> below.
> https://github.com/belliottsmith/cassandra/tree/messaging-improvements