[
https://issues.apache.org/jira/browse/CASSANDRA-15452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17910851#comment-17910851
]
Jon Haddad commented on CASSANDRA-15452:
----------------------------------------
Re-evaluated the most recent iteration of the patch. TL;DR: Everything's still
looking good so far.
I created an AMI using the following easy-cass-lab configuration in
cassandra_versions.yaml:
{noformat}
- version: "5.0-15452"
url: https://github.com/jrwest/cassandra.git
branch: 15452-5.0
java: "11"
java_build: "11"
python: "3.10.6"
axonops: "5.0"
jvm_options: jvm11-server.options
{noformat}
To build the AMI:
{noformat}
ecl build-image --cpu arm64
{noformat}
Cluster setup. The first two nodes are using the 5.0 release; cassandra2 is
using the patch:
{noformat}
ecl init -c 3 -s 1 -i r7g.4xlarge --si r7g.4xlarge \
--cpu arm64 --ebs.type gp3 --ebs.optimized --ebs.iops 16000 \
--ebs.throughput 1000 --ebs.size 16000 a15452-$(date +%s) --up
source env.sh
ecl use 5.0
ecl use 5.0-15452 --hosts cassandra2
ecl start
{noformat}
I've started a load test with this:
{noformat}
easy-cass-stress run RandomPartitionAccess -p 10m --workload.rows=10000 -r .1
--populate 10m --compaction lcs --rate 100k -d 1d --maxwlat 50
{noformat}
While the test was running early on, with only a couple hundred GB of data, I
looked at the metrics. The systems were more or less the same, with almost
every read coming out of the page cache. I stopped compaction for a bit,
letting pending compactions build up to about 150 on each node, then busted
the page cache with this:
{noformat}
c-all "echo 3 | sudo tee /proc/sys/vm/drop_caches"
Executing on cassandra0
3
Executing on cassandra1
3
Executing on cassandra2
3
{noformat}
and re-enabled compaction without a throttle:
{noformat}
$ c-all "nodetool setcompactionthroughput 0"
Executing on cassandra0
Executing on cassandra1
Executing on cassandra2
$ c-all "nodetool enableautocompaction"
Executing on cassandra0
Executing on cassandra1
Executing on cassandra2
{noformat}
You can see below that the read throughput for the blue line has hit its limit
of 16K IOPS, around 200MB/s. The drive is configured to deliver up to 1GB/s,
but because of the tiny reads we prematurely hit the IOPS limit.
The node using the internal read-ahead buffer (white line) is doing
significantly more throughput, while using fewer IOPS.
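A quick back-of-the-envelope check of those numbers (the ~12.8KB effective
read size below is my inference from 200MB/s divided by 16K IOPS, not
something I measured directly):

```python
IOPS_LIMIT = 16_000        # gp3 volume's provisioned IOPS
THROUGHPUT_LIMIT = 1_000   # MB/s, the volume's provisioned throughput ceiling

def throughput_mb_s(read_size_kb, iops=IOPS_LIMIT):
    """Effective sequential throughput when every disk read is read_size_kb,
    bounded by whichever limit (IOPS or throughput) is hit first."""
    return min(iops * read_size_kb / 1024, THROUGHPUT_LIMIT)

# Tiny reads: the IOPS budget is exhausted at roughly 200 MB/s.
small = throughput_mb_s(12.8)
# 256 KB reads (one EBS I/O can be up to 256 KB): throughput-bound instead,
# hitting the 1 GB/s ceiling with IOPS to spare.
large = throughput_mb_s(256)
```

With 256KB reads the same IOPS budget would allow about 4GB/s, so the volume's
1GB/s throughput cap becomes the binding limit instead of IOPS.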
!screenshot-5.png|width=852,height=150!
!image-2025-01-07-16-04-23-909.png|width=862,height=280!
Will post additional info as the test runs.
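For anyone skimming, the idea behind the internal read-ahead buffer can be
sketched in a few lines (illustrative Python only, not the actual Java patch;
the class name and chunk size default here are mine):

```python
class ReadAheadBuffer:
    """Serve small sequential reads from one large buffered disk read,
    so every actual disk I/O is big (e.g. 256 KB) instead of tiny."""

    def __init__(self, path, chunk_size=256 * 1024):
        self.f = open(path, "rb", buffering=0)  # unbuffered: we buffer ourselves
        self.chunk_size = chunk_size
        self.buf = b""
        self.pos = 0
        self.disk_reads = 0  # count of actual disk I/O calls

    def read(self, n):
        out = []
        while n > 0:
            if self.pos == len(self.buf):  # buffer exhausted: refill from disk
                self.buf = self.f.read(self.chunk_size)
                self.pos = 0
                self.disk_reads += 1
                if not self.buf:           # end of file
                    break
            take = self.buf[self.pos:self.pos + min(n, len(self.buf) - self.pos)]
            self.pos += len(take)
            n -= len(take)
            out.append(take)
        return b"".join(out)
```

Reading a 1MB file in 4KB application-level reads then costs 4 disk I/Os
instead of 256.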
> Improve disk access patterns during compaction and range reads
> --------------------------------------------------------------
>
> Key: CASSANDRA-15452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15452
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Legacy/Local Write-Read Paths, Local/Compaction
> Reporter: Jon Haddad
> Assignee: Jordan West
> Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
> Attachments: everyfs.txt, image-2024-11-22-16-17-23-194.png,
> image-2025-01-07-16-04-23-909.png, iostat-5.0-head.output,
> iostat-5.0-patched.output, iostat-ebs-15452.png, iostat-ebs-head.png,
> iostat-instance-15452.png, iostat-instance-head.png, results.txt,
> screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png,
> screenshot-5.png, screenshot-6.png, sequential.fio, throughput-1.png,
> throughput.png
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> On read heavy workloads Cassandra performs much better when using a low read
> ahead setting. In my tests I've seen a 5x improvement in throughput and
> more than a 50% reduction in latency. However, I've also observed that it
> can have a negative impact on compaction and streaming throughput. It
> especially negatively impacts cloud environments where small reads incur high
> costs in IOPS due to tiny requests.
> # We should investigate using POSIX_FADV_DONTNEED on files we're compacting
> to see if we can improve performance and reduce page faults.
> # This should be combined with an internal read ahead style buffer that
> Cassandra manages, similar to a BufferedInputStream but with our own
> machinery. This buffer should read fairly large blocks of data off disk at
> a time. EBS, for example, allows 1 IOP to be up to 256KB. A considerable
> amount of time is spent in blocking I/O during compaction and streaming.
> Reducing the frequency we read from disk should speed up all sequential I/O
> operations.
> # We can reduce system calls by buffering writes as well, but I think it
> will have less of an impact than the reads.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)