[ https://issues.apache.org/jira/browse/CASSANDRA-15452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900511#comment-17900511 ]

Jon Haddad commented on CASSANDRA-15452:
----------------------------------------

Just tested the 5.0 branch.  Watching the filesystem, I can see the compaction 
reads are all working as expected (262144 bytes, i.e. 256KB, per read):

{noformat}
$ xfsslower 0 -p $(cassandra-pid) | awk '$4 == "R" { print $0 }' | grep CompactionExec
TIME        COMM                PID        T BYTES   OFF_KB   LAT(ms) FILENAME
20:30:25 CompactionExec 33477  R 262144  378880      2.20 nb-1246-big-Data.db
20:30:25 CompactionExec 33477  R 262144  409344      1.06 nb-1188-big-Data.db
20:30:25 CompactionExec 33477  R 262144  395008      0.71 nb-1208-big-Data.db
20:30:25 CompactionExec 33477  R 262144  404736      0.85 nb-1259-big-Data.db
20:30:25 CompactionExec 33477  R 262144  414976      0.45 nb-1187-big-Data.db
20:30:25 CompactionExec 33477  R 262144  405504      0.55 nb-1224-big-Data.db
20:30:25 CompactionExec 33477  R 262144  384000      0.92 nb-1210-big-Data.db
20:30:25 CompactionExec 33477  R 262144  346368      1.30 nb-1201-big-Data.db
20:30:25 CompactionExec 33477  R 262144  416256      0.90 nb-1292-big-Data.db
20:30:25 CompactionExec 33477  R 262144  416512      0.79 nb-1222-big-Data.db
20:30:25 CompactionExec 33477  R 262144  384256      0.12 nb-1280-big-Data.db
20:30:25 CompactionExec 33477  R 262144  363008      0.56 nb-1232-big-Data.db
20:30:25 CompactionExec 33477  R 262144  355072      1.67 nb-1216-big-Data.db
{noformat}

I set up a 3-node cluster on EBS GP3 volumes (16K IOPS, 1,000 MB/s throughput), 
with readahead set to 4KB (one way to set that is shown below).  I loaded it up 
with 1TB of data with compaction stopped, letting SSTables build up for a while 
across several tables.  I then stopped my workload and let things chill for a 
bit, so we don't get any noise from running workloads, and nuked the page cache 
so we're actually hitting disk:

{noformat}
echo 3 | sudo tee /proc/sys/vm/drop_caches
{noformat}
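
For reference, one way to set a 4KB readahead (readahead is specified in 
512-byte sectors, and the device name here is just an example):

{noformat}
# 8 sectors x 512 bytes = 4KB
sudo blockdev --setra 8 /dev/nvme1n1

# verify
sudo blockdev --getra /dev/nvme1n1
{noformat}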

Next I disabled compaction throttling and kicked off a major compaction with 
nodetool compact.
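Roughly these commands (the keyspace name is just a placeholder):

{noformat}
# 0 disables the compaction throughput limit
nodetool setcompactionthroughput 0

# major compaction on the test keyspace
nodetool compact test_keyspace
{noformat}

After a while I grabbed this from my dashboard: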


 !screenshot-1.png! 

10.0.2.169 is the node running the patch, which is doing 3x the IO throughput 
of the other nodes.  It's also using considerably fewer IOPS:

 !screenshot-2.png! 

The spikes there are from writes, which are able to hit higher throughput 
because there's more headroom available for memtable flushes.
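
As a side note, the IOPS-vs-throughput tradeoff here is easy to reproduce 
outside Cassandra.  A quick fio comparison (the file path and sizes are 
arbitrary): read the same file sequentially at 4KB and at 256KB block sizes 
with the page cache bypassed, and compare the reported IOPS and bandwidth.

{noformat}
# small blocks: lots of IOPS for modest throughput
fio --name=seq-4k --filename=/mnt/data/fio-seq-test --rw=read --bs=4k --size=1g --direct=1

# 256KB blocks: EBS counts each read as a single IOP, so far fewer IOPS for the same bytes
fio --name=seq-256k --filename=/mnt/data/fio-seq-test --rw=read --bs=256k --size=1g --direct=1
{noformat}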


> Improve disk access patterns during compaction (big format)
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-15452
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15452
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Local Write-Read Paths, Local/Compaction
>            Reporter: Jon Haddad
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 4.1.x, 5.0.x, 5.x
>
>         Attachments: everyfs.txt, iostat-5.0-head.output, 
> iostat-5.0-patched.output, iostat-ebs-15452.png, iostat-ebs-head.png, 
> iostat-instance-15452.png, iostat-instance-head.png, results.txt, 
> screenshot-1.png, screenshot-2.png, sequential.fio, throughput-1.png, 
> throughput.png
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> On read heavy workloads Cassandra performs much better when using a low read 
> ahead setting.   In my tests I've seen a 5x improvement in throughput and 
> more than a 50% reduction in latency.  However, I've also observed that it 
> can have a negative impact on compaction and streaming throughput. It 
> especially negatively impacts cloud environments where small reads incur high 
> costs in IOPS due to tiny requests.
>  # We should investigate using POSIX_FADV_DONTNEED on files we're compacting 
> to see if we can improve performance and reduce page faults. 
>  # This should be combined with an internal read ahead style buffer that 
> Cassandra manages, similar to a BufferedInputStream but with our own 
> machinery.  This buffer should read fairly large blocks of data off disk at 
> a time.  EBS, for example, allows 1 IOP to be up to 256KB.  A considerable 
> amount of time is spent in blocking I/O during compaction and streaming. 
> Reducing the frequency we read from disk should speed up all sequential I/O 
> operations.
>  # We can reduce system calls by buffering writes as well, but I think it 
> will have less of an impact than the reads.
