[
https://issues.apache.org/jira/browse/CASSANDRA-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063567#comment-18063567
]
Sam Lightfoot edited comment on CASSANDRA-21134 at 3/6/26 3:07 PM:
-------------------------------------------------------------------
h3. Block I/O Latency: Compaction Writes at the Device Level
To understand _why_ DIO improves read tail latency, we captured block I/O
latency histograms using {{biolatency-bpfcc}} (BPF) during compaction. This
traces every I/O at the NVMe device driver, below both the page cache and the
filesystem.
*Setup:* 2 × 65 GB SSTables, major compaction at 128 MiB/s, 10K reads/s, 12 GB
cgroup, RAID1 NVMe, chunk_length_kb=4. 30-second capture during steady-state
compaction.
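The capture is straightforward to reproduce. Assuming the bcc tools are installed (packaged as {{bpfcc-tools}} on Debian/Ubuntu), a single 30-second histogram split per I/O flag set, which is what distinguishes the {{Write}} (writeback) and {{Sync-Write}} (O_DIRECT) I/Os below, can be taken with:

```shell
# One 30-second interval, one histogram per set of I/O flags (-F).
# Requires root and a kernel with BPF support.
sudo biolatency-bpfcc -F 30 1
```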
h4. Buffered compaction write I/Os (writeback)
With buffered I/O, compaction writes enter the page cache and are flushed to
disk asynchronously by the kernel's writeback daemon. These appear as {{Write}}
flag I/Os:
{noformat}
usecs : count distribution
8 -> 15 : 134 |* |
16 -> 31 : 276 |**** |
32 -> 63 : 199 |** |
64 -> 127 : 223 |*** |
128 -> 255 : 344 |***** |
256 -> 511 : 486 |******* |
512 -> 1023 : 928 |************* |
1024 -> 2047 : 1,110 |**************** |
2048 -> 4095 : 1,608 |*********************** |
4096 -> 8191 : 2,476 |************************************ |
8192 -> 16383 : 1,528 |********************** |
16384 -> 32767 : 1,809 |************************** |
32768 -> 65535 : 2,722 |**************************************** | <-- mode
65536 -> 131071 : 738 |********** |
{noformat}
*14,581 writeback I/Os. Mode at 32–65 ms. 82% exceed 1 ms. Spread across more
than four orders of magnitude (8 µs to 131 ms).*
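The headline statistics can be recomputed directly from the bucket counts above:

```python
# Bucket lower bound (usecs) -> I/O count, transcribed from the
# buffered-write histogram above.
counts = {
    8: 134, 16: 276, 32: 199, 64: 223, 128: 344, 256: 486,
    512: 928, 1024: 1110, 2048: 1608, 4096: 2476, 8192: 1528,
    16384: 1809, 32768: 2722, 65536: 738,
}
total = sum(counts.values())
over_1ms = sum(n for lo, n in counts.items() if lo >= 1024)
print(total)                             # 14581 I/Os in the capture
print(f"{100 * over_1ms / total:.1f}%")  # 82.2% of write I/Os above ~1 ms
```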
h4. DIO compaction write I/Os (O_DIRECT)
With DIO, compaction writes go directly to the device as {{Sync-Write}} I/Os,
bypassing the page cache entirely:
{noformat}
usecs : count distribution
8 -> 15 : 20 | |
16 -> 31 : 72 | |
32 -> 63 : 984 |* |
64 -> 127 : 31,424 |****************************************| <-- mode
1024 -> 2047 : 1 | |
4096 -> 8191 : 1 | |
{noformat}
*32,502 Sync-Write I/Os. Mode at 64–127 µs (~500x faster than the buffered
mode). 99.99% complete within 127 µs; only two I/Os exceed 1 ms.*
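The tight distribution follows from how O_DIRECT works: the user buffer, file offset, and length must all be aligned to the device's logical block size, and each write is issued straight to the device rather than being batched by writeback. A minimal Linux-only sketch (not Cassandra's write path; the file name is hypothetical and a 4 KiB block size is assumed):

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; O_DIRECT needs aligned buffer/offset/length

# An anonymous mmap is page-aligned, satisfying O_DIRECT's alignment rule.
buf = mmap.mmap(-1, BLOCK)
buf.write(b"\x42" * BLOCK)

fd = os.open("dio_demo.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    written = os.write(fd, buf)  # goes straight to the device, bypassing the page cache
    print(f"wrote {written} bytes via O_DIRECT")
except OSError as e:
    # Some filesystems (e.g. tmpfs) reject O_DIRECT with EINVAL.
    print(f"O_DIRECT unsupported here: {e}")
finally:
    os.close(fd)
    os.unlink("dio_demo.bin")
```

A misaligned buffer or length fails with EINVAL, which is why DIO write paths carry their own aligned buffer pools rather than reusing arbitrary heap buffers.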
h4. Impact on reads
The buffered writeback I/Os (mode 32–65 ms) saturate the device's write
bandwidth, causing user read I/Os to queue behind them. During the 30-second
capture:
- *Buffered:* 5,729 read I/Os exceeded 2 ms (3.8% of all reads reaching disk),
max ~32 ms
- *DIO:* 8 read I/Os exceeded 2 ms (0.04%), max ~16 ms
This writeback-induced read queueing is the primary mechanism behind the p99
latency difference observed in the application-level results.
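A side observation: the counts and percentages above also imply how many reads reached the device at all in each run (percentages are rounded, so these totals are approximate):

```python
# Reads that reached the device in each 30 s capture, implied by
# "5,729 slow reads = 3.8%" and "8 slow reads = 0.04%".
buffered_reads = 5729 / 0.038   # total device reads, buffered run
dio_reads = 8 / 0.0004          # total device reads, DIO run
print(round(buffered_reads))    # ~150,763
print(round(dio_reads))         # ~20,000
print(round(buffered_reads / dio_reads, 1))  # ~7.5x more disk reads when buffered
```

Roughly 7.5x more reads missed the page cache and hit the device under buffered compaction, consistent with compaction writeback evicting hot read data from the cache.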
> Direct IO support for compaction writes
> ---------------------------------------
>
> Key: CASSANDRA-21134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21134
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.x
>
> Attachments: image-2026-02-11-17-22-58-361.png,
> image-2026-02-11-17-25-58-329.png
>
>
> Follow-up from the implementation for compaction reads (CASSANDRA-19987).
> Notable points:
> * Update the start-up check that impacts DIO writes
> ({_}checkKernelBug1057843{_})
> * RocksDB uses a 1 MB flush buffer; ours should be configurable and
> performance tested (256 KB vs 1 MB)
> * Introduce compaction_write_disk_access_mode /
> background_write_disk_access_mode
> * Support for the compressed path would be most beneficial
--
This message was sent by Atlassian Jira
(v8.20.10#820010)