[ 
https://issues.apache.org/jira/browse/CASSANDRA-21393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Lightfoot updated CASSANDRA-21393:
--------------------------------------
    Description: 
Cassandra's write paths (commitlog, memtable flush, compaction output, hints, 
system tables) currently issue writes to disk without any indication that they 
belong to distinct streams with very different expected lifetimes. On modern 
enterprise SSDs, this causes the device to interleave data with mixed 
deathtimes into the same physical superblock, which inflates the SSD's internal 
write amplification (typically 1.9-2.7x on enterprise NVMe under realistic 
skewed workloads) as the SSD's internal garbage collector must relocate 
still-valid pages when reclaiming space.
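
As a rough way to read the WAF range above, here is a minimal sketch in Python. It assumes a simplified steady-state model (an assumption for illustration, not something taken from the cited paper) in which every superblock the GC reclaims still holds the same fraction of valid pages. The function names are hypothetical.

```python
def waf_from_valid_fraction(valid_fraction: float) -> float:
    """WAF under a simplified model where each reclaimed superblock still
    holds `valid_fraction` of live pages that GC must relocate.

    Per reclaimed superblock, (1 - u) of the space is freed for new host
    writes while u is rewritten by GC, so
    WAF = (host + relocated) / host = 1 / (1 - u).
    """
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)


def valid_fraction_from_waf(waf: float) -> float:
    """Inverse of the above: live fraction implied by an observed WAF."""
    if waf < 1.0:
        raise ValueError("WAF cannot be below 1")
    return 1.0 - 1.0 / waf


if __name__ == "__main__":
    for waf in (1.0, 1.9, 2.7):
        live = valid_fraction_from_waf(waf)
        print(f"WAF {waf:.1f} -> about {live:.0%} of pages live at reclaim")
```

Under this model, a WAF of 1.9-2.7x corresponds to roughly 47-63% of the pages in a reclaimed superblock still being valid, and a WAF of 1 corresponds to superblocks that are fully invalid when reclaimed.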

NVMe Flexible Data Placement (FDP, NVMe TP 4146, ratified late 2022) lets the 
host attach an 8-bit Placement Identifier to each write, which the device uses 
to route writes from different streams into separate Reclaim Unit Handles 
(RUHs). When streams with similar deathtimes share an RUH and streams with 
different deathtimes are kept separate, the SSD's internal GC observes 
superblocks that become fully invalid as a unit, driving SSD WAF toward 1.
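
The effect of separating streams can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not a model of any real device: superblocks are plain lists of pages, a placement key stands in for an RUH, and the stream names and sizes are made up for illustration.

```python
def pack(writes, pages_per_superblock, separate_by_stream):
    """Pack (stream, page_id) writes into superblocks.

    With separate_by_stream=False every write lands in one shared open
    superblock (no placement hints). With True, each stream gets its own
    open superblock, which is what a per-stream placement ID would allow.
    """
    open_blocks = {}  # placement key -> superblock currently being filled
    sealed = []
    for stream, page_id in writes:
        key = stream if separate_by_stream else "shared"
        block = open_blocks.setdefault(key, [])
        block.append((stream, page_id))
        if len(block) == pages_per_superblock:
            sealed.append(block)
            del open_blocks[key]
    sealed.extend(block for block in open_blocks.values() if block)
    return sealed


def gc_relocations(superblocks, dead_streams):
    """Count live pages GC must copy to reclaim every superblock that
    contains at least one page from a stream whose data has died."""
    relocated = 0
    for block in superblocks:
        live = [stream for stream, _ in block if stream not in dead_streams]
        if len(live) < len(block):  # some space here is reclaimable
            relocated += len(live)
    return relocated


if __name__ == "__main__":
    # Interleaved short-lived and long-lived writes (illustrative names).
    writes = [("commitlog" if i % 2 == 0 else "compaction", i) for i in range(64)]
    for separate in (False, True):
        blocks = pack(writes, pages_per_superblock=8, separate_by_stream=separate)
        print("separated" if separate else "mixed",
              gc_relocations(blocks, dead_streams={"commitlog"}))
```

With these inputs the mixed layout needs 32 page relocations to reclaim the space freed by the dead stream, while the separated layout needs none because the superblocks holding that stream are invalid as a unit.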

Cassandra is well-positioned to benefit from FDP because its write streams have 
naturally distinct lifetime characteristics that the storage layer cannot infer 
on its own:
 * Commitlog segments are deleted on a rolling schedule decoupled from any 
SSTable.
 * Memtable flushes produce L0 SSTables with very short expected lifetimes.
 * Compaction outputs at higher levels live progressively longer.
 * Hints, system tables, and repair streams have their own distinct rewrite 
cadences.
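
One possible way to express the lifetime classes above as placement IDs is sketched below in Python. Everything here is hypothetical and for discussion only: the `WriteStream` names, the level threshold that splits compaction output, and the `available_handles` fallback are assumptions, not an existing Cassandra or NVMe API.

```python
from enum import IntEnum


class WriteStream(IntEnum):
    """Hypothetical placement IDs, one per lifetime class listed above."""
    COMMITLOG = 0
    FLUSH_L0 = 1
    COMPACTION_LOW = 2
    COMPACTION_HIGH = 3
    HINTS = 4
    SYSTEM = 5
    REPAIR_STREAM = 6


MAX_PLACEMENT_ID = 0xFF    # 8-bit identifier, as stated in the description
HIGH_LEVEL_THRESHOLD = 2   # assumed split between short- and long-lived levels

_BY_KIND = {
    "commitlog": WriteStream.COMMITLOG,
    "flush": WriteStream.FLUSH_L0,
    "hints": WriteStream.HINTS,
    "system": WriteStream.SYSTEM,
    "repair": WriteStream.REPAIR_STREAM,
}


def classify(kind: str, level: int = 0) -> WriteStream:
    """Map a write kind (and compaction target level) to a lifetime class."""
    if kind == "compaction":
        if level >= HIGH_LEVEL_THRESHOLD:
            return WriteStream.COMPACTION_HIGH
        return WriteStream.COMPACTION_LOW
    try:
        return _BY_KIND[kind]
    except KeyError:
        raise ValueError(f"unknown write kind: {kind!r}") from None


def placement_id(kind: str, level: int = 0, available_handles: int = 8) -> int:
    """Return the ID to attach to a write. If the device exposes fewer
    handles than there are classes, the overflow classes share the last one."""
    if not 1 <= available_handles <= MAX_PLACEMENT_ID + 1:
        raise ValueError("available_handles must fit in an 8-bit ID space")
    return min(int(classify(kind, level)), available_handles - 1)


if __name__ == "__main__":
    print(placement_id("commitlog"))
    print(placement_id("compaction", level=3))
    print(placement_id("repair", available_handles=4))
```

Folding overflow classes onto the last handle is only one choice; which classes should share a handle on devices with few RUHs is part of what the investigation would need to measure.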

Mixing these at the device layer is pure SSD-WAF cost with no upside. Recent 
work in the database community (Lee et al., VLDB 2026, "How to Write to SSDs") 
demonstrates that exposing this kind of host-side workload knowledge to the SSD 
via FDP can eliminate SSD-level write amplification on commodity devices, with 
corresponding gains in throughput and SSD endurance.

Reference: How to Write to SSDs: 
[https://www.vldb.org/pvldb/vol19/p1469-lee.pdf]

  was:
Follow-up from the implementation for compaction reads (CASSANDRA-19987)

Notable points
 * Update the start-up check that impacts DIO writes 
(checkKernelBug1057843)
 * RocksDB uses a 1 MB flush buffer. This should be configurable and 
performance tested (256 KB vs 1 MB)
 * Introduce compaction_write_disk_access_mode / 
background_write_disk_access_mode
 * Support for the compressed path would be most beneficial

 
|OperationType|Writes Data|Direct IO|Reason|
|COMPACTION|Yes|Yes| |
|TOMBSTONE_COMPACTION|Yes|Yes| |
|MAJOR_COMPACTION|Yes|Yes| |
|CLEANUP|Yes|Yes| |
|UPGRADE_SSTABLES|Yes|Yes| |
|GARBAGE_COLLECT|Yes|Yes| |
|ANTICOMPACTION|Yes|Yes| |
|WRITE|Yes|Yes| |
|STREAM|Yes|Yes|Supersedes CASSANDRA-20087|
|SCRUB|Yes|No|Maybe: uses `mark()`/`resetAndTruncate()` for corrupt rollback. Requires additional plumbing.|
|FLUSH|Yes|No|Excluded: flushed data benefits from page cache due to read recency|


> Investigate: Add FDP (Flexible Data Placement) hints to write paths to reduce 
> SSD write amplification
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21393
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21393
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/SSTable
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 6.x
>
>


