[
https://issues.apache.org/jira/browse/CASSANDRA-21393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Lightfoot updated CASSANDRA-21393:
--------------------------------------
Description:
Cassandra's write paths (commitlog, memtable flush, compaction output, hints,
system tables) currently issue writes to disk without any indication that they
belong to distinct streams with very different expected lifetimes. On modern
enterprise SSDs, this causes the device to interleave data with mixed
deathtimes into the same physical superblock, which inflates the SSD's internal
write amplification (typically 1.9-2.7x on enterprise NVMe under realistic
skewed workloads) as the SSD's internal garbage collector must relocate
still-valid pages when reclaiming space.
NVMe Flexible Data Placement (FDP, NVMe TP 4146, ratified late 2022) lets the
host attach an 8-bit Placement Identifier to each write, which the device uses
to route writes from different streams into separate Reclaim Unit Handles
(RUHs). When streams with similar deathtimes share an RUH and streams with
different deathtimes are kept separate, the SSD's internal GC observes
superblocks that become fully invalid as a unit, driving the SSD's write
amplification factor (WAF) toward 1.
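As a rough illustration of why fully invalid superblocks matter, the sketch
below uses a simplified steady-state GC model. The model is an assumption of
this example, not a measurement made for this ticket: if the superblocks the GC
reclaims still hold a fraction v of valid pages, the device relocates v of a
superblock to free 1 - v of it, giving WAF = 1 / (1 - v). The class and method
names are hypothetical.

```java
/** Simplified GC model (an assumption for illustration, not a device measurement). */
final class WafSketch {
    /**
     * Reclaiming a superblock whose pages are still valid with fraction v costs
     * v relocated pages per (1 - v) pages freed for new host data, so each host
     * page costs 1 + v / (1 - v) = 1 / (1 - v) physical page writes.
     */
    static double waf(double validFraction) {
        // The negated form also rejects NaN.
        if (!(validFraction >= 0.0 && validFraction < 1.0)) {
            throw new IllegalArgumentException("validFraction must be in [0, 1)");
        }
        return 1.0 / (1.0 - validFraction);
    }

    public static void main(String[] args) {
        // GC victims fully invalid (streams separated by deathtime)
        // versus half valid (streams with mixed deathtimes).
        System.out.println(waf(0.0));
        System.out.println(waf(0.5));
    }
}
```

Under this model, victims that are half valid double the physical writes, while
victims that are fully invalid need no relocation at all.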
Cassandra is well-positioned to benefit from FDP because its write streams have
naturally distinct lifetime characteristics that the storage layer cannot infer
on its own:
* Commitlog segments are deleted on a rolling schedule decoupled from any
SSTable.
* Memtable flushes produce L0 SSTables with very short expected lifetimes.
* Compaction outputs at higher levels live progressively longer.
* Hints, system tables, and repair streams have their own distinct rewrite
cadences.
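One possible shape for the host-side mapping, as a minimal sketch: each stream
listed above gets its own placement identifier, and the remaining streams are
folded into the last identifier when the device exposes fewer handles than
there are streams. WriteStream, PlacementPolicy, the split of compaction into
two classes, and the folding rule are all hypothetical. Nothing here is an
existing Cassandra or NVMe API, and how the identifier is attached to the
actual write is out of scope.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stream classes, named after the write paths listed above. */
enum WriteStream {
    COMMITLOG, FLUSH, COMPACTION_LOW, COMPACTION_HIGH, HINTS, SYSTEM, REPAIR_STREAM
}

/** Hypothetical policy that assigns one placement identifier per stream. */
final class PlacementPolicy {
    private final Map<WriteStream, Integer> ids = new EnumMap<>(WriteStream.class);

    /** @param availableHandles number of handles the device exposes (assumed known) */
    PlacementPolicy(int availableHandles) {
        if (availableHandles < 1) {
            throw new IllegalArgumentException("need at least one handle");
        }
        for (WriteStream stream : WriteStream.values()) {
            // One handle per stream while handles remain; later streams
            // share the last handle (an assumed fallback, not a tuned choice).
            ids.put(stream, Math.min(stream.ordinal(), availableHandles - 1));
        }
    }

    int placementIdFor(WriteStream stream) {
        return ids.get(stream);
    }
}
```

With four handles, for example, COMMITLOG, FLUSH, and COMPACTION_LOW each get a
dedicated identifier and the remaining streams share the fourth. Which streams
should share a handle is exactly the kind of question this investigation would
need to answer with measurements.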
Mixing these at the device layer is pure SSD-WAF cost with no upside. Recent
work in the database community (Lee et al., VLDB 2026, "How to Write to SSDs")
demonstrates that exposing this kind of host-side workload knowledge to the SSD
via FDP can eliminate SSD-level write amplification on commodity devices, with
corresponding gains in throughput and SSD endurance.
Reference: How to Write to SSDs:
[https://www.vldb.org/pvldb/vol19/p1469-lee.pdf]
was:
Follow-up from the implementation for compaction reads (CASSANDRA-19987)
Notable points
* Update the start-up check that impacts DIO writes
(checkKernelBug1057843)
* RocksDB uses a 1 MB flush buffer. This should be configurable and performance
tested (256 KB vs 1 MB)
* Introduce compaction_write_disk_access_mode /
background_write_disk_access_mode
* Support for the compressed path would be most beneficial
|OperationType|Writes Data|Direct IO|Reason|
|COMPACTION|Yes|Yes| |
|TOMBSTONE_COMPACTION|Yes|Yes| |
|MAJOR_COMPACTION|Yes|Yes| |
|CLEANUP|Yes|Yes| |
|UPGRADE_SSTABLES|Yes|Yes| |
|GARBAGE_COLLECT|Yes|Yes| |
|ANTICOMPACTION|Yes|Yes| |
|WRITE|Yes|Yes| |
|STREAM|Yes|Yes|Supersedes CASSANDRA-20087|
|SCRUB|Yes|No|Maybe: uses `mark()`/`resetAndTruncate()` for corrupt rollback. Requires additional plumbing.|
|FLUSH|Yes|No|Excluded: flushed data benefits from page cache due to read recency|
> Investigate: Add FDP (Flexible Data Placement) hints to write paths to reduce
> SSD write amplification
> -----------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-21393
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21393
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/SSTable
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 6.x
>
>