[
https://issues.apache.org/jira/browse/CASSANDRA-21393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Lightfoot updated CASSANDRA-21393:
--------------------------------------
Fix Version/s: (was: 6.x)
> Investigate: Add FDP (Flexible Data Placement) hints to write paths to reduce
> SSD write amplification
> -----------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-21393
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21393
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/SSTable
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
>
> Cassandra's write paths (commitlog, memtable flush, compaction output, hints,
> system tables) currently issue writes to disk without indicating that they
> belong to distinct streams with very different expected lifetimes. On
> modern enterprise SSDs, this causes the device to interleave data with mixed
> death times in the same physical superblock. The SSD's internal garbage
> collector must then relocate still-valid pages when reclaiming space, which
> inflates the device's internal write amplification factor (WAF; typically
> 1.9-2.7x on enterprise NVMe under realistic skewed workloads).
> NVMe Flexible Data Placement (FDP, NVMe TP 4146, ratified late 2022) lets the
> host attach a 16-bit Placement Identifier to each write, which the device
> uses to route writes from different streams into separate Reclaim Unit
> Handles (RUHs). When streams with similar death times share an RUH and streams
> with different death times are kept separate, the SSD's internal GC sees
> superblocks that become fully invalid as a unit, driving SSD WAF toward 1.
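
To make the write-amplification arithmetic above concrete, here is a minimal sketch in Java. It is not taken from the ticket: the class `WafModel` and its method names are hypothetical, and `wafFromValidFraction` assumes a deliberately simplified steady state in which every superblock the GC reclaims still holds the same fraction of valid pages.

```java
/** Hypothetical helper for illustration; not part of Cassandra or any NVMe tooling. */
final class WafModel {
    private WafModel() {}

    /**
     * Write amplification factor: pages physically written to flash
     * (host writes plus GC relocations) divided by pages the host wrote.
     */
    static double waf(long hostPages, long relocatedPages) {
        if (hostPages <= 0 || relocatedPages < 0) {
            throw new IllegalArgumentException(
                "hostPages must be > 0 and relocatedPages must be >= 0");
        }
        return (double) (hostPages + relocatedPages) / (double) hostPages;
    }

    /**
     * Simplified model (an assumption, not a device specification): if each
     * reclaimed superblock still holds a fraction u of valid pages, the GC
     * rewrites u of the block and frees only (1 - u) of it for new host
     * data, so the device writes 1 / (1 - u) pages per host page.
     */
    static double wafFromValidFraction(double u) {
        if (!(u >= 0.0 && u < 1.0)) {
            throw new IllegalArgumentException("u must be in [0, 1)");
        }
        return 1.0 / (1.0 - u);
    }

    public static void main(String[] args) {
        // 900 relocated pages per 1000 host pages: low end of the quoted range.
        System.out.println(waf(1000, 900));
        // Superblocks that are fully invalid when reclaimed: nothing to relocate.
        System.out.println(wafFromValidFraction(0.0));
        // Half of every reclaimed superblock is still valid.
        System.out.println(wafFromValidFraction(0.5));
    }
}
```

Under this model, a superblock that is fully invalid when it is reclaimed (u = 0) costs no relocation at all, which is the "WAF toward 1" case described above. Relocating 900 pages for every 1000 host pages corresponds to the 1.9x low end of the quoted range.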
> Cassandra is well-positioned to benefit from FDP because its write streams
> have naturally distinct lifetime characteristics that the storage layer
> cannot infer on its own:
> * Commitlog segments are deleted on a rolling schedule decoupled from any
> SSTable.
> * Memtable flushes produce L0 SSTables with very short expected lifetimes.
> * Compaction outputs at higher levels live progressively longer.
> * Hints, system tables, and repair streams have their own distinct rewrite
> cadences.
> Mixing these at the device layer is pure SSD-WAF cost with no upside. Recent
> work in the database community (Lee et al., VLDB 2026, "How to Write to
> SSDs") demonstrates that exposing this kind of host-side workload knowledge
> to the SSD via FDP can eliminate SSD-level write amplification on commodity
> devices, with corresponding gains in throughput and SSD endurance.
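
As a rough illustration of the stream separation described above, the sketch below assigns each write stream from the list its own placement handle index. Everything here is hypothetical: `WriteStream`, `PlacementPlan`, and the fold-onto-the-last-handle policy are assumptions made for the example, not existing Cassandra classes or a proposed design, and the sketch does not show how a hint would actually reach the device.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stream classes, taken from the list in the ticket description. */
enum WriteStream {
    COMMITLOG,
    MEMTABLE_FLUSH,
    COMPACTION_OUTPUT,
    HINTS,
    SYSTEM_TABLES,
    REPAIR_STREAM
}

/** Hypothetical planner: one placement handle index per stream where possible. */
final class PlacementPlan {
    private PlacementPlan() {}

    /**
     * Gives each stream its own handle index in declaration order. If the
     * device exposes fewer handles than there are streams, the remaining
     * streams share the last handle (an arbitrary policy chosen for brevity).
     */
    static Map<WriteStream, Integer> assign(int availableHandles) {
        if (availableHandles < 1) {
            throw new IllegalArgumentException("need at least one placement handle");
        }
        Map<WriteStream, Integer> plan = new EnumMap<>(WriteStream.class);
        for (WriteStream stream : WriteStream.values()) {
            plan.put(stream, Math.min(stream.ordinal(), availableHandles - 1));
        }
        return plan;
    }

    public static void main(String[] args) {
        // With 8 handles every stream is isolated; with 2, all but COMMITLOG share one.
        System.out.println(assign(8));
        System.out.println(assign(2));
    }
}
```

A real policy would probably also split compaction output by level, since the description notes that higher levels live progressively longer, and would choose which streams to merge by similarity of lifetime when handles are scarce.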
> Reference: How to Write to SSDs:
> [https://www.vldb.org/pvldb/vol19/p1469-lee.pdf]