[
https://issues.apache.org/jira/browse/CASSANDRA-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Lightfoot updated CASSANDRA-21382:
--------------------------------------
Description:
Follow-up: Direct I/O writes for zero-copy streaming receiver
*Summary*
Extend _background_write_disk_access_mode=direct_ (CASSANDRA-21134) to the
zero-copy streaming (ZCS) receiver. Today ZCS is zero-copy on the sender only;
the receiver writes through the page cache and evicts hot read data, the same
problem CASSANDRA-21134 solved for compaction.
*Current state (after 21134)*
Direct I/O (DIO) engages only through _DataComponent.buildWriter(...)_. Two
receiver paths exist:
|Path|Receiver writer|Through _DataComponent.buildWriter_?|DIO-eligible?|
|Chunked streaming (_CassandraStreamReader_ / _CassandraCompressedStreamReader_)|_BigTableWriter_ / _BtiTableWriter_ → _DataComponent.buildWriter(..., OperationType.STREAM, ...)_|Yes|Yes (compressed tables)|
|Zero-copy streaming (_CassandraEntireSSTableStreamReader_)|_SSTableZeroCopyWriter_ → _ZeroCopySequentialWriter_ extends _SequentialWriter_|*No*|*No (always buffered)*|
*Why ZCS bypasses DIO today*
_ZeroCopySequentialWriter_ extends _SequentialWriter_ and passes no
_extraOpenOptions_. The channel opens with `READ + WRITE` only, which yields a
buffered `FileChannel`. Every `writeDirectlyToChannel` call is a plain
`channel.write` through the page cache.
"Zero-copy" describes the sender (`sendfile` / `FileRegion`). The receiver is
buffered.
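The difference between a buffered and a direct channel comes down to one open
option. A minimal Java sketch, assuming a JDK that ships
_com.sun.nio.file.ExtendedOpenOption_ (JDK 10 or later). The names
_ChannelOpenSketch_, _openOptions_ and _open_ are hypothetical and are not
Cassandra APIs:

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not Cassandra's SequentialWriter code.
final class ChannelOpenSketch {
    // Builds the option set for an existing file. READ + WRITE alone gives a
    // buffered channel; adding DIRECT asks the OS to bypass the page cache.
    static Set<OpenOption> openOptions(boolean direct) {
        Set<OpenOption> options = new LinkedHashSet<>();
        options.add(StandardOpenOption.READ);
        options.add(StandardOpenOption.WRITE);
        if (direct) {
            options.add(ExtendedOpenOption.DIRECT);
        }
        return options;
    }

    // Opens the channel. With direct == true the open can fail on
    // filesystems that do not support direct I/O.
    static FileChannel open(Path file, boolean direct) throws IOException {
        return FileChannel.open(file, openOptions(direct));
    }
}
```

A direct channel also requires block-aligned buffers, positions and lengths
on every write, which a buffered channel does not.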
*Why this matters*
The CASSANDRA-21134 argument applies unchanged: streamed-in data is unlikely to
be read soon, yet it evicts hot working sets on its way through the page cache.
On bootstrap-heavy nodes ZCS is the primary ingestion path. Leaving it buffered
defeats half of CASSANDRA-21134: compaction is cache-safe, but the
higher-throughput stream still pollutes the cache.
*Why it's not a trivial extension*
_DirectCompressedSequentialWriter_ sits inside the compression hierarchy, with
an aligned buffer between the chunk producer and the channel. ZCS has no such
hierarchy — bytes arrive pre-formed off the wire, in arbitrary sizes, for every
component (Data, Index, Statistics, CompressionInfo, Filter, Digest, CRC, TOC).
*Out of scope*
- Sender side: already zero-copy via _sendfile_ / _FileRegion_.
was:
Follow-up: Direct I/O writes for zero-copy streaming receiver
*Summary*
Extend _background_write_disk_access_mode=direct_ (CASSANDRA-21134) to the
zero-copy streaming (ZCS) receiver. Today ZCS is zero-copy on the sender only;
the receiver writes through the page cache and evicts hot read data, the same
problem CASSANDRA-21134 solved for compaction.
*Current state (after 21134)*
Direct I/O (DIO) engages only through _DataComponent.buildWriter(...)_. Two
receiver paths exist:
|Path|Receiver writer|Through _DataComponent.buildWriter_?|DIO-eligible?|
|Chunked streaming (_CassandraStreamReader_ / _CassandraCompressedStreamReader_)|_BigTableWriter_ / _BtiTableWriter_ → _DataComponent.buildWriter(..., OperationType.STREAM, ...)_|Yes|Yes (compressed tables)|
|Zero-copy streaming (_CassandraEntireSSTableStreamReader_)|_SSTableZeroCopyWriter_ → _ZeroCopySequentialWriter_ extends _SequentialWriter_|*No*|*No (always buffered)*|
*Why ZCS bypasses DIO today*
_ZeroCopySequentialWriter_ extends _SequentialWriter_ and passes no
_extraOpenOptions_. The channel opens with `READ + WRITE` only, which yields a
buffered `FileChannel`. Every `writeDirectlyToChannel` call is a plain
`channel.write` through the page cache.
"Zero-copy" describes the sender (`sendfile` / `FileRegion`). The receiver is
buffered.
*Why this matters*
The CASSANDRA-21134 argument applies unchanged: streamed-in data is unlikely to
be read soon, yet it evicts hot working sets on its way through the page cache.
On bootstrap-heavy nodes ZCS is the primary ingestion path. Leaving it buffered
defeats half of CASSANDRA-21134: compaction is cache-safe, but the
higher-throughput stream still pollutes the cache.
*Why it's not a trivial extension*
_DirectCompressedSequentialWriter_ sits inside the compression hierarchy, with
an aligned buffer between the chunk producer and the channel. ZCS has no such
hierarchy — bytes arrive pre-formed off the wire, in arbitrary sizes, for every
component (Data, Index, Statistics, CompressionInfo, Filter, Digest, CRC, TOC).
|Aspect|_DirectCompressedSequentialWriter_|ZCS receiver (proposed)|
|Writes|Compressed chunks at known sizes|Pre-formed bytes off the wire, arbitrary sizes|
|Components|Data only|All: Data, Index, Statistics, CompressionInfo, Filter, Digest, CRC, TOC|
|Alignment|Aligned buffer in the compression layer|Aligned wrapper at the _SequentialWriter_ level|
|Policy fit|Strong: compaction output is cold|Same: streamed-in bytes are cold|
|Per-component fit|Net win on Data|Mixed: small components pay overhead for little cache benefit|
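To illustrate the alignment row: a wrapper can absorb arbitrary-size writes
into a block-aligned buffer and hand only whole blocks to the channel. The
sketch below is an illustration under assumptions, not the writer this ticket
proposes. The class name _AlignedWriteSketch_ is hypothetical, the block size
is supplied by the caller, and zero-padding the tail and returning the logical
length for a later truncate is one possible way to handle the final partial
block.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical illustration of block-aligned buffering; not Cassandra code.
final class AlignedWriteSketch {
    private final WritableByteChannel sink;
    private final int blockSize;      // must be a power of two
    private final ByteBuffer buffer;  // address-aligned, capacity is a multiple of blockSize
    private long logicalLength;       // bytes the caller actually wrote

    AlignedWriteSketch(WritableByteChannel sink, int blockSize, int blocksPerFlush) {
        this.sink = sink;
        this.blockSize = blockSize;
        // Over-allocate by one block so the aligned slice keeps at least
        // blocksPerFlush whole blocks.
        this.buffer = ByteBuffer.allocateDirect((blocksPerFlush + 1) * blockSize)
                                .alignedSlice(blockSize);
    }

    // Accepts input of any size; the sink only ever sees full buffers here.
    void write(ByteBuffer src) throws IOException {
        logicalLength += src.remaining();
        while (src.hasRemaining()) {
            int n = Math.min(src.remaining(), buffer.remaining());
            ByteBuffer chunk = src.duplicate();
            chunk.limit(chunk.position() + n);
            buffer.put(chunk);
            src.position(src.position() + n);
            if (!buffer.hasRemaining()) {
                flush();
            }
        }
    }

    // Pads the tail with zeros up to a block boundary, flushes it, and returns
    // the logical length so the caller can truncate the file afterwards.
    long finish() throws IOException {
        while (buffer.position() % blockSize != 0) {
            buffer.put((byte) 0);
        }
        flush();
        return logicalLength;
    }

    private void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            sink.write(buffer);
        }
        buffer.clear();
    }
}
```

With a 4096-byte block, writing 5000 bytes and then 123 bytes leaves 8192
bytes in the sink and a logical length of 5123. The gap between the two is what
a truncate after the final write would remove.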
*Proposed approach*
# New writer _AlignedDirectSequentialWriter_ wrapping _SequentialWriter_ with
an aligned buffer. Payload-agnostic: it operates on the byte stream, not on
compression chunks.
# Wire into _SSTableZeroCopyWriter.makeWriter_ behind the existing
_getBackgroundWriteDiskAccessMode()_ gate.
# Classify each component (same shape as
_DataComponent.buildDirectWriteSupport()_):
#* _SUPPORTED_: Data, Index, CompressionInfo (large, write-once, cold).
#* _UNSUPPORTED_POLICY_: Digest, CRC, TOC, Statistics, Filter, Summary
(small, possibly read-soon).
# Reuse the existing config gate. No new knob.
# Reuse the per-file _FileHandle.supportsDirectIO()_ fallback.
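The classification in step 3, combined with the config gate and per-file
fallback, could be expressed roughly as follows. This is a sketch only:
component names are plain strings, and _ZcsDirectWriteClassifier_, _classify_
and _useDirect_ are hypothetical names that do not reproduce the actual shape
of _DataComponent.buildDirectWriteSupport()_.

```java
import java.util.Set;

// Hypothetical illustration of the per-component policy; not Cassandra code.
final class ZcsDirectWriteClassifier {
    enum Support { SUPPORTED, UNSUPPORTED_POLICY }

    // Large, write-once, cold components, as listed in the proposal.
    private static final Set<String> DIRECT_COMPONENTS =
            Set.of("Data", "Index", "CompressionInfo");

    // Everything else (Digest, CRC, TOC, Statistics, Filter, Summary)
    // stays on the buffered path by policy.
    static Support classify(String component) {
        return DIRECT_COMPONENTS.contains(component)
                ? Support.SUPPORTED
                : Support.UNSUPPORTED_POLICY;
    }

    // Direct I/O is used only when the config gate is on, the file supports
    // it, and the component is classified as SUPPORTED.
    static boolean useDirect(String component,
                             boolean directModeConfigured,
                             boolean fileSupportsDirectIo) {
        return directModeConfigured
                && fileSupportsDirectIo
                && classify(component) == Support.SUPPORTED;
    }
}
```

Keeping the decision in one predicate means a component falls back to the
buffered writer whenever any of the three conditions fails.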
*Open questions / risks*
- *Mixed-mode receivers*: splitting DIO and buffered components dilutes the
cache benefit. Acceptable, since Data and Index carry the byte volume.
- *fsync*: replicate the post-truncate fsync fix from
_DirectCompressedSequentialWriter_ (see
_.claude/tasks/direct-io-writes/writer-context.md_, Issue 1).
- *Block-size detection*: per-component, as the chunked path does.
- *Tests*: clone _StreamingDirectWriteTest_ for ZCS
(_stream_entire_sstables=true_), plus a byte-equivalence check against the
buffered baseline.
- *Throttle interaction*: verify DIO alignment overhead doesn't starve the
sender under _entire_sstable_stream_throughput_outbound_.
*Out of scope*
- Sender side: already zero-copy via _sendfile_ / _FileRegion_.
- Post-completion read-side DIO: covered by
_compaction_read_disk_access_mode_.
- On-disk format: unchanged.
*References*
- _src/java/org/apache/cassandra/io/sstable/SSTableZeroCopyWriter.java_ -
current receiver writer.
- _src/java/org/apache/cassandra/io/util/SequentialWriter.java:116_ -
_openChannel(file, extraOptions...)_, the seam for
_ExtendedOpenOption.DIRECT_.
- _src/java/org/apache/cassandra/io/util/SequentialWriter.java:473_ -
_writeDirectlyToChannel(ByteBuffer)_, the per-write path.
- _src/java/org/apache/cassandra/io/compress/DirectCompressedSequentialWriter.java_ -
reference for aligned buffer, fsync, cleanup.
- _src/java/org/apache/cassandra/io/sstable/format/DataComponent.java:58_ -
_buildDirectWriteSupport()_, per-component classification pattern.
- _src/java/org/apache/cassandra/db/streaming/CassandraEntireSSTableStreamReader.java_ -
receiver-side stream reader; it routes incoming bytes through
_SSTableZeroCopyWriter_.
- CASSANDRA-21134 - parent ticket.
> Direct IO support for ZCS writes
> ---------------------------------
>
> Key: CASSANDRA-21382
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21382
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/SSTable
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 6.x