[ 
https://issues.apache.org/jira/browse/CASSANDRA-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Lightfoot updated CASSANDRA-21382:
--------------------------------------
    Description: 
Follow-up: Direct I/O writes for zero-copy streaming receiver

*Summary*

Extend _background_write_disk_access_mode=direct_ (CASSANDRA-21134) to the 
zero-copy streaming (ZCS) receiver. Today ZCS is zero-copy on the sender only; 
the receiver writes through the page cache and evicts hot read data, the same 
problem CASSANDRA-21134 solved for compaction.

*Current state (after 21134)*

Direct I/O (DIO) engages only through {_}DataComponent.buildWriter(...){_}. Two 
receiver paths exist:

|Path|Receiver writer|Through {_}DataComponent.buildWriter{_}?|DIO-eligible?|
|Chunked streaming (_CassandraStreamReader_ / _CassandraCompressedStreamReader_)|_BigTableWriter_ / _BtiTableWriter_ → {_}DataComponent.buildWriter(..., OperationType.STREAM, ...){_}|Yes|Yes (compressed tables)|
|Zero-copy streaming (_CassandraEntireSSTableStreamReader_)|_SSTableZeroCopyWriter_ → _ZeroCopySequentialWriter_ extends _SequentialWriter_|*No*|*No, always buffered*|

*Why ZCS bypasses DIO today*

_ZeroCopySequentialWriter_ extends _SequentialWriter_ with no 
{_}extraOpenOptions{_}. The channel opens with `READ + WRITE` only — a buffered 
`FileChannel`. Every `writeDirectlyToChannel` call is a plain `channel.write` 
through the page cache.

"Zero-copy" describes the sender (`sendfile` / `FileRegion`). The receiver is 
buffered.
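
The difference between the two channel modes can be sketched with plain JDK calls. This is a minimal illustration, not Cassandra code: the class _ChannelOpenSketch_ and its methods are hypothetical names, and the fallback behaviour is an assumption about how a caller might cope with filesystems that reject direct I/O.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Cassandra.
final class ChannelOpenSketch
{
    // READ + WRITE only: the buffered channel the description attributes to ZCS today.
    static Set<OpenOption> bufferedOptions()
    {
        Set<OpenOption> options = new LinkedHashSet<>();
        options.add(StandardOpenOption.READ);
        options.add(StandardOpenOption.WRITE);
        return options;
    }

    // The same options plus DIRECT, which asks the OS to bypass the page cache.
    static Set<OpenOption> directOptions()
    {
        Set<OpenOption> options = bufferedOptions();
        options.add(ExtendedOpenOption.DIRECT);
        return options;
    }

    // Tries a direct channel first when requested; falls back to a buffered one
    // if the platform or filesystem refuses DIRECT (assumed fallback policy).
    static FileChannel open(Path file, boolean preferDirect) throws IOException
    {
        if (preferDirect)
        {
            try
            {
                return FileChannel.open(file, directOptions());
            }
            catch (IOException | UnsupportedOperationException e)
            {
                // DIRECT not available here; use the buffered channel below.
            }
        }
        return FileChannel.open(file, bufferedOptions());
    }
}
```

Note that a channel opened with DIRECT only accepts writes whose buffer address, file position, and length are block-aligned, which is why the receiver cannot simply add the option.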

*Why this matters*

The CASSANDRA-21134 argument applies unchanged: streamed-in data is unlikely to 
be read soon, yet it evicts hot working sets on its way through the page cache.

On bootstrap-heavy nodes ZCS is the primary ingestion path. Leaving it buffered 
means the benefit of CASSANDRA-21134 is only partly realized: compaction is 
cache-safe, but the higher-throughput stream still pollutes the cache.

*Why it's not a trivial extension*

_DirectCompressedSequentialWriter_ sits inside the compression hierarchy, with 
an aligned buffer between the chunk producer and the channel. ZCS has no such 
hierarchy — bytes arrive pre-formed off the wire, in arbitrary sizes, for every 
component (Data, Index, Statistics, CompressionInfo, Filter, Digest, CRC, TOC).

|Aspect|_DirectCompressedSequentialWriter_|ZCS receiver (proposed)|
|Writes|Compressed chunks at known sizes|Pre-formed bytes off the wire, arbitrary sizes|
|Components|Data only|All: Data, Index, Statistics, CompressionInfo, Filter, Digest, CRC, TOC|
|Alignment|Aligned buffer in the compression layer|Aligned wrapper at the _SequentialWriter_ level|
|Policy fit|Strong: compaction output is cold|Same: streamed-in bytes are cold|
|Per-component fit|Net win on Data|Mixed: small components pay overhead for little cache benefit|
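
To make the alignment problem concrete, here is a rough sketch of how arbitrary-size writes could be staged in a block-aligned buffer so that only whole blocks reach the channel. It is not the proposed writer: _AlignedBufferSketch_, its fields, and the pad-then-truncate tail handling are assumptions made for illustration, and the sketch runs against any `FileChannel` rather than requiring one opened with DIRECT.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch; not the proposed AlignedDirectSequentialWriter.
final class AlignedBufferSketch
{
    private final FileChannel channel;
    private final int blockSize;
    private final ByteBuffer buffer;  // block-aligned address, capacity a multiple of blockSize
    private long filePosition = 0;    // stays a multiple of blockSize
    private long logicalLength = 0;   // bytes the caller actually wrote

    AlignedBufferSketch(FileChannel channel, int blockSize, int blocksPerFlush)
    {
        this.channel = channel;
        this.blockSize = blockSize;   // must be a power of two for alignedSlice
        int capacity = blockSize * blocksPerFlush;
        // Over-allocate by blockSize - 1 so an aligned slice of exactly `capacity` bytes fits.
        this.buffer = ByteBuffer.allocateDirect(capacity + blockSize - 1).alignedSlice(blockSize);
    }

    // Accepts a write of any size; only completely full buffers reach the channel.
    void write(ByteBuffer src) throws IOException
    {
        logicalLength += src.remaining();
        while (src.hasRemaining())
        {
            int n = Math.min(src.remaining(), buffer.remaining());
            int oldLimit = src.limit();
            src.limit(src.position() + n);  // expose only the part that fits
            buffer.put(src);
            src.limit(oldLimit);
            if (!buffer.hasRemaining())
                flush();
        }
    }

    // Pads the tail to a block boundary, writes it, trims the padding, then syncs.
    void finish() throws IOException
    {
        int padded = (buffer.position() + blockSize - 1) / blockSize * blockSize;
        while (buffer.position() < padded)
            buffer.put((byte) 0);
        flush();
        channel.truncate(logicalLength);
        channel.force(true); // sync after the truncate so the final length is durable
    }

    private void flush() throws IOException
    {
        buffer.flip();
        // A real direct channel may need extra care if a write is ever partial.
        while (buffer.hasRemaining())
            filePosition += channel.write(buffer, filePosition);
        buffer.clear();
    }
}
```

Writing chunks of 1, 4095, 4097, and 1807 bytes with a 4096-byte block size would issue one 8192-byte write and one padded 4096-byte write, then truncate the file back to 10000 bytes.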

*Proposed approach*

# New writer _AlignedDirectSequentialWriter_ wrapping _SequentialWriter_ with 
an aligned buffer. Payload-agnostic — operates on the byte stream, not on 
compression chunks.
# Wire into {_}SSTableZeroCopyWriter.makeWriter{_} behind the existing 
{_}getBackgroundWriteDiskAccessMode(){_} gate.
# Classify each component (same shape as 
{_}DataComponent.buildDirectWriteSupport(){_}):
#* _SUPPORTED_: Data, Index, CompressionInfo — large, write-once, cold.
#* {_}UNSUPPORTED_POLICY{_}: Digest, CRC, TOC, Statistics, Filter, Summary — 
small, possibly read-soon.
# Reuse the existing config gate. No new knob.
# Reuse the per-file {_}FileHandle.supportsDirectIO(){_} fallback.
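
The classification and gating in the steps above might look roughly like the following. This is only a sketch: _ZcsDirectWritePolicySketch_ is a hypothetical name, components are identified by plain strings instead of Cassandra's component types, and treating unknown components as _UNSUPPORTED_POLICY_ is an assumed conservative default.

```java
import java.util.Set;

// Hypothetical sketch of per-component classification; not Cassandra code.
final class ZcsDirectWritePolicySketch
{
    enum Support { SUPPORTED, UNSUPPORTED_POLICY }

    // Large, write-once, cold components named in the proposal.
    private static final Set<String> DIRECT_CANDIDATES = Set.of("Data", "Index", "CompressionInfo");

    // Anything not listed as a candidate stays buffered (assumed default).
    static Support classify(String componentName)
    {
        return DIRECT_CANDIDATES.contains(componentName)
               ? Support.SUPPORTED
               : Support.UNSUPPORTED_POLICY;
    }

    // Direct I/O is used only when the existing config gate is on, the file's
    // filesystem supports it, and the component is classified as SUPPORTED.
    static boolean useDirect(String componentName,
                             boolean directModeConfigured,
                             boolean fileSupportsDirectIO)
    {
        return directModeConfigured
               && fileSupportsDirectIO
               && classify(componentName) == Support.SUPPORTED;
    }
}
```

Keeping the decision in one pure function makes the three conditions (config gate, per-file fallback, component policy) easy to test independently of any I/O.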

*Open questions / risks*

 - *Mixed-mode receivers*: splitting DIO and buffered components dilutes the 
cache benefit. Acceptable — Data and Index carry the byte volume.
 - *fsync*: replicate the post-truncate fsync fix from 
_DirectCompressedSequentialWriter_ (see 
{_}.claude/tasks/direct-io-writes/writer-context.md{_}, Issue 1).
 - *Block-size detection*: per-component, as the chunked path does.
 - *Tests*: clone _StreamingDirectWriteTest_ for ZCS 
({_}stream_entire_sstables=true{_}), plus byte-equivalence against the buffered 
baseline.
 - *Throttle interaction*: verify DIO alignment overhead doesn't starve the 
sender under {_}entire_sstable_stream_throughput_outbound{_}.
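
For the block-size item, one JDK-level way to detect the block size per file is `FileStore.getBlockSize()`. The description does not say how the chunked path does this, so the sketch below is an assumption, as are the name _BlockSizeSketch_ and the 4096-byte fallback.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of per-file block-size detection; not Cassandra code.
final class BlockSizeSketch
{
    // Assumed fallback when the filesystem cannot report a usable block size.
    static final int FALLBACK_BLOCK_SIZE = 4096;

    static int detect(Path file)
    {
        try
        {
            long size = Files.getFileStore(file).getBlockSize();
            // Accept only positive powers of two that fit in an int.
            if (size > 0 && size <= Integer.MAX_VALUE && Long.bitCount(size) == 1)
                return (int) size;
        }
        catch (IOException | UnsupportedOperationException e)
        {
            // File store unavailable or block size not reported; use the fallback.
        }
        return FALLBACK_BLOCK_SIZE;
    }
}
```

Calling `detect` once per component file keeps the alignment correct even when components land on different filesystems.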

*Out of scope*

 - Sender side — already zero-copy via _sendfile_ / _FileRegion_.
 - Post-completion read-side DIO — covered by 
{_}compaction_read_disk_access_mode{_}.
 - On-disk format — unchanged.

*References*

 - {_}src/java/org/apache/cassandra/io/sstable/SSTableZeroCopyWriter.java{_} — 
current receiver writer.
 - {_}src/java/org/apache/cassandra/io/util/SequentialWriter.java:116{_} — 
{_}openChannel(file, extraOptions...){_}, the seam for 
{_}ExtendedOpenOption.DIRECT{_}.
 - {_}src/java/org/apache/cassandra/io/util/SequentialWriter.java:473{_} — 
{_}writeDirectlyToChannel(ByteBuffer){_}, the per-write path.
 - 
{_}src/java/org/apache/cassandra/io/compress/DirectCompressedSequentialWriter.java{_}
 — reference for aligned buffer, fsync, cleanup.
 - {_}src/java/org/apache/cassandra/io/sstable/format/DataComponent.java:58{_} 
— {_}buildDirectWriteSupport(){_}, per-component classification pattern.
 - 
{_}src/java/org/apache/cassandra/db/streaming/CassandraEntireSSTableStreamReader.java{_}
 (receiver-side reader; it routes incoming components through 
_SSTableZeroCopyWriter_).
 - CASSANDRA-21134 — parent ticket.


  was:
Follow-up: Direct I/O writes for zero-copy streaming receiver

*Summary*

Extend _background_write_disk_access_mode=direct_ (CASSANDRA-21134) to cover 
the {*}zero-copy streaming (ZCS) receiver path{*}. ZCS today is "zero-copy" 
only on the sender side; the receiver writes through the kernel page cache, 
competing with hot reads for memory just like uncached compaction output would.

*Current state (after 21134)*

_DirectCompressedSequentialWriter_ engages on the receiver only through 
{_}DataComponent.buildWriter(...){_}. Two streaming receiver paths exist:
|Path|Receiver writer|Goes through {_}DataComponent.buildWriter{_}?|Currently 
DIO-eligible?|
|Chunked streaming ({_}CassandraStreamReader{_} / 
{_}CassandraCompressedStreamReader{_})|_BigTableWriter_ / _BtiTableWriter_ → 
_DataComponent.buildWriter(..., OperationType.STREAM, ...)_|Yes|Yes (compressed 
tables, when configured)|
|Zero-copy streaming 
({_}CassandraEntireSSTableStreamReader{_})|_SSTableZeroCopyWriter_ → 
_ZeroCopySequentialWriter_ extends _SequentialWriter_|*No*|*No — always 
buffered*|

*Why ZCS bypasses DIO today*

_ZeroCopySequentialWriter_ is constructed at _SSTableZeroCopyWriter.makeWriter_ 
(~line 97):

It extends _SequentialWriter_ with no {_}extraOpenOptions{_}, so the channel is 
opened via `SequentialWriter.openChannel(file)` with `StandardOpenOption.READ + 
WRITE` only — a normal buffered `FileChannel`. Each 
`writeDirectlyToChannel(ByteBuffer)` is a plain `channel.write(buf)` through 
the page cache.

"Zero-copy" refers to the *sender* side (`sendfile`/`FileRegion` avoids 
userspace copies during transmit). The receiver still goes through the kernel 
buffer cache.

*Why this matters*

The cache-residency argument that motivated CASSANDRA-21134 applies identically 
to ZCS:
 - Bootstrapped/repaired data streamed onto a node is typically not read-soon.
 - Letting it pass through the page cache evicts hot read working sets and 
creates the same memory-pressure / latency-variance problem compaction output 
caused before DIO.

On bootstrap-heavy nodes (large rebuilds, host replacements), ZCS is the 
*primary* data ingestion path. Leaving it on buffered I/O means 
CASSANDRA-21134's benefit is partially realized — chunked compaction is 
cached-safe, but the much higher-throughput ZCS receive still pollutes the 
cache.

*Why it's not a trivial extension*

_DirectCompressedSequentialWriter_ plugs into the compression hierarchy: it 
sits between the compressed-chunk producer and the channel, with an aligned 
intermediate buffer. The ZCS path has no such hierarchy — bytes arrive 
pre-formed off the wire, in arbitrary framing sizes, for *every* SSTable 
component (Data, Index/Partitions, Statistics, CompressionInfo, Filter, Digest, 
CRC, TOC).


> Direct IO support for ZCS writes 
> ---------------------------------
>
>                 Key: CASSANDRA-21382
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21382
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/SSTable
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 6.x
>
>


