[ 
https://issues.apache.org/jira/browse/CASSANDRA-21382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Lightfoot updated CASSANDRA-21382:
--------------------------------------
    Description: 
Follow-up: Direct I/O writes for zero-copy streaming receiver

*Summary*

Extend _background_write_disk_access_mode=direct_ (CASSANDRA-21134) to cover 
the {*}zero-copy streaming (ZCS) receiver path{*}. ZCS today is "zero-copy" 
only on the sender side; the receiver writes through the kernel page cache, 
competing with hot reads for memory just as buffered compaction output did 
before that change.

*Current state (after 21134)*

_DirectCompressedSequentialWriter_ engages on the receiver only through 
{_}DataComponent.buildWriter(...){_}. Two streaming receiver paths exist:
|Path|Receiver writer|Goes through {_}DataComponent.buildWriter{_}?|Currently DIO-eligible?|
|Chunked streaming ({_}CassandraStreamReader{_} / {_}CassandraCompressedStreamReader{_})|_BigTableWriter_ / _BtiTableWriter_ → _DataComponent.buildWriter(..., OperationType.STREAM, ...)_|Yes|Yes (compressed tables, when configured)|
|Zero-copy streaming ({_}CassandraEntireSSTableStreamReader{_})|_SSTableZeroCopyWriter_ → _ZeroCopySequentialWriter_ extends _SequentialWriter_|*No*|*No, always buffered*|

*Why ZCS bypasses DIO today*

_ZeroCopySequentialWriter_ is constructed in _SSTableZeroCopyWriter.makeWriter_ 
(~line 97).

It extends _SequentialWriter_ with no {_}extraOpenOptions{_}, so the channel is 
opened via `SequentialWriter.openChannel(file)` with only 
`StandardOpenOption.READ` and `WRITE`, which yields a normal buffered 
`FileChannel`. Each `writeDirectlyToChannel(ByteBuffer)` is a plain 
`channel.write(buf)` through the page cache.

"Zero-copy" refers to the *sender* side (`sendfile`/`FileRegion` avoids 
userspace copies during transmit). The receiver still writes through the kernel 
page cache.
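
To make the difference in open options concrete, here is a minimal sketch. It is not Cassandra code: the class _ReceiverOpenOptions_ and the method _forMode_ are hypothetical names used for illustration only. It assumes JDK 10+, where `com.sun.nio.file.ExtendedOpenOption.DIRECT` is available, and only builds the option set without opening a file.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.nio.file.OpenOption;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not part of Cassandra): chooses FileChannel open options. */
final class ReceiverOpenOptions {
    private ReceiverOpenOptions() {}

    /**
     * Buffered mode mirrors the READ + WRITE options described above.
     * Direct mode additionally requests ExtendedOpenOption.DIRECT, which
     * asks the OS to bypass the page cache (O_DIRECT on Linux).
     */
    static Set<OpenOption> forMode(boolean direct) {
        Set<OpenOption> options = new LinkedHashSet<>();
        options.add(StandardOpenOption.READ);
        options.add(StandardOpenOption.WRITE);
        if (direct) {
            options.add(ExtendedOpenOption.DIRECT);
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println("buffered: " + forMode(false));
        System.out.println("direct:   " + forMode(true));
    }
}
```

The resulting set could be passed to `FileChannel.open(Path, Set, FileAttribute...)`. With `DIRECT` set, the JDK expects buffer addresses, file positions, and transfer sizes to be multiples of the file store's block size (`FileStore.getBlockSize()`), which the current `writeDirectlyToChannel` path does not guarantee.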

*Why this matters*

The cache-residency argument that motivated CASSANDRA-21134 applies identically 
to ZCS:
 - Bootstrapped/repaired data streamed onto a node is typically not read soon after it arrives.
 - Letting it pass through the page cache evicts hot read working sets and 
creates the same memory-pressure / latency-variance problem compaction output 
caused before DIO.

On bootstrap-heavy nodes (large rebuilds, host replacements), ZCS is the 
*primary* data ingestion path. Leaving it on buffered I/O means 
CASSANDRA-21134's benefit is only partially realized: compaction output is 
cache-safe, but the much higher-throughput ZCS receive still pollutes the 
cache.

*Why it's not a trivial extension*

_DirectCompressedSequentialWriter_ plugs into the compression hierarchy: it 
sits between the compressed-chunk producer and the channel, with an aligned 
intermediate buffer. The ZCS path has no such hierarchy: bytes arrive 
pre-formed off the wire, in arbitrary framing sizes, for *every* SSTable 
component (Data, Index/Partitions, Statistics, CompressionInfo, Filter, Digest, 
CRC, TOC). Direct I/O generally requires block-aligned buffers, file offsets, 
and write lengths, so these frames cannot simply be written to a direct channel 
as they arrive.
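
One possible shape for bridging arbitrary frame sizes to aligned writes is sketched below. This is an illustration under stated assumptions, not a proposed implementation: _AlignedStagingWriter_ and its methods are hypothetical names, the block size is assumed to be a power of two, and the caller is assumed to truncate the file to the returned logical length after the padded tail is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

/** Hypothetical sketch (not Cassandra code): re-frames arbitrary writes into block-sized ones. */
final class AlignedStagingWriter {
    private final WritableByteChannel channel;
    private final ByteBuffer staging;  // block-aligned address, capacity = blocks * blockSize
    private final int blockSize;
    private long logicalLength;        // bytes the caller actually supplied

    AlignedStagingWriter(WritableByteChannel channel, int blockSize, int blocks) {
        this.channel = channel;
        this.blockSize = blockSize;
        // Over-allocate by one block so alignedSlice (JDK 9+) can always find an
        // aligned region of the requested size; blockSize must be a power of two.
        ByteBuffer aligned = ByteBuffer.allocateDirect((blocks + 1) * blockSize).alignedSlice(blockSize);
        aligned.limit(blocks * blockSize);
        this.staging = aligned.slice();
    }

    /** Copies an arbitrarily sized frame into staging, flushing whenever staging fills up. */
    void write(ByteBuffer src) throws IOException {
        logicalLength += src.remaining();
        while (src.hasRemaining()) {
            int n = Math.min(src.remaining(), staging.remaining());
            ByteBuffer chunk = src.duplicate();
            chunk.limit(chunk.position() + n);
            staging.put(chunk);
            src.position(src.position() + n);
            if (!staging.hasRemaining()) {
                flush();
            }
        }
    }

    /** Zero-pads the tail to a block boundary, flushes it, and returns the logical length. */
    long finish() throws IOException {
        int tail = staging.position() % blockSize;
        if (tail != 0) {
            for (int i = blockSize - tail; i > 0; i--) {
                staging.put((byte) 0);
            }
        }
        if (staging.position() > 0) {
            flush();
        }
        return logicalLength;
    }

    // Every flush hands the channel a whole number of blocks.
    private void flush() throws IOException {
        staging.flip();
        while (staging.hasRemaining()) {
            channel.write(staging);
        }
        staging.clear();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        AlignedStagingWriter writer = new AlignedStagingWriter(Channels.newChannel(sink), 4096, 2);
        writer.write(ByteBuffer.wrap(new byte[5000]));
        long logical = writer.finish();
        System.out.println(logical + " logical bytes, " + sink.size() + " bytes written");
    }
}
```

The extra copy into the staging buffer is the cost of alignment here. Whether that is acceptable for every component, or only for the larger ones such as Data, is one of the open design questions for this ticket.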


> Direct IO support for ZCS writes 
> ---------------------------------
>
>                 Key: CASSANDRA-21382
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21382
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/SSTable
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 6.x


