[ 
https://issues.apache.org/jira/browse/CASSANDRA-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Lightfoot updated CASSANDRA-21094:
--------------------------------------
    Description: 
*Problem*

Cassandra performs sequential scans during compaction and streaming but does 
not inform the kernel of this access pattern. The kernel must infer the pattern 
heuristically, starting with conservative readahead (128KB default) and ramping 
up slowly. This results in suboptimal I/O, particularly on IOPS-constrained 
storage.

Cassandra currently uses {{posix_fadvise}} only for {{POSIX_FADV_DONTNEED}} to 
evict pages after reading, but does not hint sequential access intent 
beforehand.

*Proposed Solution*

Implement a dual file descriptor approach for SSTable access:
 * {_}Normal read path{_}: Existing FD for point queries (default behaviour)
 * {_}Sequential scan path{_}: Dedicated FD with {{POSIX_FADV_SEQUENTIAL}} for 
compaction and streaming

On Linux, {{POSIX_FADV_SEQUENTIAL}} doubles the kernel readahead window (e.g., 
128KB → 256KB). This lets the kernel prefetch data asynchronously while the 
application processes its current buffer, reducing the time a scan spends 
blocked on I/O.
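
A minimal sketch of the dual-FD idea, written in Python rather than Cassandra's Java purely for illustration. The function names ({{open_dual_fds}}, {{point_read}}, {{sequential_scan}}) are hypothetical and not part of Cassandra. It relies on the Linux behaviour that the hint is recorded on the open file description, so a second {{open()}} of the same file keeps the point-read descriptor unaffected.

```python
import os


def open_dual_fds(path):
    """Open the same file twice: one FD for point reads, one hinted for scans."""
    point_fd = os.open(path, os.O_RDONLY)
    scan_fd = os.open(path, os.O_RDONLY)
    # posix_fadvise is Unix-only; skip the hint where it is unavailable.
    if hasattr(os, "posix_fadvise"):
        # offset=0 and len=0 apply the advice to the whole file.
        os.posix_fadvise(scan_fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return point_fd, scan_fd


def point_read(point_fd, offset, length):
    """Positional read on the unhinted FD; does not move any file offset."""
    return os.pread(point_fd, length, offset)


def sequential_scan(scan_fd, chunk_size=256 * 1024):
    """Yield the file front to back in chunks from the hinted FD."""
    while True:
        chunk = os.read(scan_fd, chunk_size)
        if not chunk:
            break
        yield chunk
```

Both descriptors must be closed by the caller ({{os.close}}), which is where the extra FD cost noted under Considerations comes from.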

This optimisation complements the application-level readahead buffer introduced 
in CASSANDRA-15452. That ticket reduces syscall frequency by reading larger 
blocks (e.g., 256KB) per I/O operation; {{POSIX_FADV_SEQUENTIAL}} additionally 
lets the kernel prefetch subsequent blocks into the page cache in parallel, so 
data is more likely to be resident when the application requests its next 
buffer fill.
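
To illustrate how the two mechanisms combine, the sketch below reads a file through a reusable application-level buffer and also sets the kernel hint on the same descriptor. It is an assumption-laden Python illustration, not the CASSANDRA-15452 implementation; {{scan_with_buffer}} and its 256KB default are hypothetical. It returns the number of non-empty reads so the effect of the buffer size on syscall count can be observed.

```python
import os


def scan_with_buffer(path, buffer_size=256 * 1024):
    """Scan a file with a reusable buffer; return (bytes_read, read_calls)."""
    buf = bytearray(buffer_size)
    view = memoryview(buf)
    total = 0
    reads = 0
    # buffering=0 gives a raw file object, so each readinto maps to one read.
    with open(path, "rb", buffering=0) as f:
        if hasattr(os, "posix_fadvise"):
            # Kernel-side hint: expect sequential access on this descriptor.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            n = f.readinto(view)
            if not n:
                break
            total += n
            reads += 1
            # A real scanner would process view[:n] here.
    return total, reads
```

A larger {{buffer_size}} lowers the number of read calls; the hint works independently of that, by widening what the kernel fetches ahead of each call.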

*Considerations*
 * Increased FD usage during compaction (bounded by concurrent compactions × 
SSTables per compaction)
 * Only benefits buffered I/O modes (standard, mmap); not applicable to 
Direct I/O, which bypasses the page cache
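
As a rough worked example of the FD bound above, the following Python sketch multiplies the two factors and prints the result next to the process's soft descriptor limit. The input values (4 concurrent compactions, 32 SSTables each) are made up for illustration and are not Cassandra defaults.

```python
import resource


def extra_scan_fds(concurrent_compactions, sstables_per_compaction):
    """Worst-case number of additional scan FDs open at once."""
    return concurrent_compactions * sstables_per_compaction


def soft_fd_limit():
    """Soft RLIMIT_NOFILE for the current process (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


# Hypothetical inputs: 4 concurrent compactions, 32 SSTables each -> 128 FDs.
extra = extra_scan_fds(4, 32)
print(f"worst-case extra scan FDs: {extra}, soft FD limit: {soft_fd_limit()}")
```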

*Notes*

CockroachDB's Pebble implemented this dual-FD pattern 
([https://github.com/cockroachdb/pebble/pull/817]).

  was:
*Problem*

Cassandra performs sequential scans during compaction and streaming but does 
not inform the kernel of this access pattern. The kernel must infer the pattern 
heuristically, starting with conservative readahead (128KB default) and ramping 
up slowly. This results in suboptimal I/O, particularly on IOPS-constrained 
storage.

Cassandra currently uses {{posix_fadvise}} only for {{POSIX_FADV_DONTNEED}} to 
evict pages after reading, but does not hint sequential access intent 
beforehand.

*Proposed Solution*

Implement a dual file descriptor approach for SSTable access:
 * {_}Normal read path{_}: Existing FD for point queries (default behaviour)
 * {_}Sequential scan path{_}: Dedicated FD with {{POSIX_FADV_SEQUENTIAL}} for 
compaction and streaming

On Linux, {{POSIX_FADV_SEQUENTIAL}} doubles the kernel readahead window (e.g., 
128KB → 256KB). This enables the kernel to prefetch data asynchronously while 
the application processes its current buffer, eliminating I/O wait time.

*Considerations*
 * Increased FD usage during compaction (bounded by concurrent compactions × 
SSTables per compaction)
 * Only benefits buffered I/O modes (standard, mmap) — not applicable for 
Direct I/O

*Notes*

CockroachDB's Pebble implemented this dual-FD pattern 
([https://github.com/cockroachdb/pebble/pull/817]).


> Use POSIX_FADV_SEQUENTIAL for SSTable reads during compaction and streaming
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21094
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21094
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.x


