[ 
https://issues.apache.org/jira/browse/CASSANDRA-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Lightfoot updated CASSANDRA-21094:
--------------------------------------
    Description: 
*Problem*

Cassandra performs sequential scans during compaction and streaming but does 
not inform the kernel of this access pattern. The kernel must infer the pattern 
heuristically, starting with conservative readahead (128KB default) and ramping 
up slowly. This results in suboptimal I/O, particularly on IOPS-constrained 
storage.

Cassandra currently uses {{posix_fadvise}} only for {{POSIX_FADV_DONTNEED}} to 
evict pages after reading, but does not hint sequential access intent 
beforehand.

*Proposed Solution*

Implement a dual file descriptor approach for SSTable access:
 * {_}Normal read path{_}: Existing FD for point queries (default behaviour)
 * {_}Sequential scan path{_}: Dedicated FD with {{POSIX_FADV_SEQUENTIAL}} for 
compaction and streaming

On Linux, {{POSIX_FADV_SEQUENTIAL}} doubles the default readahead window for 
the advised file descriptor (e.g., 128KB → 256KB). This lets the kernel 
prefetch data asynchronously while the application processes its current 
buffer, reducing the time spent waiting on I/O.
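
A minimal sketch of the dual-FD idea, written in Python only because {{os.posix_fadvise}} makes it easy to run; Cassandra itself would issue the same syscall through its native bindings. The function names ({{open_dual_fds}}, {{sequential_scan}}, {{point_read}}) and the 64KB chunk size are hypothetical and not taken from the Cassandra codebase. It relies on Linux keeping the advice per open file description, so two separate {{open()}} calls on the same SSTable can carry different hints.

```python
import os

CHUNK = 64 * 1024  # read size per call; an arbitrary value for this sketch


def _advise(fd, advice_name):
    """Apply a posix_fadvise hint to the whole file, if the platform has it."""
    advice = getattr(os, advice_name, None)
    if hasattr(os, "posix_fadvise") and advice is not None:
        # offset=0, len=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, advice)


def open_dual_fds(path):
    """Open two independent descriptors on the same file.

    point_fd keeps the default readahead behaviour for point queries;
    scan_fd is hinted as sequential for compaction/streaming-style scans.
    """
    point_fd = os.open(path, os.O_RDONLY)
    scan_fd = os.open(path, os.O_RDONLY)
    _advise(scan_fd, "POSIX_FADV_SEQUENTIAL")
    return point_fd, scan_fd


def sequential_scan(scan_fd):
    """Read the file front to back and return the number of bytes seen."""
    total = 0
    while True:
        buf = os.read(scan_fd, CHUNK)
        if not buf:
            break
        total += len(buf)
    # Mirrors the existing behaviour described above: ask the kernel to
    # drop the pages once the scan has finished with them.
    _advise(scan_fd, "POSIX_FADV_DONTNEED")
    return total


def point_read(point_fd, offset, length):
    """Positional read on the unhinted descriptor."""
    return os.pread(point_fd, length, offset)
```

Both descriptors must be closed by the caller. Using {{dup()}} instead of a second {{open()}} would not work for this purpose, because duplicated descriptors share one open file description and therefore one hint.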

*Notes*

CockroachDB's Pebble implemented this dual-FD pattern 
([PR #817|https://github.com/cockroachdb/pebble/pull/817]).

  was:
*Problem*

Cassandra performs sequential scans during compaction and streaming but does 
not inform the kernel of this access pattern. The kernel must infer the pattern 
heuristically, starting with conservative readahead (128KB default) and ramping 
up slowly. This results in suboptimal I/O, particularly on IOPS-constrained 
storage.

Cassandra currently uses {{posix_fadvise}} only for {{POSIX_FADV_DONTNEED}} to 
evict pages after reading, but does not hint sequential access intent 
beforehand.

*Proposed Solution*

Implement a dual file descriptor approach for SSTable access:
 * {_}Normal read path{_}: Existing FD for point queries (default behaviour)
 * {_}Sequential scan path{_}: Dedicated FD with {{POSIX_FADV_SEQUENTIAL}} for 
compaction and streaming

On Linux, {{POSIX_FADV_SEQUENTIAL}} doubles the kernel readahead window (e.g., 
128KB → 256KB). This enables the kernel to prefetch data asynchronously while 
the application processes its current buffer, eliminating I/O wait time.

*Notes*

CockroachDB's Pebble implemented this dual-FD pattern ([PR 
#817|[https://github.com/cockroachdb/pebble/pull/817]])


> Use POSIX_FADV_SEQUENTIAL for SSTable reads during compaction and streaming
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21094
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21094
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.x
>


