[
https://issues.apache.org/jira/browse/CASSANDRA-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Lightfoot updated CASSANDRA-21094:
--------------------------------------
Description:
*Problem*
Cassandra performs sequential scans during compaction and streaming but does
not inform the kernel of this access pattern. The kernel must infer the pattern
heuristically, starting with conservative readahead (128KB default) and ramping
up slowly. This results in suboptimal I/O, particularly on IOPS-constrained
storage.
Cassandra currently uses {{posix_fadvise}} only for {{POSIX_FADV_DONTNEED}} to
evict pages after reading, but does not hint sequential access intent
beforehand.
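The existing evict-after-read usage can be sketched as follows. This is an illustration in Python, not Cassandra's Java code; the function name {{read_then_evict}} and the chunk size are hypothetical. It assumes a platform that exposes {{os.posix_fadvise}} and skips the hint elsewhere.

```python
import os


def read_then_evict(path, chunk_size=64 * 1024):
    """Read a file front to back, then hint that its cached pages can go.

    Returns the number of bytes read. Hypothetical helper for illustration.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Advisory only: offset 0 with length 0 covers the whole file.
        # Note that no hint about the access pattern was given before reading.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

The hint arrives only after the data has been read, so it does nothing to improve how the kernel sizes readahead during the scan itself.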
*Proposed Solution*
Implement a dual file descriptor approach for SSTable access:
* {_}Normal read path{_}: Existing FD for point queries (default behaviour)
* {_}Sequential scan path{_}: Dedicated FD with {{POSIX_FADV_SEQUENTIAL}} for
compaction and streaming
On Linux, {{POSIX_FADV_SEQUENTIAL}} doubles the kernel readahead window (e.g.,
128KB → 256KB). This lets the kernel prefetch more data asynchronously while
the application processes its current buffer, reducing time spent waiting on I/O.
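A minimal sketch of the dual-FD idea, written in Python for brevity rather than in Cassandra's Java. All function names here are hypothetical and are not part of any Cassandra API. The sketch opens the file twice instead of duplicating one descriptor, on the assumption (true on Linux) that the advice applies to the open file description, so a duplicated descriptor would share the hint with the point-read path.

```python
import os


def open_dual_fds(path):
    """Open two independent descriptors for one file: (point_fd, scan_fd).

    Only scan_fd receives the sequential hint; point_fd keeps the default
    readahead behaviour. Hypothetical helper for illustration.
    """
    point_fd = os.open(path, os.O_RDONLY)
    try:
        scan_fd = os.open(path, os.O_RDONLY)
    except OSError:
        os.close(point_fd)
        raise
    try:
        # Advisory only: offset 0 with length 0 covers the whole file.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(scan_fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    except OSError:
        os.close(scan_fd)
        os.close(point_fd)
        raise
    return point_fd, scan_fd


def sequential_scan(scan_fd, chunk_size=256 * 1024):
    """Yield the file's contents in order (compaction/streaming style)."""
    while True:
        chunk = os.read(scan_fd, chunk_size)
        if not chunk:
            break
        yield chunk


def point_read(point_fd, offset, length):
    """Read `length` bytes at `offset` without moving any file position."""
    return os.pread(point_fd, length, offset)
```

Keeping the two paths on separate descriptors means the larger readahead window applies only to scans, so point queries do not pull in data they will not use.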
*Notes*
CockroachDB's Pebble implemented this dual-FD pattern
([PR #817|https://github.com/cockroachdb/pebble/pull/817]).
> Use POSIX_FADV_SEQUENTIAL for SSTable reads during compaction and streaming
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-21094
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21094
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.x