[ 
https://issues.apache.org/jira/browse/CASSANDRA-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Lightfoot updated CASSANDRA-21094:
--------------------------------------
    Description: 
Problem:

Cassandra performs sequential scans during compaction and streaming but does 
not inform the kernel of this access pattern. The kernel must infer the pattern 
heuristically, starting with conservative readahead (128KB default) and ramping 
up slowly. This results in suboptimal I/O, particularly on IOPS-constrained 
storage.

Cassandra currently uses {{posix_fadvise}} only for {{POSIX_FADV_DONTNEED}} 
to evict pages after reading, but does not hint sequential access intent 
beforehand.
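
For illustration only, the read-then-evict pattern described above might look like the following Python sketch. This is not Cassandra's code: the function name {{read_then_evict}} and the chunk size are hypothetical, and it assumes a Linux host where {{os.posix_fadvise}} is available.

```python
import os


def read_then_evict(path, chunk_size=64 * 1024):
    """Read a file front to back, then hint that its cached pages can be dropped.

    Hypothetical helper for illustration; returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        try:
            # offset=0 and len=0 cover the whole file. The call is advisory:
            # the kernel may ignore it, so a failure is not treated as fatal.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        except OSError:
            pass
    finally:
        os.close(fd)
    return total
```

Note that no advice is given before the read loop starts, which is the gap this ticket describes.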

Proposed Solution:

Implement a dual file descriptor approach for SSTable access:
 * {_}Normal read path{_}: Existing FD for point queries (default behaviour)
 * {_}Sequential scan path{_}: Dedicated FD with {{POSIX_FADV_SEQUENTIAL}} for 
compaction and streaming

On Linux, {{POSIX_FADV_SEQUENTIAL}} doubles the kernel readahead window (e.g., 
128KB → 256KB). The larger window lets the kernel prefetch more data 
asynchronously while the application processes its current buffer, reducing 
the time spent waiting on I/O.
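
A minimal sketch of the dual-FD idea, in Python rather than Cassandra's Java, and assuming Linux. The class {{DualFdFile}} and its method names are hypothetical and are not taken from the patch. The second descriptor comes from a separate {{open()}} call rather than {{dup()}}, on the assumption that readahead state is tracked per open file description, so the hint on the scan descriptor should not change readahead for point reads.

```python
import os


class DualFdFile:
    """Hypothetical holder for two independent descriptors on one file."""

    def __init__(self, path):
        # Descriptor for point queries: default kernel readahead heuristics.
        self.random_fd = os.open(path, os.O_RDONLY)
        try:
            # Separate open() so the hint below applies to this descriptor only.
            self.scan_fd = os.open(path, os.O_RDONLY)
        except OSError:
            os.close(self.random_fd)
            raise
        try:
            # Advisory hint; offset=0 and len=0 cover the whole file.
            os.posix_fadvise(self.scan_fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        except OSError:
            pass  # the hint is an optimisation, so reads proceed without it

    def point_read(self, offset, length):
        """Read `length` bytes at `offset` through the unhinted descriptor."""
        return os.pread(self.random_fd, length, offset)

    def scan(self, chunk_size=64 * 1024):
        """Yield the file front to back through the sequential descriptor."""
        offset = 0
        while True:
            chunk = os.pread(self.scan_fd, chunk_size, offset)
            if not chunk:
                break
            offset += len(chunk)
            yield chunk

    def close(self):
        os.close(self.scan_fd)
        os.close(self.random_fd)
```

In this sketch a compaction or streaming task would iterate over {{scan()}}, while the normal read path keeps calling {{point_read()}} on the same file.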

Notes:

CockroachDB's Pebble implemented this dual-FD pattern ([PR 
#817|https://github.com/cockroachdb/pebble/pull/817]).

  was:
If we use direct io to read SSTables during compaction, we can avoid polluting 
the page cache with data we're about to delete.  As another side effect, we 
also evict pages to make room for whatever we're putting in.  This unnecessary 
churn leads to higher CPU overhead and can cause dips in client read latency, 
as we're going to be evicting pages that could be used to serve those reads.

This is most notable with STCS as the SSTables get larger, potentially evicting 
the entire hot dataset out of cache, but every compaction strategy is 
affected.

This is a follow up to be done after CASSANDRA-15452 since we will have an 
internal buffer.


> Use POSIX_FADV_SEQUENTIAL for SSTable reads during compaction and streaming
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21094
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21094
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.x
>
>


