[jira] [Commented] (CASSANDRASC-94) Reduce filesystem calls while streaming SSTables

Francisco Guerrero (Jira) Mon, 22 Jan 2024 11:18:36 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRASC-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809611#comment-17809611
 ]


Francisco Guerrero commented on CASSANDRASC-94:
-----------------------------------------------

> Do you think caching snapshots in the sidecar will be relevant with that in 
> place?

I think it is still relevant until we are able to leverage CASSANDRA-18111. For 
5.x, we can directly query snapshots from the snapshot virtual table 
{{org.apache.cassandra.db.virtual.SnapshotsTable}}.

One of the goals for Sidecar is to bridge gaps between different versions of 
Cassandra. We should probably have a follow-up Sidecar patch once 
https://issues.apache.org/jira/browse/CASSANDRA-18111 is merged.

Let me know your thoughts on that.

> Reduce filesystem calls while streaming SSTables
> ------------------------------------------------
>
>                 Key: CASSANDRASC-94
>                 URL: https://issues.apache.org/jira/browse/CASSANDRASC-94
>             Project: Sidecar for Apache Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Francisco Guerrero
>            Assignee: Francisco Guerrero
>            Priority: Normal
>              Labels: pull-request-available
>
> When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will 
> perform multiple filesystem calls:
> - Traverse the data directories to determine the keyspace / table path
> - Once found, determine whether the SSTable file exists under the snapshots 
> directory
> - Read the filesystem to obtain the file type and file size
> - Read the requested range of the file and stream it
> The number of filesystem calls is manageable when streaming a single SSTable, 
> but when one or more clients read multiple SSTables, for example during 
> Cassandra Analytics bulk reads, hundreds to thousands of requests are 
> performed, and every request repeats the filesystem calls above.
> This improvement proposes introducing several caches to reduce the number of 
> filesystem calls while streaming SSTables.
> - *snapshot list cache*: to maintain a cache of recently listed snapshot 
> files under a snapshot directory. This cache avoids having to access the 
> filesystem every time a bulk read client lists the snapshot directory.
> - *table dir cache*: to maintain a cache of recently streamed table directory 
> paths. This cache avoids traversing the filesystem in search of the table 
> directory, for example while running bulk reads. Since bulk reads can stream 
> tens to hundreds of SSTable components from a snapshot directory, this cache 
> avoids resolving the table directory each time.
> - *snapshot path cache*: to maintain a cache of recently streamed snapshot 
> SSTable components. This cache avoids having to resolve the snapshot SSTable 
> component path during bulk reads. Since bulk reads stream sub-ranges of an 
> SSTable component, the resolution can happen multiple times during bulk reads 
> for a single SSTable component.
> - *file props cache*: to maintain a cache of FileProps of recently streamed 
> files. This cache avoids having to validate file properties repeatedly during 
> bulk reads, where sub-ranges of an SSTable component are streamed and the 
> properties of the same file would otherwise be read multiple times.
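
To make the file props cache idea above concrete, here is a minimal sketch in 
plain Java. It is not the Sidecar implementation: the class name 
FilePropsCacheSketch, the Props holder, the TTL-based expiry and the 
filesystemReads counter are all hypothetical and only illustrate how repeated 
sub-range requests for the same file could share a single filesystem read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; names are illustrative and not taken from Sidecar. */
final class FilePropsCacheSketch {
    /** Minimal stand-in for the FileProps mentioned in the ticket. */
    static final class Props {
        final boolean regularFile;
        final long size;

        Props(boolean regularFile, long size) {
            this.regularFile = regularFile;
            this.size = size;
        }
    }

    /** Cached value plus the System.nanoTime() instant after which it is stale. */
    private static final class Entry {
        final Props props;
        final long expiresAtNanos;

        Entry(Props props, long expiresAtNanos) {
            this.props = props;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<Path, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlNanos;
    /** Counts real filesystem reads so the effect of the cache is observable. */
    final AtomicInteger filesystemReads = new AtomicInteger();

    FilePropsCacheSketch(long ttlMillis) {
        this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
    }

    /** Returns cached properties, reading the filesystem only on a miss or after expiry. */
    Props props(Path file) {
        long now = System.nanoTime();
        Entry entry = cache.compute(file, (path, current) -> {
            if (current != null && now - current.expiresAtNanos < 0) {
                return current; // still fresh, no filesystem access
            }
            return new Entry(readFromFilesystem(path), now + ttlNanos);
        });
        return entry.props;
    }

    private Props readFromFilesystem(Path file) {
        filesystemReads.incrementAndGet();
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            return new Props(attrs.isRegularFile(), attrs.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulates three sub-range requests for one file; returns the filesystem read count. */
    static int demo() {
        Path file = null;
        try {
            file = Files.createTempFile("nb-1-big-Data", ".db");
            Files.write(file, new byte[] {1, 2, 3, 4, 5});
            FilePropsCacheSketch cache = new FilePropsCacheSketch(60_000);
            for (int i = 0; i < 3; i++) {
                if (cache.props(file).size != 5) {
                    throw new IllegalStateException("unexpected file size");
                }
            }
            return cache.filesystemReads.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temporary file
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(demo() + " filesystem read(s) for 3 lookups");
    }
}
```

The other three caches could follow the same shape with a different key and 
value type. The trade-off in this sketch is staleness: a file that changes or 
disappears within the TTL is still reported with its old properties, which 
may be acceptable for snapshot files because snapshots are not expected to 
change once taken.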



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

