[jira] [Updated] (CASSANDRASC-94) Reduce filesystem calls while streaming SSTables

Francisco Guerrero (Jira) Wed, 17 Apr 2024 17:00:42 -0700


     [ 
https://issues.apache.org/jira/browse/CASSANDRASC-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Francisco Guerrero updated CASSANDRASC-94:
------------------------------------------
    Description: 
When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will 
perform multiple filesystem calls:
- Traverse the data directories to determine the keyspace / table path
- Once found, determine whether the SSTable file exists under the snapshots directory
- Read the filesystem to obtain the file type and file size
- Read the requested range of the file and stream it

The number of filesystem calls is manageable when streaming a single SSTable, 
but when clients read multiple SSTables, for example during Cassandra 
Analytics bulk reads, hundreds to thousands of requests are performed, and 
every request repeats the filesystem calls above.
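
The per-request work listed above can be sketched as follows. This is a 
minimal illustration, not Sidecar's actual code: the class and method names 
are hypothetical, and the on-disk layout is simplified to 
<data dir>/<keyspace>/<table>/snapshots/<snapshot>/<component> (real table 
directory names may differ). Each numbered comment maps to one of the four 
filesystem steps.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an uncached range read; every call repeats all four steps.
final class UncachedSnapshotRead
{
    static byte[] readRange(List<Path> dataDirs, String keyspace, String table,
                            String snapshot, String component,
                            long offset, int length) throws IOException
    {
        for (Path dataDir : dataDirs)
        {
            // 1. Traverse the data directories looking for the keyspace/table path
            Path tableDir = dataDir.resolve(keyspace).resolve(table);
            if (!Files.isDirectory(tableDir))
                continue;

            // 2. Check that the component exists under the snapshots directory
            Path file = tableDir.resolve("snapshots").resolve(snapshot).resolve(component);
            if (!Files.exists(file))
                continue;

            // 3. Read the file type and size from the filesystem
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            if (!attrs.isRegularFile() || length < 0 || offset < 0 || offset >= attrs.size())
                throw new IOException("Invalid file or range for " + file);

            // 4. Read the requested range (clamped to the end of the file)
            int toRead = (int) Math.min(length, attrs.size() - offset);
            ByteBuffer buffer = ByteBuffer.allocate(toRead);
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ))
            {
                long position = offset;
                while (buffer.hasRemaining())
                {
                    int n = channel.read(buffer, position);
                    if (n < 0)
                        break;
                    position += n;
                }
            }
            return Arrays.copyOf(buffer.array(), buffer.position());
        }
        throw new NoSuchFileException(component);
    }
}
```

A bulk reader that requests many sub-ranges of the same component would call 
readRange once per sub-range, so steps 1 to 3 are repeated with identical 
results each time.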

This improvement proposes introducing two caches to reduce the number of 
system calls while streaming SSTables:

1. *Cache all data file locations*: This is cached once and it will not change 
during the lifecycle of the application. The values come from the Storage 
Service MBean {{getAllDataFileLocations}} method.
2. *Snapshot list cache*: maintains a cache of recently listed snapshot files 
under a snapshot directory. This cache avoids having to access the filesystem 
every time a bulk read client lists the snapshot directory. This is a 
short-lived cache and can be disabled if the snapshot list is expected to be 
large.
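
The two caches could look roughly like the sketch below. It is an 
illustration under stated assumptions rather than the proposed 
implementation: SnapshotCaches, its constructor parameters, and the 
"zero TTL disables the cache" convention are all hypothetical, the JMX call 
to {{getAllDataFileLocations}} is replaced by a plain Supplier, and only JDK 
classes are used instead of a caching library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the two caches described above.
final class SnapshotCaches
{
    private static final class Entry
    {
        final List<String> files;
        final long loadedAtNanos;

        Entry(List<String> files, long loadedAtNanos)
        {
            this.files = files;
            this.loadedAtNanos = loadedAtNanos;
        }
    }

    private final Supplier<List<String>> locationsLoader; // stand-in for the MBean call
    private final Duration ttl;                           // Duration.ZERO disables cache 2
    private final LongSupplier nanoClock;                 // injectable for testing
    private final Map<Path, Entry> snapshotLists = new ConcurrentHashMap<>();
    private volatile List<String> dataFileLocations;

    SnapshotCaches(Supplier<List<String>> locationsLoader, Duration ttl, LongSupplier nanoClock)
    {
        this.locationsLoader = locationsLoader;
        this.ttl = ttl;
        this.nanoClock = nanoClock;
    }

    // Cache 1: loaded on first use and never refreshed afterwards.
    List<String> dataFileLocations()
    {
        List<String> cached = dataFileLocations;
        if (cached == null)
        {
            synchronized (this)
            {
                cached = dataFileLocations;
                if (cached == null)
                {
                    cached = List.copyOf(locationsLoader.get());
                    dataFileLocations = cached;
                }
            }
        }
        return cached;
    }

    // Cache 2: short-lived listing cache; entries older than the TTL are reloaded.
    List<String> listSnapshot(Path snapshotDir)
    {
        if (ttl.isZero())
            return listFromDisk(snapshotDir);

        long now = nanoClock.getAsLong();
        Entry entry = snapshotLists.compute(snapshotDir, (dir, old) ->
            old != null && now - old.loadedAtNanos < ttl.toNanos()
            ? old
            : new Entry(listFromDisk(dir), now));
        return entry.files;
    }

    private static List<String> listFromDisk(Path dir)
    {
        try (Stream<Path> paths = Files.list(dir))
        {
            return paths.map(p -> p.getFileName().toString())
                        .sorted()
                        .collect(Collectors.toUnmodifiableList());
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a caller would construct the object with System::nanoTime as 
the clock. The listing is loaded inside the map's compute call for brevity, 
which serializes concurrent loads of the same directory; a production cache 
would likely also bound the number of entries.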


  was:
When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will 
perform multiple filesystem calls:
- Traverse the data directories to determine the keyspace / table path
- Once found, determine whether the SSTable file exists under the snapshots directory
- Read the filesystem to obtain the file type and file size
- Read the requested range of the file and stream it

The number of filesystem calls is manageable when streaming a single SSTable, 
but when clients read multiple SSTables, for example during Cassandra 
Analytics bulk reads, hundreds to thousands of requests are performed, and 
every request repeats the filesystem calls above.

This improvement proposes introducing several caches to reduce the number of 
system calls while streaming SSTables.

- *snapshot list cache*: maintains a cache of recently listed snapshot files 
under a snapshot directory. This cache avoids having to access the filesystem 
every time a bulk read client lists the snapshot directory.
- *table dir cache*: maintains a cache of recently streamed table directory 
paths. This cache avoids having to traverse the filesystem searching for the 
table directory, for example while running bulk reads. Since bulk reads can 
stream tens to hundreds of SSTable components from a snapshot directory, this 
cache avoids resolving the table directory each time.
- *snapshot path cache*: maintains a cache of recently streamed snapshot 
SSTable components. This cache avoids having to resolve the snapshot SSTable 
component path during bulk reads. Since bulk reads stream sub-ranges of an 
SSTable component, the resolution can happen multiple times during bulk reads 
for a single SSTable component.
- *file props cache*: maintains a cache of FileProps of recently streamed 
files. This cache avoids having to validate file properties repeatedly, for 
example during bulk reads where sub-ranges of an SSTable component are 
streamed and the file properties of the same file would otherwise be read 
multiple times.



> Reduce filesystem calls while streaming SSTables
> ------------------------------------------------
>
>                 Key: CASSANDRASC-94
>                 URL: https://issues.apache.org/jira/browse/CASSANDRASC-94
>             Project: Sidecar for Apache Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Francisco Guerrero
>            Assignee: Francisco Guerrero
>            Priority: Normal
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
