[
https://issues.apache.org/jira/browse/CASSANDRASC-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francisco Guerrero updated CASSANDRASC-94:
------------------------------------------
Description:
When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will
perform multiple filesystem calls:
- Traverse the data directories to determine the keyspace / table path
- Once found, determine whether the SSTable file exists under the snapshots directory
- Read the filesystem to obtain the file type and file size
- Read the requested range of the file and stream it
The number of filesystem calls is manageable when streaming a single SSTable,
but when one or more clients read multiple SSTables, for example in
Cassandra Analytics bulk reads, hundreds to thousands of requests are performed,
and every request must repeat the system calls above.
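As a rough illustration of the per-request work described above, the four steps can be sketched as follows. This is a minimal sketch and not Sidecar's actual code: the class {{UncachedSnapshotRead}}, the method {{readRange}} and its parameters are hypothetical, and the directory layout (data dir / keyspace / table-id / snapshots / snapshot name) is an assumption made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of one uncached request; not Sidecar's API. */
final class UncachedSnapshotRead {

    static byte[] readRange(List<Path> dataDirs, String keyspace, String table,
                            String snapshot, String component,
                            long offset, int length) throws IOException {
        for (Path dataDir : dataDirs) {
            // Step 1: traverse the data directories looking for the keyspace / table path.
            // Assumption: table directories are named "<table>-<id>".
            Path keyspaceDir = dataDir.resolve(keyspace);
            if (!Files.isDirectory(keyspaceDir)) {
                continue;
            }
            try (DirectoryStream<Path> tableDirs =
                     Files.newDirectoryStream(keyspaceDir, table + "-*")) {
                for (Path tableDir : tableDirs) {
                    // Step 2: check whether the SSTable file exists under the snapshots directory.
                    Path file = tableDir.resolve("snapshots").resolve(snapshot).resolve(component);
                    if (!Files.exists(file)) {
                        continue;
                    }
                    // Step 3: read the file type and size from the filesystem.
                    BasicFileAttributes attrs =
                        Files.readAttributes(file, BasicFileAttributes.class);
                    if (!attrs.isRegularFile() || offset < 0 || length < 0
                        || offset + length > attrs.size()) {
                        throw new IllegalArgumentException("Invalid file or range: " + file);
                    }
                    // Step 4: read the requested range.
                    return read(file, offset, length);
                }
            }
        }
        throw new NoSuchFileException(component);
    }

    private static byte[] read(Path file, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (buffer.hasRemaining()) {
                // Positional read; stop early if the file ends before the range does.
                if (channel.read(buffer, offset + buffer.position()) < 0) {
                    break;
                }
            }
        }
        return Arrays.copyOf(buffer.array(), buffer.position());
    }
}
```

Every call repeats steps 1 to 3 even when consecutive requests target sub-ranges of the same file, which is the cost this ticket aims to reduce.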
This improvement proposes introducing two caches to reduce the
number of system calls while streaming SSTables:
1. *Cache all data file locations*: This is cached once and it will not change
during the lifecycle of the application. The values come from the Storage
Service MBean {{getAllDataFileLocations}} method.
2. *snapshot list cache*: to maintain a cache of recently listed snapshot files
under a snapshot directory. This cache avoids having to access the filesystem
every time a bulk read client lists the snapshot directory. This is a
short-lived cache and can be disabled if the snapshot list is expected to be large.
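The two caches above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class {{SnapshotCaches}}, its methods, and the TTL-based expiry are hypothetical and not taken from the Sidecar code base. The {{Supplier}} stands in for the call to the Storage Service MBean {{getAllDataFileLocations}} method, and a zero TTL is used here to model the "disabled" setting.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the two proposed caches; names are illustrative only. */
final class SnapshotCaches {

    /** One cached directory listing plus the time it was loaded. */
    private static final class Listing {
        final List<Path> files;
        final long loadedAtNanos;

        Listing(List<Path> files, long loadedAtNanos) {
            this.files = files;
            this.loadedAtNanos = loadedAtNanos;
        }
    }

    private final Supplier<List<String>> locationLoader; // stand-in for the JMX call
    private volatile List<String> dataFileLocations;     // cache 1: loaded at most once
    private final ConcurrentHashMap<Path, Listing> listings = new ConcurrentHashMap<>();
    private final long ttlNanos;                         // cache 2: 0 means disabled

    SnapshotCaches(Supplier<List<String>> locationLoader, Duration snapshotListTtl) {
        this.locationLoader = locationLoader;
        this.ttlNanos = snapshotListTtl.toNanos();
    }

    /** Cache 1: data file locations, fetched once for the lifetime of this object. */
    List<String> dataFileLocations() {
        List<String> result = dataFileLocations;
        if (result == null) {
            synchronized (this) {
                result = dataFileLocations;
                if (result == null) {
                    result = List.copyOf(locationLoader.get());
                    dataFileLocations = result;
                }
            }
        }
        return result;
    }

    /** Cache 2: short-lived listing of a snapshot directory. */
    List<Path> listSnapshot(Path snapshotDir) {
        if (ttlNanos <= 0) {
            return readDirectory(snapshotDir); // cache disabled: always hit the filesystem
        }
        long now = System.nanoTime();
        // Reuse the cached listing while it is fresh, otherwise reload it.
        // Note: the reload runs inside compute(), so it blocks other updates to this key.
        Listing listing = listings.compute(snapshotDir, (dir, cached) ->
            cached != null && now - cached.loadedAtNanos < ttlNanos
                ? cached
                : new Listing(readDirectory(dir), now));
        return listing.files;
    }

    private static List<Path> readDirectory(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.sorted().collect(Collectors.toUnmodifiableList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, entries are only refreshed lazily on the next lookup after the TTL elapses and are never evicted, so a production cache would likely also bound its size, which matters when snapshot listings are large.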
was:
When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will
perform multiple filesystem calls:
- Traverse the data directories to determine the keyspace / table path
- Once found, determine whether the SSTable file exists under the snapshots directory
- Read the filesystem to obtain the file type and file size
- Read the requested range of the file and stream it
The number of filesystem calls is manageable when streaming a single SSTable,
but when one or more clients read multiple SSTables, for example in
Cassandra Analytics bulk reads, hundreds to thousands of requests are performed,
and every request must repeat the system calls above.
This improvement proposes introducing several caches to reduce the
number of system calls while streaming SSTables.
- *snapshot list cache*: to maintain a cache of recently listed snapshot files
under a snapshot directory. This cache avoids having to access the filesystem
every time a bulk read client lists the snapshot directory.
- *table dir cache*: to maintain a cache of recently streamed table directory
paths. This cache helps avoid traversing the filesystem in search of
the table directory, for example while running bulk reads. Since bulk reads
can stream tens to hundreds of SSTable components from a snapshot directory,
this cache helps avoid having to resolve the table directory each time.
- *snapshot path cache*: to maintain a cache of recently streamed snapshot
SSTable components. This cache avoids having to resolve the snapshot SSTable
component path during bulk reads. Since bulk reads stream sub-ranges of an
SSTable component, the resolution can happen multiple times during bulk reads
for a single SSTable component.
- *file props cache*: to maintain a cache of FileProps of recently streamed
files. This cache avoids having to validate file properties repeatedly during
bulk reads, where sub-ranges of an SSTable component are streamed and the
file properties would otherwise be read multiple times for the
same file.
> Reduce filesystem calls while streaming SSTables
> ------------------------------------------------
>
> Key: CASSANDRASC-94
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-94
> Project: Sidecar for Apache Cassandra
> Issue Type: Improvement
> Components: Configuration
> Reporter: Francisco Guerrero
> Assignee: Francisco Guerrero
> Priority: Normal
> Labels: pull-request-available
>
> When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will
> perform multiple filesystem calls:
> - Traverse the data directories to determine the keyspace / table path
> - Once found, determine whether the SSTable file exists under the snapshots
> directory
> - Read the filesystem to obtain the file type and file size
> - Read the requested range of the file and stream it
> The number of filesystem calls is manageable when streaming a single SSTable,
> but when one or more clients read multiple SSTables, for example in
> Cassandra Analytics bulk reads, hundreds to thousands of requests are performed,
> and every request must repeat the system calls above.
> This improvement proposes introducing two caches to reduce the
> number of system calls while streaming SSTables:
> 1. *Cache all data file locations*: This is cached once and it will not
> change during the lifecycle of the application. The values come from the
> Storage Service MBean {{getAllDataFileLocations}} method.
> 2. *snapshot list cache*: to maintain a cache of recently listed snapshot
> files under a snapshot directory. This cache avoids having to access the
> filesystem every time a bulk read client lists the snapshot directory. This is
> a short-lived cache and can be disabled if the snapshot list is expected to
> be large.