[
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873877#comment-17873877
]
Stefan Miklosovic edited comment on CASSANDRA-18111 at 8/15/24 10:04 AM:
-------------------------------------------------------------------------
I was investigating how long it takes to list snapshots in order to improve the
performance in this area.
The test creates 1 keyspace with 10 tables and takes 100 snapshots per table,
where each table consists of 10 SSTables. Hence, together we list 1000
snapshots. We run the listing 100 times and compute the average run time by
dividing the total time by 100 (100 rounds).
The baseline is (1). It lists snapshots by means of
"StorageService.instance.getSnapshotDetails(Map.of())", which is what is
currently in trunk and what is called when "nodetool listsnapshots" is executed.
The time per round was 742 milliseconds.
There are several areas where we spend a lot of time, so this figure can be
improved considerably. The biggest time sink is the repeated resolution of the
true disk space per snapshot. We identified these pain points:
1) I empirically verified that Files.walkFileTree performs rather poorly. There
is quite a speedup in using Files.list() instead; a snapshot directory is flat
when no secondary indices are used, so walking a tree does not make sense.
2) There is a performance penalty when testing whether a file exists.
3) There is a performance penalty when retrieving the size of a file.
When I replaced Files.walkFileTree with Files.list, it took 341 milliseconds on
average, which basically cut the time in half.
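A minimal sketch of the flat listing idea, with made-up names (this is not the actual Cassandra code): summing the sizes of a flat snapshot directory with Files.list() rather than walking it with Files.walkFileTree().
{code}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public final class FlatDirectorySize
{
    public static long sizeOf(Path snapshotDir) throws IOException
    {
        // Files.list() reads a single directory level, which is enough because a
        // snapshot directory is flat when no secondary indices are used.
        try (Stream<Path> entries = Files.list(snapshotDir))
        {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(p -> {
                              try
                              {
                                  return Files.size(p);
                              }
                              catch (IOException e)
                              {
                                  throw new UncheckedIOException(e);
                              }
                          })
                          .sum();
        }
    }
}
{code}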
However, we can do even better. To avoid checking, for every single snapshot
file, whether the corresponding file in the live data dir exists, we can list
all files in the live data dirs once, put them into a set, and then simply ask
whether the corresponding snapshot file is in that set, completely bypassing
Files.exists(), which also saves a ton of IO. This saves proportionally more
time the more snapshots we have.
The code which gathers all files in the hot data dirs is something like this:
{code}
// collect the paths of all files referenced by live SSTables, keyed per table
cfs.getTracker().getView().allKnownSSTables().forEach(s -> files.addAll(s.getAllFilePaths()));
cfsFiles.put(cfs.getKeyspaceName() + "." + cfs.name, files);
{code}
Then, for each snapshot:
{code}
for (TableSnapshot s : snapshots)
{
    TabularDataSupport data = (TabularDataSupport) snapshotMap.get(s.getTag());
    // pass the pre-computed set of live files so size resolution does not need Files.exists()
    SnapshotDetailsTabularData.from(s, data, cfsFiles.get(s.getKeyspaceName() + '.' + s.getTableName()));
}
{code}
We basically "inject" the set of live files into each "true snapshot size"
retrieval, so we skip calling Files.exists() altogether.
This solution takes around 128 milliseconds, which is roughly 5 times less than
what we currently have.
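Roughly, the size retrieval with the injected set looks like this (just a sketch; the names trueSizeOf and liveFileNames are made up and do not exist in the codebase):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;

public final class TrueSnapshotSize
{
    // Only snapshot files whose counterpart is no longer among the live files
    // contribute to the "true" size; files still shared with live data are
    // hardlinks and take no extra disk space.
    public static long trueSizeOf(Collection<Path> snapshotFiles, Set<String> liveFileNames)
    {
        long size = 0;
        for (Path file : snapshotFiles)
        {
            // membership in the pre-built set replaces a Files.exists() call on the live counterpart
            if (liveFileNames.contains(file.getFileName().toString()))
                continue;
            try
            {
                size += Files.size(file);
            }
            catch (IOException e)
            {
                // the snapshot file may have been removed concurrently; skip it
            }
        }
        return size;
    }
}
{code}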
Keep in mind that this speedup already includes the check, upon retrieval of
all snapshots from SnapshotManager, that a corresponding manifest file exists
for each snapshot; if it does not, the snapshot is removed. This preserves the
behaviour of returning a synchronized view of snapshots even though they are
cached.
I have not yet explored caching of sizes, which might make this even faster. My
idea was to create a small cache per listing request: if a file is included in
a true disk size computation, we could cache its size, so if another snapshot
also includes that file, we do not need to ask for its size again. It is
questionable how this would look in practice. As snapshots get older and more
SSTables are compacted, more files would be put into this cache but there would
be fewer hits, because each new snapshot would share fewer and fewer files with
the older snapshots, minimizing the benefit of caching.
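Such a per-request cache could look roughly like this (purely a sketch, the class and method names are made up):
{code}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public final class FileSizeCache
{
    // lives only for the duration of one listing request, so staleness is not a concern
    private final Map<Path, Long> sizes = new HashMap<>();

    // The first snapshot that includes a given file pays for Files.size();
    // every later snapshot sharing the same file gets the cached value.
    public long sizeOf(Path file)
    {
        return sizes.computeIfAbsent(file, p -> {
            try
            {
                return Files.size(p);
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
        });
    }
}
{code}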
There is also one improvement we can make in listsnapshots: there is a separate
call to get the global "trueSnapshotsSize", which is printed at the bottom of
nodetool listsnapshots. This basically computes everything all over again; it
just internally sums up the true disk sizes of each snapshot. The performance
optimisation here would consist of not calling this again but instead summing
up the true disk sizes of the snapshots we already got by calling
"probe.getSnapshotDetails". The true disk sizes are already there, we just need
to sum them up; there is no reason to call the heavyweight
probe.trueSnapshotsSize() again. However, for this to happen, we would need to
start returning longs instead of Strings from SnapshotDetailsTabularData for
"liveSize", because summing up Strings in nodeprobe is quite cumbersome.
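On the nodetool side this would then be a plain sum over the rows we already have. A rough sketch, assuming getSnapshotDetails keeps returning a map of snapshot tag to TabularData and that SnapshotDetailsTabularData gains a hypothetical long-typed column (called "TrueDiskSpaceUsedBytes" here only for illustration):
{code}
import javax.management.openmbean.CompositeData;
import javax.management.openmbean.TabularData;
import java.util.Map;

public final class SnapshotSizeSummer
{
    public static long totalTrueSize(Map<String, TabularData> snapshotDetails)
    {
        long total = 0;
        for (TabularData table : snapshotDetails.values())
        {
            for (Object row : table.values())
            {
                // hypothetical long-typed column instead of the current formatted String
                total += (Long) ((CompositeData) row).get("TrueDiskSpaceUsedBytes");
            }
        }
        return total;
    }
}
{code}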
For now, I will clean up the "most performant" branch I have.
(1) https://gist.github.com/smiklosovic/8573186a23be7df571a9e6d175548d91
> Centralize all snapshot operations to SnapshotManager and cache snapshots
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Snapshots
> Reporter: Paulo Motta
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 5.x
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are
> scanned to find snapshots, which is inefficient.
> For example, fetching the
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.