[
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873877#comment-17873877
]
Stefan Miklosovic edited comment on CASSANDRA-18111 at 8/15/24 10:04 AM:
-------------------------------------------------------------------------
I was investigating how long it takes to list snapshots in order to improve the
performance in this area.
The test creates 1 keyspace with 10 tables and takes 100 snapshots per table,
where each table consists of 10 SSTables. Hence, together we list 1000
snapshots. We run the listing 100 times and compute the average run time by
dividing the total time by 100 (100 rounds).
The baseline is (1). It lists snapshots by means of
"StorageService.instance.getSnapshotDetails(Map.of())", which is what is
currently in trunk and what is called when "nodetool listsnapshots" is executed.
The time per round was 742 milliseconds.
There are several areas where we spend a lot of time, so this figure can be
improved considerably. The biggest time sink is the repeated resolution of the
true disk space per snapshot. We identified these pain points:
1) I empirically verified that Files.walkFileTree performs rather poorly. There
is quite a speedup in using Files.list() instead; a snapshot directory is flat
when no secondary indices are used, so walking a tree does not make sense.
2) There is a performance penalty when testing whether a file exists.
3) There is a performance penalty when retrieving the size of a file.
When I replaced Files.walkFileTree with Files.list, it took 341 milliseconds on
average, which basically cut the time in half.
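A minimal sketch of the flat listing idea, with made-up names (this is not the actual Cassandra code): summing the sizes of a flat snapshot directory with Files.list() rather than walking it with Files.walkFileTree().
{code}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public final class FlatDirectorySize
{
    public static long sizeOf(Path snapshotDir) throws IOException
    {
        // Files.list() reads a single directory level, which is enough because a
        // snapshot directory is flat when no secondary indices are used.
        try (Stream<Path> entries = Files.list(snapshotDir))
        {
            return entries.filter(Files::isRegularFile)
                          .mapToLong(p -> {
                              try
                              {
                                  return Files.size(p);
                              }
                              catch (IOException e)
                              {
                                  throw new UncheckedIOException(e);
                              }
                          })
                          .sum();
        }
    }
}
{code}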
However, we can do even better. To avoid checking, for every single snapshot
file, whether the corresponding file in the live data dir exists, we can list
all files in the live data dirs once, put them into a set, and then simply ask
whether the corresponding snapshot file is in that set, completely bypassing
Files.exists(), which also saves a ton of IO. This saves proportionally more
time the more snapshots we have.
The code which gathers all files in the hot data dirs is something like this:
{code}
// collect the paths of all files referenced by live SSTables, keyed per table
cfs.getTracker().getView().allKnownSSTables().forEach(s -> files.addAll(s.getAllFilePaths()));
cfsFiles.put(cfs.getKeyspaceName() + "." + cfs.name, files);
{code}
Then, for each snapshot:
{code}
for (TableSnapshot s : snapshots)
{
    TabularDataSupport data = (TabularDataSupport) snapshotMap.get(s.getTag());
    // pass the pre-computed set of live files so size resolution does not need Files.exists()
    SnapshotDetailsTabularData.from(s, data, cfsFiles.get(s.getKeyspaceName() + '.' + s.getTableName()));
}
{code}
We basically "inject" the set of live files into each "true snapshot size"
retrieval, so we skip calling Files.exists() altogether.
This solution takes around 128 milliseconds, which is roughly 5 times less than
what we currently have.
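Roughly, the size retrieval with the injected set looks like this (just a sketch; the names trueSizeOf and liveFileNames are made up and do not exist in the codebase):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;

public final class TrueSnapshotSize
{
    // Only snapshot files whose counterpart is no longer among the live files
    // contribute to the "true" size; files still shared with live data are
    // hardlinks and take no extra disk space.
    public static long trueSizeOf(Collection<Path> snapshotFiles, Set<String> liveFileNames)
    {
        long size = 0;
        for (Path file : snapshotFiles)
        {
            // membership in the pre-built set replaces a Files.exists() call on the live counterpart
            if (liveFileNames.contains(file.getFileName().toString()))
                continue;
            try
            {
                size += Files.size(file);
            }
            catch (IOException e)
            {
                // the snapshot file may have been removed concurrently; skip it
            }
        }
        return size;
    }
}
{code}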
Keep in mind that this speedup already includes the check, upon retrieval of
all snapshots from SnapshotManager, that a corresponding manifest file exists
for each snapshot; if it does not, the snapshot is removed. This preserves the
behaviour of returning a synchronized view of snapshots even though they are
cached.
I have not yet explored caching of sizes, which might make this even faster. My
idea was to create a small cache per listing request: if a file is included in
a true disk size computation, we could cache its size, so if another snapshot
also includes that file, we do not need to ask for its size again. It is
questionable how this would look in practice. As snapshots get older and more
SSTables are compacted, more files would be put into this cache but there would
be fewer hits, because each new snapshot would share fewer and fewer files with
the older snapshots, minimizing the benefit of caching.
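Such a per-request cache could look roughly like this (purely a sketch, the class and method names are made up):
{code}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public final class FileSizeCache
{
    // lives only for the duration of one listing request, so staleness is not a concern
    private final Map<Path, Long> sizes = new HashMap<>();

    // The first snapshot that includes a given file pays for Files.size();
    // every later snapshot sharing the same file gets the cached value.
    public long sizeOf(Path file)
    {
        return sizes.computeIfAbsent(file, p -> {
            try
            {
                return Files.size(p);
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
        });
    }
}
{code}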
There is also one improvement we can make in listsnapshots: there is a separate
call to get the global "trueSnapshotsSize", which is printed at the bottom of
nodetool listsnapshots. This basically computes everything all over again; it
just internally sums up the true disk sizes of each snapshot. The performance
optimisation here would consist of not calling this again but instead summing
up the true disk sizes of the snapshots we already got by calling
"probe.getSnapshotDetails". The true disk sizes are already there, we just need
to sum them up; there is no reason to call the heavyweight
probe.trueSnapshotsSize() again. However, for this to happen, we would need to
start returning longs instead of Strings from SnapshotDetailsTabularData for
"liveSize", because summing up Strings in nodeprobe is quite cumbersome.
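On the nodetool side this would then be a plain sum over the rows we already have. A rough sketch, assuming getSnapshotDetails keeps returning a map of snapshot tag to TabularData and that SnapshotDetailsTabularData gains a hypothetical long-typed column (called "TrueDiskSpaceUsedBytes" here only for illustration):
{code}
import javax.management.openmbean.CompositeData;
import javax.management.openmbean.TabularData;
import java.util.Map;

public final class SnapshotSizeSummer
{
    public static long totalTrueSize(Map<String, TabularData> snapshotDetails)
    {
        long total = 0;
        for (TabularData table : snapshotDetails.values())
        {
            for (Object row : table.values())
            {
                // hypothetical long-typed column instead of the current formatted String
                total += (Long) ((CompositeData) row).get("TrueDiskSpaceUsedBytes");
            }
        }
        return total;
    }
}
{code}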
For now, I will clean up the "most performant" branch I have.
(1) https://gist.github.com/smiklosovic/8573186a23be7df571a9e6d175548d91
> Centralize all snapshot operations to SnapshotManager and cache snapshots
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Snapshots
> Reporter: Paulo Motta
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 5.x
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are
> scanned to find snapshots, which is inefficient.
> For example, fetching the
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.