[
https://issues.apache.org/jira/browse/CASSANDRA-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035360#comment-18035360
]
Stefan Miklosovic edited comment on CASSANDRA-20250 at 11/4/25 3:16 PM:
------------------------------------------------------------------------
Chiming in as I was the author of CASSANDRA-18111.
There are two places where computation on snapshots might go to disk.
The first place is {{TableSnapshot.computeSizeOnDiskBytes()}}. That is called either when querying the virtual table ({{system_views.snapshots}}) or when listing snapshots via nodetool / JMX. However, this is computed only once: the value is cached, so after the first computation it never goes to disk again. And it is not used in metrics anyway.
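For illustration, a minimal sketch of that compute-once caching behaviour, with hypothetical names (this is not the actual {{TableSnapshot}} code):
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustrative only: caches the size-on-disk after the first computation,
// so repeated listings never walk the snapshot directory again.
class SnapshotSizeCache
{
    private final Path snapshotDir;
    private volatile Long sizeOnDiskBytes; // computed lazily, at most once per snapshot

    SnapshotSizeCache(Path snapshotDir)
    {
        this.snapshotDir = snapshotDir;
    }

    long sizeOnDiskBytes()
    {
        Long cached = sizeOnDiskBytes;
        if (cached != null)
            return cached; // subsequent calls do not touch the disk

        long computed = walkAndSum(snapshotDir); // the single disk walk
        sizeOnDiskBytes = computed;
        return computed;
    }

    private static long walkAndSum(Path dir)
    {
        try (Stream<Path> files = Files.walk(dir))
        {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
{code}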
Secondly, there is {{TableSnapshot.computeTrueSizeBytes()}}. Before CASSANDRA-18111 this really did go to disk every time and it was slow, but after the rewrite (1) it works as follows:
1) it does not resolve the manifest or schema file sizes anymore; these are all cached / computed just once when a snapshot is created / loaded
2) then it lists the snapshot files - yes, here we go to disk
3) then we iterate over that list and only go to disk to resolve the size of a particular snapshot file when that file is not among the results of {{getLiveFileFromSnapshotFile}} (see the sketch below).
The logic behind the "true snapshot size" is that if you have 5 SSTables in a table and 5 SSTables in a snapshot and they are the same (a snapshot file is a hardlink - what is in the snapshot is in the data dir), then the true snapshot size is ... 0, so we do not need to go to disk for that. But if a snapshot contains 5 SSTables and we have only 3 SSTables in the live data dir, then we need to go to disk and get the sizes of the two remaining SSTables - that will be the "true size" of the snapshot.
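A minimal sketch of that logic, assuming illustrative names rather than the actual implementation:
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: sums sizes of snapshot files that are no longer
// present among live SSTables; hardlinked files still live in the data
// dir contribute nothing to the "true size".
class TrueSizeSketch
{
    /**
     * @param snapshotFiles           files listed from the snapshot directory (one disk listing)
     * @param getLiveFileFromSnapshot maps a snapshot file to its live counterpart, if any
     */
    static long trueSizeBytes(List<Path> snapshotFiles,
                              Function<Path, Optional<Path>> getLiveFileFromSnapshot)
    {
        long trueSize = 0;
        for (Path snapshotFile : snapshotFiles)
        {
            // The live counterpart exists: the snapshot file is a hardlink to it,
            // so it occupies no extra space and contributes 0 bytes.
            if (getLiveFileFromSnapshot.apply(snapshotFile).map(Files::exists).orElse(false))
                continue;

            // Only files no longer present in the live data dir require a disk
            // call to resolve their size.
            trueSize += snapshotFile.toFile().length();
        }
        return trueSize;
    }
}
{code}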
I do not think {{computeTrueSizeBytes}} can be much more efficient than that. Before the rewrite, the logic was way more involved and complicated and produced a lot of "garbage" as a byproduct of resolving the true size. So if anything, in trunk you should see a significant speedup at least. There is a perf test I was conducting as part of (2), also linked here, to compare performance before / after, so we should be in a way better position right now even without caching.
[^Average_Time_vs_Threads_Combined_snapshot_listing.png]
[^Average_Time_vs_Threads_Combined_true_snapshot_size.png]
[^Percentiles_vs_Threads_Combined_snapshot_listing.png]
[^Percentiles_vs_Threads_Combined_true_snapshot_size.png]
[^Throughput_vs_Threads_Combined_snapshot_listing.png]
[^Throughput_vs_Threads_Combined _true_snapshot_size.png]
(1) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/snapshot/TableSnapshot.java#L243-L263]
(2) https://issues.apache.org/jira/browse/CASSANDRA-13338
> Optimize Counter, Meter and Histogram metrics using thread local counters
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-20250
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20250
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Observability/Metrics
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.1
>
> Attachments: 5.1_profile_cpu.html,
> 5.1_profile_cpu_without_metrics.html, 5.1_tl4_profile_cpu.html,
> CASSANDRA-20250_ci_summary.html, CASSANDRA-20250_results_details.tar.xz,
> Histogram_AtomicLong.png, async_profiler_cpu_profiles.zip,
> cas_reverse_graph_metrics.png, cpu_profile_insert.html,
> image-2025-02-18-23-22-19-983.png, jmh-result.json, vmstat.log,
> vmstat_without_metrics.log
>
> Time Spent: 11h 50m
> Remaining Estimate: 0h
>
> Cassandra collects a lot of metrics, and many of them are collected per
> table, so the number of metric instances is multiplied by the number of
> tables. On one side this gives better observability; on the other side
> metrics are not free, there is an overhead associated with them:
> 1) CPU overhead: in the case of a simple CPU-bound load I already see about
> 5.5% of total CPU spent on metrics in CPU flamegraphs for a read load and 11%
> for a write load.
> Example: [^cpu_profile_insert.html] (search for the "codahale" pattern). The
> flamegraph was captured using an async-profiler build:
> async-profiler-3.0-29ee888-linux-x64
> 2) memory overhead: we spend memory on the entities used to aggregate
> metrics, such as LongAdders and reservoirs, plus on MBeans (String
> concatenation within object names is a major cause of this; for each
> table + metric name combination a new String is created).
> LongAdder is used by the Dropwizard Counter, Meter and Histogram metrics for
> counting purposes. It has a severe memory overhead, and while it scales
> better than AtomicLong, we still have to pay some cost for the concurrent
> operations. Additionally, in the case of Meter we have non-optimal behaviour
> where we count the same things several times.
> The idea (suggested by [~benedict]) is to switch to thread-local counters
> which we can store in a common thread-local array to reduce memory overhead.
> In this way we can avoid concurrent update overheads/contention and reduce
> the memory footprint as well.
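A minimal sketch of the thread-local counter idea, under my own assumptions (a single counter rather than the common thread-local array mentioned above; names are illustrative, this is not the actual patch):
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: each thread owns its own slot, so increments are
// contention-free; reads sum over all registered slots, like LongAdder.sum().
class ThreadLocalCounter
{
    // One slot per thread; padding against false sharing is omitted here,
    // and slots are never removed when threads exit - a real implementation
    // would have to handle both.
    private static final class Slot { volatile long value; }

    private final List<Slot> slots = new CopyOnWriteArrayList<>();
    private final ThreadLocal<Slot> local = ThreadLocal.withInitial(() -> {
        Slot slot = new Slot();
        slots.add(slot);
        return slot;
    });

    void inc(long n)
    {
        // No CAS and no contention: only the owning thread ever writes its slot.
        local.get().value += n;
    }

    long sum()
    {
        long total = 0;
        for (Slot slot : slots)
            total += slot.value; // approximate under concurrent updates, as with LongAdder
        return total;
    }
}
{code}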