[
https://issues.apache.org/jira/browse/CASSANDRA-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035360#comment-18035360
]
Stefan Miklosovic edited comment on CASSANDRA-20250 at 11/4/25 3:16 PM:
------------------------------------------------------------------------
Chiming in as I was the author of CASSANDRA-18111.
There are two places where computation on snapshots might go to disk.
The first place is {{TableSnapshot.computeSizeOnDiskBytes()}}. That is called either when querying the virtual table ({{system_views.snapshots}}) or when listing snapshots via nodetool / JMX. However, this is computed only once: the value is cached, so after the first computation it never goes to disk again. And it is not used in metrics anyway.
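For illustration, a minimal sketch of that compute-once caching behaviour, with hypothetical names (this is not the actual {{TableSnapshot}} code):
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustrative only: caches the size-on-disk after the first computation,
// so repeated listings never walk the snapshot directory again.
class SnapshotSizeCache
{
    private final Path snapshotDir;
    private volatile Long sizeOnDiskBytes; // computed lazily, at most once per snapshot

    SnapshotSizeCache(Path snapshotDir)
    {
        this.snapshotDir = snapshotDir;
    }

    long sizeOnDiskBytes()
    {
        Long cached = sizeOnDiskBytes;
        if (cached != null)
            return cached; // subsequent calls do not touch the disk

        long computed = walkAndSum(snapshotDir); // the single disk walk
        sizeOnDiskBytes = computed;
        return computed;
    }

    private static long walkAndSum(Path dir)
    {
        try (Stream<Path> files = Files.walk(dir))
        {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
{code}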
Secondly, there is {{TableSnapshot.computeTrueSizeBytes()}}. Before CASSANDRA-18111 this really did go to disk every time and it was slow, but after the rewrite (1) it works as follows:
1) it does not resolve the manifest or schema file sizes anymore; these are all cached / computed just once when a snapshot is created / loaded
2) then it lists the snapshot files - yes, here we go to disk
3) then we iterate over that list and only go to disk to resolve the size of a particular snapshot file when that file is not among the results of {{getLiveFileFromSnapshotFile}} (see the sketch below).
The logic behind the "true snapshot size" is that if you have 5 SSTables in a table and 5 SSTables in a snapshot and they are the same (a snapshot file is a hardlink - what is in the snapshot is in the data dir), then the true snapshot size is ... 0, so we do not need to go to disk for that. But if a snapshot contains 5 SSTables and we have only 3 SSTables in the live data dir, then we need to go to disk and get the sizes of the two remaining SSTables - that will be the "true size" of the snapshot.
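A minimal sketch of that logic, assuming illustrative names rather than the actual implementation:
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: sums sizes of snapshot files that are no longer
// present among live SSTables; hardlinked files still live in the data
// dir contribute nothing to the "true size".
class TrueSizeSketch
{
    /**
     * @param snapshotFiles           files listed from the snapshot directory (one disk listing)
     * @param getLiveFileFromSnapshot maps a snapshot file to its live counterpart, if any
     */
    static long trueSizeBytes(List<Path> snapshotFiles,
                              Function<Path, Optional<Path>> getLiveFileFromSnapshot)
    {
        long trueSize = 0;
        for (Path snapshotFile : snapshotFiles)
        {
            // The live counterpart exists: the snapshot file is a hardlink to it,
            // so it occupies no extra space and contributes 0 bytes.
            if (getLiveFileFromSnapshot.apply(snapshotFile).map(Files::exists).orElse(false))
                continue;

            // Only files no longer present in the live data dir require a disk
            // call to resolve their size.
            trueSize += snapshotFile.toFile().length();
        }
        return trueSize;
    }
}
{code}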
I do not think {{computeTrueSizeBytes}} can be much more efficient than that. Before the rewrite, the logic was way more involved and complicated and produced a lot of "garbage" as a byproduct of resolving the true size. So if anything, in trunk you should see a significant speedup at least. There is a perf test I was conducting as part of (2), also linked here, to compare performance before / after, so we should be in a way better position right now even without caching.
[^Average_Time_vs_Threads_Combined_snapshot_listing.png]
[^Average_Time_vs_Threads_Combined_true_snapshot_size.png]
[^Percentiles_vs_Threads_Combined_snapshot_listing.png]
[^Percentiles_vs_Threads_Combined_true_snapshot_size.png]
[^Throughput_vs_Threads_Combined_snapshot_listing.png]
[^Throughput_vs_Threads_Combined _true_snapshot_size.png]
(1) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/snapshot/TableSnapshot.java#L243-L263]
(2) https://issues.apache.org/jira/browse/CASSANDRA-13338
> Optimize Counter, Meter and Histogram metrics using thread local counters
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-20250
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20250
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Observability/Metrics
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.1
>
> Attachments: 5.1_profile_cpu.html,
> 5.1_profile_cpu_without_metrics.html, 5.1_tl4_profile_cpu.html,
> CASSANDRA-20250_ci_summary.html, CASSANDRA-20250_results_details.tar.xz,
> Histogram_AtomicLong.png, async_profiler_cpu_profiles.zip,
> cas_reverse_graph_metrics.png, cpu_profile_insert.html,
> image-2025-02-18-23-22-19-983.png, jmh-result.json, vmstat.log,
> vmstat_without_metrics.log
>
> Time Spent: 11h 50m
> Remaining Estimate: 0h
>
> Cassandra collects a lot of metrics, and many of them are collected per
> table, so the number of metric instances is multiplied by the number of
> tables. On one side this gives better observability; on the other side
> metrics are not free, there is an overhead associated with them:
> 1) CPU overhead: in the case of a simple CPU-bound load I already see about
> 5.5% of total CPU spent on metrics in CPU flamegraphs for a read load and 11%
> for a write load.
> Example: [^cpu_profile_insert.html] (search for the "codahale" pattern). The
> flamegraph was captured using an async-profiler build:
> async-profiler-3.0-29ee888-linux-x64
> 2) memory overhead: we spend memory on the entities used to aggregate
> metrics, such as LongAdders and reservoirs, plus on MBeans (String
> concatenation within object names is a major cause of this; for each
> table + metric name combination a new String is created).
> LongAdder is used by the Dropwizard Counter, Meter and Histogram metrics for
> counting purposes. It has a severe memory overhead, and while it scales
> better than AtomicLong, we still have to pay some cost for the concurrent
> operations. Additionally, in the case of Meter we have non-optimal behaviour
> where we count the same things several times.
> The idea (suggested by [~benedict]) is to switch to thread-local counters
> which we can store in a common thread-local array to reduce memory overhead.
> In this way we can avoid concurrent update overheads/contention and reduce
> the memory footprint as well.
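A minimal sketch of the thread-local counter idea, under my own assumptions (a single counter rather than the common thread-local array mentioned above; names are illustrative, this is not the actual patch):
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: each thread owns its own slot, so increments are
// contention-free; reads sum over all registered slots, like LongAdder.sum().
class ThreadLocalCounter
{
    // One slot per thread; padding against false sharing is omitted here,
    // and slots are never removed when threads exit - a real implementation
    // would have to handle both.
    private static final class Slot { volatile long value; }

    private final List<Slot> slots = new CopyOnWriteArrayList<>();
    private final ThreadLocal<Slot> local = ThreadLocal.withInitial(() -> {
        Slot slot = new Slot();
        slots.add(slot);
        return slot;
    });

    void inc(long n)
    {
        // No CAS and no contention: only the owning thread ever writes its slot.
        local.get().value += n;
    }

    long sum()
    {
        long total = 0;
        for (Slot slot : slots)
            total += slot.value; // approximate under concurrent updates, as with LongAdder
        return total;
    }
}
{code}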