[ 
https://issues.apache.org/jira/browse/CASSANDRA-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930669#comment-17930669
 ] 

Maxim Muzafarov commented on CASSANDRA-20250:
---------------------------------------------

I agree with Benedict and have said the same thing to Dmitry privately. I think 
it's better to introduce a new metrics API, deprecate Codehale, and use it only 
as an adapter to maintain backward compatibility. I'm not sure how feasible it 
would be in terms of memory efficiency to use proxy classes or just a 
reflection to set a proper value when it is actually retrieved, but we can try 
to measure it.

Assuming that most users do not use Codehale metric reporters or listeners at 
all, we could also introduce a new API in a disabled mode by default, and if a 
user wants to use more efficient metrics (I think they are still compatible 
with JMX), they could enable via the config.

> Optimize Counter, Meter and Histogram metrics using thread local counters
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20250
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20250
>             Project: Apache Cassandra
>          Issue Type: New Feature
>          Components: Observability/Metrics
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 5.1_profile_cpu.html, 
> 5.1_profile_cpu_without_metrics.html, 5.1_tl4_profile_cpu.html, 
> Histogram_AtomicLong.png, async_profiler_cpu_profiles.zip, 
> cpu_profile_insert.html, image-2025-02-18-23-22-19-983.png, jmh-result.json, 
> vmstat.log, vmstat_without_metrics.log
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Cassandra has a lot of metrics collected, many of them are collected per 
> table, so their instance number is multiplied by number of tables. From one 
> side it gives a better observability, from another side metrics are not for 
> free, there is an overhead associated with them:
> 1) CPU overhead: in case of simple CPU bound load: I already see like 5.5% of 
> total CPU spent for metrics in cpu framegraphs for read load and 11% for 
> write load. 
> Example: [^cpu_profile_insert.html] (search by "codahale" pattern). The 
> framegraph is captured using Async profiler build: 
> async-profiler-3.0-29ee888-linux-x64
> 2) memory overhead: we spend memory for entities used to aggregate metrics 
> such as LongAdders and reservoirs + for MBeans (String concatenation within 
> object names is a major cause of it, for each table+metric name combination a 
> new String is created)
> LongAdder is used by Dropwizard Counter/Meter and Histogram metrics for 
> counting purposes. It has severe memory overhead + while has a better scaling 
> than AtomicLong we still have to pay some cost for the concurrent operations. 
> Additionally, in case of Meter - we have a non-optimal behaviour when we 
> count the same things several times.
> The idea (suggested by [~benedict]) is to switch to thread-local counters 
> which we can store in a common thread-local array to reduce memory overhead. 
> In this way we can avoid concurrent update overheads/contentions and to 
> reduce memory footprint as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to