Hi Jon

>>  Is there a specific workload you're running where you're seeing it take
up a significant % of CPU time?  Could you share some metrics, profile
data, or a workload so I can try to reproduce your findings?
Yes, I have shared the workload generation command (sorry, it is for
cassandra-stress, I have not yet adopted your tool but want to do so soon
:-) ), the setup details and an async-profiler CPU profile in CASSANDRA-20250
<https://issues.apache.org/jira/browse/CASSANDRA-20250>
A summary:

   - it is a plain insert-only workload to assess the max throughput capacity
   of a single node: ./tools/bin/cassandra-stress "write n=10m" -rate
   threads=100 -node myhost
   - a small amount of data is inserted per row and local SSD disks are used,
   so CPU is the primary bottleneck in this scenario (while this workload is
   quite synthetic, CPU is the primary bottleneck in my real business cases
   as well)
   - I used the 5.1 trunk version (I got similar results for the 5.0 version
   while checking CASSANDRA-20165
   <https://issues.apache.org/jira/browse/CASSANDRA-20165>)
   - I enabled trie memtables + offheap objects mode
   - I disabled compaction
   - a recent nightly build of async-profiler is used
   - my hardware is quite old: on-premise VM, Linux 4.18.0-240.el8.x86_64,
   OpenJDK 11.0.26+4, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 16 cores
   - link to the CPU profile:
   <https://issues.apache.org/jira/secure/attachment/13074588/13074588_5.1_profile_cpu.html>
   ("codahale" code: 8.65%)
   - the -XX:+DebugNonSafepoints option is enabled to improve profile
   precision


On Wed, 5 Mar 2025 at 12:38, Benedict Elliott Smith <bened...@apache.org>
wrote:

> Some quick thoughts of my own…
>
> === Performance ===
> - I have seen heap dumps with > 1GiB dedicated to metric counters. This
> patch should reduce that, while opening up room to cut it further, steeply.
> - The performance improvement in relative terms for the metrics being
> replaced is rather dramatic - about 80%. We can also improve this further.
> - Cheaper metrics (in terms of both CPU and memory) mean we can readily
> have more of them, exposing finer-grained details. It is hard to overstate
> the value of this.
>
> === Reporting ===
> - We’re already non-standard for our most important metrics, because we
> had to replace the Codahale histogram years ago
> - We can continue implementing the Codahale interfaces, so that exporting
> libraries have minimal work to support us
> - We can probably push patches upstream to a couple of selected libraries
> we consider important
> - I would anyway also support picking a new reporting framework, but I
> would like us to do this with great care to avoid repeating our mistakes.
> I won't have cycles to actually implement this, so it would be down to
> others to decide if they are willing to undertake this work
>
> I think the fallback option for now, however, is to abuse unsafe to allow
> us to override the implementation details of Codahale metrics. So we can
> decouple the performance discussion for now from the deprecation
> discussion, but I think we should have a target of deprecating
> Codahale/DropWizard for the reasons Dmitry outlines, however we decide to
> do it.
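>
> (For illustration only - not necessarily what the patch does - one shape
> such an unsafe trick could take, assuming Counter's methods stay
> overridable: subclass Counter, allocate it via Unsafe so the superclass
> constructor never runs and its LongAdder is never created, and back the
> methods with our own state. The class and method names below are made up.
> A MetricRegistry would still accept such an object, since it still passes
> the instanceof Counter checks.)
>
>     import com.codahale.metrics.Counter;
>     import sun.misc.Unsafe;
>
>     import java.lang.reflect.Field;
>     import java.util.concurrent.atomic.AtomicLong;
>
>     public class UnsafeCounter extends Counter
>     {
>         private AtomicLong value;         // our own backing state (a stand-in; could be anything cheaper)
>
>         private UnsafeCounter() {}        // never called; instances come from create()
>
>         @Override public void inc()       { inc(1); }
>         @Override public void inc(long n) { value.addAndGet(n); }
>         @Override public void dec()       { dec(1); }
>         @Override public void dec(long n) { value.addAndGet(-n); }
>         @Override public long getCount()  { return value.get(); }
>
>         public static UnsafeCounter create() throws Exception
>         {
>             Field f = Unsafe.class.getDeclaredField("theUnsafe");
>             f.setAccessible(true);
>             Unsafe unsafe = (Unsafe) f.get(null);
>             // No constructor runs here, so Counter's LongAdder is never allocated.
>             UnsafeCounter counter = (UnsafeCounter) unsafe.allocateInstance(UnsafeCounter.class);
>             counter.value = new AtomicLong();   // initialise our own state by hand
>             return counter;
>         }
>     }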
>
> On 4 Mar 2025, at 21:17, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> I've got a few thoughts...
>
> On the performance side, I took a look at a few CPU profiles from past
> benchmarks and I'm seeing DropWizard taking ~ 3% of CPU time.  Is there a
> specific workload you're running where you're seeing it take up a
> significant % of CPU time?  Could you share some metrics, profile data, or
> a workload so I can try to reproduce your findings?  In my testing I've
> found the majority of the overhead from metrics to come from JMX, not
> DropWizard.
>
> On the operator side, inventing our own metrics lib risks making it
> harder to instrument Cassandra.  There are libraries out there that allow
> you to tap into DropWizard metrics directly.  For example, Sarma Pydipally
> did a presentation on this last year [1] based on some code I threw
> together.
>
> If you're planning on making it easier to instrument C* by supporting
> sending metrics to the OTel collector [2], then I could see the change
> being a net win as long as the perf is no worse than the status quo.
>
> It's hard to know the full extent of what you're planning and the impact,
> so I'll save any opinions till I know more about the plan.
>
> Thanks for bringing this up!
> Jon
>
> [1]
> https://planetcassandra.org/leaf/apache-cassandra-lunch-62-grafana-dashboard-for-apache-cassandra-business-platform-team/
> [2] https://opentelemetry.io/docs/collector/
>
> On Tue, Mar 4, 2025 at 12:40 PM Dmitry Konstantinov <netud...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> After a long conversation with Benedict and Maxim in CASSANDRA-20250
>> <https://issues.apache.org/jira/browse/CASSANDRA-20250> I would like to
>> raise and discuss a proposal to deprecate Dropwizard/Codahale metrics usage
>> in the next major release of the Cassandra server and drop it in the
>> following major release.
>> In its place, our own Java API and implementation would be introduced.
>> For the next major release the Dropwizard/Codahale API is still planned to
>> be supported, by extending the Codahale implementations, to give potential
>> users of this API enough time to transition.
>> The proposal does not affect the JMX API for metrics; it only concerns
>> local Java API changes within the Cassandra server classpath, i.e. it is
>> about the cases when somebody outside of the Cassandra server code relies
>> on the Codahale API in some kind of extension or agent.
>>
>> Reasons:
>> 1) The Codahale metrics implementation is not very efficient from a CPU
>> and memory usage point of view. In the past we already replaced the default
>> Codahale Reservoir implementation with our own custom one, and now in
>> CASSANDRA-20250 <https://issues.apache.org/jira/browse/CASSANDRA-20250> we
>> (Benedict and I) want to add a more efficient implementation of the Counter
>> and Meter logic. So, in total, not much logic is left from the original
>> library (mostly the MetricRegistry as a container for metrics) and the
>> majority of the logic is implemented by ourselves.
>> We use metrics a lot along the read and write paths and they contribute a
>> visible overhead (for example, for a plain write load it is about 9-11%
>> according to the async-profiler CPU profile), so we want them to be highly
>> optimized.
>> From a memory perspective, Counter and Meter are built on top of LongAdder
>> and they are quite heavy for the number of instances we create and use.
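>> As a rough illustration of that weight (a sketch, not a benchmark: the
>> numbers depend on the JVM and on contention), as of metrics 4.x each Meter
>> carries a LongAdder for its count plus moving-average rate trackers with
>> their own LongAdders, so something like the following made-up program can
>> be used to eyeball the per-instance cost:
>>
>>     // Illustrative only: estimate the per-instance footprint of Codahale Meters.
>>     import com.codahale.metrics.Meter;
>>
>>     public class MeterFootprint
>>     {
>>         public static void main(String[] args)
>>         {
>>             Runtime rt = Runtime.getRuntime();
>>             System.gc();
>>             long before = rt.totalMemory() - rt.freeMemory();
>>
>>             int n = 100_000;                // Cassandra registers metrics per keyspace/table/scope
>>             Meter[] meters = new Meter[n];
>>             for (int i = 0; i < n; i++)
>>             {
>>                 meters[i] = new Meter();
>>                 meters[i].mark();           // exercise each meter once
>>             }
>>
>>             System.gc();
>>             long after = rt.totalMemory() - rt.freeMemory();
>>             System.out.printf("%d meters, ~%d bytes each (very rough)%n",
>>                               meters.length, (after - before) / n);
>>         }
>>     }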
>>
>> 2) Codahale metrics do not provide any way to replace the Counter and Meter
>> implementations. There are no fully functional interfaces for these
>> entities, and MetricRegistry has casts/checks against the concrete
>> implementations, so it cannot work with anything else.
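>> For illustration, a small sketch of the gap (based on the metrics 4.x API
>> as I understand it; CheapCounter and NoPluggableCounter are made-up names,
>> not code from the patch):
>>
>>     import com.codahale.metrics.Counting;
>>     import com.codahale.metrics.Metric;
>>     import com.codahale.metrics.MetricRegistry;
>>
>>     public class NoPluggableCounter
>>     {
>>         // A hypothetical cheaper counter: it can implement the Counting/Metric interfaces...
>>         static final class CheapCounter implements Metric, Counting
>>         {
>>             private long value;                        // illustrative only, not thread-safe
>>             void inc() { value++; }
>>             @Override public long getCount() { return value; }
>>         }
>>
>>         public static void main(String[] args)
>>         {
>>             MetricRegistry registry = new MetricRegistry();
>>             registry.register("writes", new CheapCounter());
>>
>>             // ...but consumers look metrics up by the concrete Counter class,
>>             // so the replacement is simply not visible as a Counter:
>>             System.out.println(registry.getCounters().containsKey("writes"));  // prints false
>>         }
>>     }
>>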
>> I looked through the already reported issues and found the following
>> similar and unsuccessful attempt to introduce interfaces for metrics:
>> https://github.com/dropwizard/metrics/issues/2186
>> as well as other older attempts:
>> https://github.com/dropwizard/metrics/issues/252
>> https://github.com/dropwizard/metrics/issues/264
>> https://github.com/dropwizard/metrics/issues/703
>> https://github.com/dropwizard/metrics/pull/487
>> https://github.com/dropwizard/metrics/issues/479
>> https://github.com/dropwizard/metrics/issues/253
>>
>> So, the option of requesting extensibility from Codahale metrics does not
>> look realistic.
>>
>> 3) It looks like the library is in maintenance mode now: the 5.x version is
>> on hold and many integrations are also not very alive.
>> The main benefit of using Codahale metrics should be the large number of
>> reporters/integrations, but if we carefully check the list of reporters
>> mentioned here:
>> https://metrics.dropwizard.io/4.2.0/manual/third-party.html#reporters
>> we can see that almost all of them are dead/archived.
>>
>> 4) In general, exposing 3rd-party libraries as our own public API
>> frequently creates too many limitations and issues (Guava is another
>> typical example I have seen before: it is easy to start with, but later
>> you struggle more and more).
>>
>> Does anyone have any questions or concerns regarding this suggestion?
>> --
>> Dmitry Konstantinov
>>
>
>

-- 
Dmitry Konstantinov
