Hi Jon,

>> Is there a specific workload you're running where you're seeing it take up a significant % of CPU time? Could you share some metrics, profile data, or a workload so I can try to reproduce your findings?

Yes, I have shared the workload generation command (sorry, it is in cassandra-stress, I have not yet adopted your tool but want to do so soon :-) ), the setup details and the async-profiler CPU profile in CASSANDRA-20250 <https://issues.apache.org/jira/browse/CASSANDRA-20250>.

A summary:
- it is a plain insert-only workload to assess the max throughput capacity of a single node: ./tools/bin/cassandra-stress "write n=10m" -rate threads=100 -node myhost
- a small amount of data is inserted per row and local SSD disks are used, so CPU is the primary bottleneck in this scenario (and while it is quite synthetic, in my real business cases CPU is the primary bottleneck as well)
- I used the 5.1 trunk version (I have similar results for the 5.0 version from when I was checking CASSANDRA-20165 <https://issues.apache.org/jira/browse/CASSANDRA-20165>)
- I enabled trie memtables + offheap objects mode
- I disabled compaction
- a recent nightly build of async-profiler is used
- my hardware is quite old: on-premise VM, Linux 4.18.0-240.el8.x86_64, OpenJDK 11.0.26+4, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 16 cores
- link to the CPU profile <https://issues.apache.org/jira/secure/attachment/13074588/13074588_5.1_profile_cpu.html> ("codahale" code: 8.65%)
- the -XX:+DebugNonSafepoints option is enabled to improve the profile precision

On Wed, 5 Mar 2025 at 12:38, Benedict Elliott Smith <bened...@apache.org> wrote:

> Some quick thoughts of my own…
>
> === Performance ===
> - I have seen heap dumps with > 1GiB dedicated to metric counters. This patch should improve this, while opening up room to cut it further, steeply.
> - The performance improvement in relative terms for the metrics being replaced is rather dramatic - about 80%. We can also improve this further.
> - Cheaper metrics (in terms of both CPU and memory) mean we can readily have more of them, exposing finer-grained details. It is hard to overstate the value of this.
>
> === Reporting ===
> - We’re already non-standard for our most important metrics, because we had to replace the Codahale histogram years ago
> - We can continue implementing the Codahale interfaces, so that exporting libraries have minimal work to support us
> - We can probably push patches upstream to a couple of selected libraries we consider important
> - I would anyway also support picking a new reporting framework to support, but I would like us to do this with great care to avoid repeating our mistakes. I won’t have cycles to actually implement this, so it would be down to others to decide if they are willing to undertake this work
>
> I think the fallback option for now, however, is to abuse Unsafe to allow us to override the implementation details of Codahale metrics. So we can decouple the performance discussion for now from the deprecation discussion, but I think we should have a target of deprecating Codahale/DropWizard for the reasons Dmitry outlines, however we decide to do it.
>
> On 4 Mar 2025, at 21:17, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> I've got a few thoughts...
>
> On the performance side, I took a look at a few CPU profiles from past benchmarks and I'm seeing DropWizard taking ~3% of CPU time. Is there a specific workload you're running where you're seeing it take up a significant % of CPU time? Could you share some metrics, profile data, or a workload so I can try to reproduce your findings? In my testing I've found the majority of the overhead from metrics to come from JMX, not DropWizard.
>
> On the operator side, inventing our own metrics lib risks making it harder to instrument Cassandra. There are libraries out there that allow you to tap into DropWizard metrics directly. For example, Sarma Pydipally did a presentation on this last year [1] based on some code I threw together.
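Agreed, attaching an external reporter to the existing DropWizard registry is easy today, and the intention is to keep such integrations working via a Codahale compatibility layer during the transition. For context, a minimal sketch of what "tapping the registry directly" looks like with the stock ConsoleReporter (the registry instance below is a stand-in, not the actual Cassandra field):

    import com.codahale.metrics.ConsoleReporter;
    import com.codahale.metrics.Counter;
    import com.codahale.metrics.MetricRegistry;
    import java.util.concurrent.TimeUnit;

    public class MetricsTapSketch
    {
        public static void main(String[] args) throws InterruptedException
        {
            MetricRegistry registry = new MetricRegistry(); // stand-in for the server's existing registry
            Counter writes = registry.counter("org.example.writes");

            // Attach any Codahale reporter to the registry; ConsoleReporter is the simplest built-in one.
            ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                                                      .convertRatesTo(TimeUnit.SECONDS)
                                                      .convertDurationsTo(TimeUnit.MILLISECONDS)
                                                      .build();
            reporter.start(1, TimeUnit.SECONDS); // report once per second

            writes.inc();
            Thread.sleep(3000); // let a few reports print
            reporter.stop();
        }
    }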
> If you're planning on making it easier to instrument C* by supporting sending metrics to the OTel collector [2], then I could see the change being a net win as long as the perf is no worse than the status quo.
>
> It's hard to know the full extent of what you're planning and the impact, so I'll save any opinions till I know more about the plan.
>
> Thanks for bringing this up!
> Jon
>
> [1] https://planetcassandra.org/leaf/apache-cassandra-lunch-62-grafana-dashboard-for-apache-cassandra-business-platform-team/
> [2] https://opentelemetry.io/docs/collector/
>
> On Tue, Mar 4, 2025 at 12:40 PM Dmitry Konstantinov <netud...@gmail.com> wrote:
>
>> Hi all,
>>
>> After a long conversation with Benedict and Maxim in CASSANDRA-20250 <https://issues.apache.org/jira/browse/CASSANDRA-20250>, I would like to raise and discuss a proposal to deprecate Dropwizard/Codahale metrics usage in the next major release of the Cassandra server and drop it in the following major release.
>> Instead, our own Java API and implementation should be introduced.
>> For the next major release the Dropwizard/Codahale API is still planned to be supported, by extending the Codahale implementations, to give potential users of this API enough time for the transition.
>> The proposal does not affect the JMX API for metrics; it is only about local Java API changes within the Cassandra server classpath, so it matters for the cases where somebody outside of the Cassandra server code relies on the Codahale API in some kind of extension or agent.
>>
>> Reasons:
>> 1) The Codahale metrics implementation is not very efficient from a CPU and memory usage point of view. In the past we already replaced the default Codahale Reservoir implementation with our custom one, and now in CASSANDRA-20250 <https://issues.apache.org/jira/browse/CASSANDRA-20250> we (Benedict and I) want to add a more efficient implementation of the Counter and Meter logic. So, in total we do not have much logic left from the original library (mostly MetricRegistry as a container for metrics) and the majority of the logic is implemented by ourselves.
>> We use metrics a lot along the read and write paths and they contribute a visible overhead (for example, for a plain write load it is about 9-11% according to the async-profiler CPU profile), so we want them to be highly optimized.
>> From a memory perspective, Counter and Meter are built on top of LongAdder, which is quite heavy for the number of instances we create and use.
>>
>> 2) Codahale metrics does not provide any way to replace the Counter and Meter implementations. There are no fully functional interfaces for these entities, and MetricRegistry has casts/checks against the concrete implementations and cannot work with anything else.
>> I looked through the already reported issues and found the following similar and unsuccessful attempt to introduce interfaces for metrics:
>> https://github.com/dropwizard/metrics/issues/2186
>> as well as other older attempts:
>> https://github.com/dropwizard/metrics/issues/252
>> https://github.com/dropwizard/metrics/issues/264
>> https://github.com/dropwizard/metrics/issues/703
>> https://github.com/dropwizard/metrics/pull/487
>> https://github.com/dropwizard/metrics/issues/479
>> https://github.com/dropwizard/metrics/issues/253
>>
>> So, the option of getting extensibility from upstream Codahale metrics does not look realistic.
>>
>> 3) It looks like the library is in maintenance mode now: the 5.x version is on hold and many integrations are not very alive either.
>> The main benefit of using Codahale metrics should be the huge number of reporters/integrations, but if we carefully check the list of reporters mentioned here:
>> https://metrics.dropwizard.io/4.2.0/manual/third-party.html#reporters
>> we can see that almost all of them are dead/archived.
>>
>> 4) In general, exposing other 3rd-party libraries as our own public API frequently creates too many limitations and issues (Guava is another typical example which I saw previously: it is easy to start with, but later you struggle more and more).
>>
>> Does anyone have any questions or concerns regarding this suggestion?
>> --
>> Dmitry Konstantinov

--
Dmitry Konstantinov
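P.S. To make the "keep supporting the Codahale API by extending the Codahale implementations" part a bit more concrete, below is a rough sketch of the general shape such a compatibility shim could take. It is only an illustration under my own assumptions (the class name and the internal LongAdder placeholder are made up), not the actual CASSANDRA-20250 patch:

    import java.util.concurrent.atomic.LongAdder;

    // Sketch only: a Cassandra-owned counter that still extends the concrete
    // com.codahale.metrics.Counter class, so existing Codahale-based reporters
    // and agents keep working during the deprecation window, while the hot path
    // goes through our own state (a plain LongAdder here, as a placeholder for
    // whatever cheaper structure the internal API ends up using).
    public class CompatibilityCounterSketch extends com.codahale.metrics.Counter
    {
        private final LongAdder value = new LongAdder();

        @Override public void inc()       { value.increment(); }
        @Override public void inc(long n) { value.add(n); }
        @Override public void dec()       { value.decrement(); }
        @Override public void dec(long n) { value.add(-n); }
        @Override public long getCount()  { return value.sum(); }
    }

Note that such a counter still has to be registered via MetricRegistry#register (or a supplier overload) rather than the plain counter(name) convenience method, which always instantiates the stock class; this is related to the extensibility limitation described in point 2 above.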