I love that we're having a discussion about observability. A HUGE thank you to anyone willing to invest time improving it in Cassandra.
I'd really, really like to see us ship a low-overhead, Prometheus-compatible metrics endpoint out of the box in C*. All the current OSS metrics exporters that I've seen have massive overhead. I'm specifically looking for sub-10s collection on clusters with a thousand nodes and 500+ tables. That means going directly to DropWizard and skipping JMX.

I put together a POC of it a while ago here:

https://github.com/rustyrazorblade/cassandra-prometheus-exporter

Please use commit 434be099d5983d537e2c70aad745194e575bc49a as a reference. I wasn't expecting anyone to actually care about the repo, and the last commit broke it. There are some optimizations that could further improve the exporter; I was working on those when I broke the repo :/

For industry comparison, the following DBs either ship entire monitoring stacks or provide strong recommendations / solutions:

* ScyllaDB: https://www.scylladb.com/product/scylladb-monitoring-stack/
* Cockroach: https://www.cockroachlabs.com/docs/v24.2/ui-overview-dashboard
* Aerospike: https://aerospike.com/docs/monitorstack/new/components-of-monitoring-stack
* MongoDB: https://www.mongodb.com/products/platform/atlas-charts/dashboard
* Elastic: https://www.elastic.co/guide/en/elasticsearch/reference/8.15/monitoring-production.html
* Redis: https://grafana.com/grafana/dashboards/12776-redis/

Re: Logs - I wouldn't write off OTel logging [1]. OTel logs can be tagged with metadata, including the span, which lets you do some really useful diagnostics. It's a significant improvement over standard logging.

Anyways - I don't have a strong opinion on how the CEPs are done. Different ones or together, whichever works. I hope we can finally get a good metrics solution because that's an area of significant pain for end users. A lot of teams don't even have Cassandra dashboards because we currently provide zero direction.
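To make the "read the registry directly, skip JMX" idea concrete, here is a rough sketch of the shape such an endpoint could take. It is not the code from the POC above, and it is Python rather than Java just to keep it short. Everything in it is an assumption for illustration: the `REGISTRY` dict is a hypothetical stand-in for the in-process DropWizard registry, the metric names and values are made up, and `sanitize`, `render`, `serve` and port 9500 are names I picked. Metrics are emitted without `# TYPE` lines, so Prometheus would treat them as untyped.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for an in-process metrics registry: a map from metric
# name to a zero-argument callable returning the current value. The names and
# values below are invented for illustration only.
REGISTRY = {
    "Table.ReadLatency.ks1.users.Count": lambda: 1200,
    "Table.LiveSSTableCount.ks1.users": lambda: 4,
}

# Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*
_INVALID = re.compile(r"[^a-zA-Z0-9_:]")


def sanitize(name):
    """Turn a dotted metric name into a valid Prometheus metric name."""
    cleaned = _INVALID.sub("_", name)
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def render(registry):
    """Render every metric as one 'name value' line of the text format."""
    return "".join(
        f"{sanitize(name)} {registry[name]()}\n" for name in sorted(registry)
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Values are read straight from the registry on each scrape.
        body = render(REGISTRY).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9500):
    """Serve /metrics on localhost until interrupted (port is arbitrary)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at `http://127.0.0.1:9500/metrics` would be the way to try it. The point of the sketch is only that each scrape is a single in-process pass over the registry, with no per-metric remote calls in between.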
Jon

[1] https://opentelemetry.io/docs/specs/otel/logs/

Logs can be correlated with the rest of observability data in a few dimensions:

* By the time of execution. Logs, traces and metrics can record the moment of time or the range of time the execution took place. This is the most basic form of correlation.
* By the execution context, also known as the trace context. It is a standard practice to record the execution context (trace and span ids as well as user-defined context) in the spans. OpenTelemetry extends this practice to logs where possible by including TraceId and SpanId in the LogRecords. This allows to directly correlate logs and traces that correspond to the same execution context. It also allows to correlate logs from different components of a distributed system that participated in the particular request execution.
* By the origin of the telemetry, also known as the Resource context. OpenTelemetry traces and metrics contain information about the Resource they come from. We extend this practice to logs by including the Resource in LogRecords.

On Thu, Oct 3, 2024 at 6:11 AM João Reis <joaor...@apache.org> wrote:

> Reducing the scope of CEP-32 to OpenTelemetry Tracing is a good idea (or
> creating a new one). We recently added OpenTelemetry Tracing support to the
> C# driver [1] and we also decided to not include Metrics and Logs in this
> initiative because the driver already provides a way to collect metrics and
> logs so it's not as important.
>
> I believe there's also efforts to add OpenTelemetry support to the java
> driver but I'm not sure if it's limited to Tracing or if they include
> metrics and logs.
>
> [1]
> https://github.com/datastax/csharp-driver/tree/master/doc/features/opentelemetry#readme
>
> Yuki Morishita <mor.y...@gmail.com> wrote (Tuesday, 1/10/2024 at
> 07:13):
>
>> Hi,
>>
>> Since I have limited time working on the CEP-32, I'd appreciate the
>> collaboration to make this CEP the reality.
>>
>> Another thing I'm thinking of is to reduce its scope to only the
>> OpenTelemetry configuration and the way to work only with OpenTelemetry
>> Tracing.
>>
>> If it's possible to create sub CEPs, I will create the one for tracing,
>> metrics and logs. Otherwise, I can rewrite the current CEP-32 to only focus
>> on OpenTelemetry Tracing.
>> Or maybe scrap CEP-32 and create a new one for Tracing.
>>
>>
>> On Mon, Sep 23, 2024 at 11:47 AM Saranya Krishnakumar <
>> saran.krishna...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> I am interested in working on this CEP collaborating with Yuki. I
>>> recently worked on adding metrics framework in Apache Cassandra Sidecar
>>> project.
>>>
>>> Best,
>>> Saranya Krishnakumar
>>>
>>> On Thu, Sep 19, 2024 at 10:57 AM Patrick McFadin <pmcfa...@gmail.com>
>>> wrote:
>>>
>>>> Here's another stalled CEP. In this case, no discuss thread or Jira.
>>>>
>>>> Yuki (or anyone else) know the status of this CEP?
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-32%3A+%28DRAFT%29+OpenTelemetry+integration
>>>>
>>>> Patrick
>>>>
>>>