Hi Asaf,

This is a great topic for discussion, and your document is extremely thorough! I agree with the general proposal to improve Pulsar's metrics.

> *Metrics Cardinality:*

+100. If we want to scale Pulsar (and I do!), we need to make this manageable.

> *Consolidate into a single library:*

This makes sense to me, and it ensures that new metrics won't end up exposed through one API but missing from another.

I haven't read the whole doc, but I did read the suggested improvements. Here are some additional improvements that I've thought about before.

Are there any metrics we can drop? This would definitely require a community effort to verify, but I think it could prove valuable.

Can we make the number of histogram buckets configurable? I proposed this here [0].

Would it be possible to produce a script to help users convert existing Grafana dashboards to work with the new metrics?

Finally, it'd be great to create a metrics section in the contributors guide when you've completed your work. That will help existing and new contributors adjust to the new style.

Thanks,
Michael

[0] https://github.com/apache/pulsar/issues/12069

On Mon, Oct 3, 2022 at 3:36 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
> Hi All,
>
> I would like to share with you a document I wrote during the last months
> titled Pulsar Metrics - Current State and Future Directions
> <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>,
> and most importantly *get your feedback.*
>
> The initial motivation is to rethink/refactor the way metrics are used in
> Pulsar codebase to solve two large pain points:
>
> 1. *Metrics Cardinality:* As Pulsar can support up to 1M topics
> cluster-wide, this translates into ~100M unique time series, which becomes
> both an impossible cost and affects query performance and general usability
> of metrics. This issue starts surfacing even at 50k-100k topics.
>
> Today users work-around it by disabling topic-granularity metrics and
> scripting their own ETL for generating metrics they can use (based on admin
> stats API), switching between granular topic-level metrics to a group-by
> view of their choosing.
>
> The document outlines a solution built upon the notion of Groups, in which
> users can define a group of metrics, and specify if they wish to define a
> roll-up on it (i.e. remove labels) and filter (i.e. remove specific
> metrics).
> The solution should be able to bring the granularity from topic level (1M)
> to group level (1000).
>
> 2. *Consolidate into a single library:* Today there are 4 different metrics
> libraries/systems in Pulsar. This creates lots of confusion and unhappy
> developer experience, among other impacts. Also achieving (1) requires
> having (2).
>
> The document outlines the different libraries, their functionality and the
> problems they create. The doc also describes one idea for such a library,
> but it still requires a POC.
>
>
> The main goal of the document is mainly to garner feedback to see if the
> directions stipulated there are agreed upon, and if there is any other
> problem missing or existing functionality missed as it serves as the basis
> for the requirements for the solution that will be chosen.
>
> Thanks!
>
> Asaf Mesika
>
> Document link:
> https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing