This is an automated email from the ASF dual-hosted git repository.
mmerli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/pulsar.git
The following commit(s) were added to refs/heads/master by this push:
new 0956def00ee [improve][pip] PIP-264: Enhanced OTel-based metric system
(#21080)
0956def00ee is described below
commit 0956def00ee2818ca647f07713b813f7c9813348
Author: Asaf Mesika <[email protected]>
AuthorDate: Fri Sep 1 04:06:01 2023 +0300
[improve][pip] PIP-264: Enhanced OTel-based metric system (#21080)
---
pip/pip-264.md | 1136 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1136 insertions(+)
diff --git a/pip/pip-264.md b/pip/pip-264.md
new file mode 100644
index 00000000000..aa1420aa756
--- /dev/null
+++ b/pip/pip-264.md
@@ -0,0 +1,1136 @@
+# TOC
+<!-- TOC -->
+* [TOC](#toc)
+* [Preface](#preface)
+* [TL;DR](#tldr)
+* [Background](#background)
+ * [What are metrics?](#what-are-metrics)
+ * [Messaging Metrics](#messaging-metrics)
+ * [OpenTelemetry Basic Concepts](#opentelemetry-basic-concepts)
+* [Motivation](#motivation)
+ * [Lack of a single way to define and record
metrics](#lack-of-a-single-way-to-define-and-record-metrics)
+ * [High topic count (cardinality) is not
supported](#high-topic-count-cardinality-is-not-supported)
+ * [Summary is widely used, but not aggregatable across brokers /
labels](#summary-is-widely-used-but-not-aggregatable-across-brokers--labels)
+ * [Existing Prometheus Export is hard to
use](#existing-prometheus-export-is-hard-to-use)
+ * [Integrating metrics as Plugin author is hard, labor-intensive and
prevents common
functionality](#integrating-metrics-as-plugin-author-is-hard-labor-intensive-and-prevents-common-functionality)
+ * [Lack of ability to introduce another metrics
format](#lack-of-ability-to-introduce-another-metrics-format)
+ * [Adding Rates is error-prone](#adding-rates-is-error-prone)
+ * [Inline metrics documentation is
lacking](#inline-metrics-documentation-is-lacking)
+ * [Histograms can’t be visualized or used since not Prometheus
conformant](#histograms-cant-be-visualized-or-used-since-not-prometheus-conformant)
+ * [Some metrics are delta-reset, making it easy to lose data on
occasions](#some-metrics-are-delta-reset-making-it-easy-to-lose-data-on-occasions)
+ * [Prometheus client metrics use static registry, making them susceptible to
flaky
tests](#prometheus-client-metrics-use-static-registry-making-them-susceptible-to-flaky-tests)
+ * [Function custom metrics are both delta reset and limited in types to
summary](#function-custom-metrics-are-both-delta-reset-and-limited-in-types-to-summary)
+ * [Inconsistent reporting of partitioned
topics](#inconsistent-reporting-of-partitioned-topics)
+ * [System metrics manually scraped from function processes override each
other](#system-metrics-manually-scraped-from-function-processes-override-each-other)
+ * [No consistent naming convention used throughout pulsar
metrics](#no-consistent-naming-convention-used-throughout-pulsar-metrics)
+* [Goals](#goals)
+ * [In Scope](#in-scope)
+ * [Out of Scope](#out-of-scope)
+* [High Level Design](#high-level-design)
+ * [Consolidating to OpenTelemetry](#consolidating-to-opentelemetry)
+ * [Aggregate and Filtering to solve cardinality
issues](#aggregate-and-filtering-to-solve-cardinality-issues)
+ * [Aggregation](#aggregation)
+ * [Filtering](#filtering)
+ * [Removing existing metric toggles](#removing-existing-metric-toggles)
+ * [Summary](#summary)
+ * [Changing the way we measure metrics](#changing-the-way-we-measure-metrics)
+ * [Moving topic-level Histograms to namespace and broker level
only](#moving-topic-level-histograms-to-namespace-and-broker-level-only)
+ * [Integrating Messaging metrics into Open
Telemetry](#integrating-messaging-metrics-into-open-telemetry)
+ * [Switching Summary to Histograms, in namespace/broker level
only](#switching-summary-to-histograms-in-namespacebroker-level-only)
+ * [Specifying units for Histograms](#specifying-units-for-histograms)
+ * [Removing Delta Reset](#removing-delta-reset)
+ * [Reporting topic and partition all the
time](#reporting-topic-and-partition-all-the-time)
+ * [Metrics Exporting](#metrics-exporting)
+ * [Avoiding static metric registry](#avoiding-static-metric-registry)
+ * [Metrics documentation](#metrics-documentation)
+ * [Integration with BK](#integration-with-bk)
+ * [Integrating with Pulsar Plugins](#integrating-with-pulsar-plugins)
+ * [Supporting Admin Rest API Statistics
endpoints](#supporting-admin-rest-api-statistics-endpoints)
+ * [Fixing Rate](#fixing-rate)
+ * [Function metrics](#function-metrics)
+ * [Background](#background-1)
+ * [Solving it in Open Telemetry](#solving-it-in-open-telemetry)
+ * [Collecting metrics](#collecting-metrics)
+ * [Removing 1min metrics](#removing-1min-metrics)
+ * [Supporting `getMetrics` RPC](#supporting-getmetrics-rpc)
+ * [Removing `resetMetrics` RPC](#removing-resetmetrics-rpc)
+ * [Supporting Python and Go Functions](#supporting-python-and-go-functions)
+ * [Summary](#summary-1)
+* [Detailed Design](#detailed-design)
+ * [Topic Metric Group configuration](#topic-metric-group-configuration)
+ * [Integration with Pulsar Plugins](#integration-with-pulsar-plugins)
+ * [Why OpenTelemetry?](#why-opentelemetry)
+ * [What’s good about OTel?](#whats-good-about-otel)
+ * [What we need to fix in
OpenTelemetry](#what-we-need-to-fix-in-opentelemetry)
+ * [Specifying units for histograms](#specifying-units-for-histograms-1)
+ * [Filtering Configuration](#filtering-configuration)
+ * [Which summaries / histograms are
effected?](#which-summaries--histograms-are-effected)
+ * [Summaries](#summaries)
+ * [Histograms](#histograms)
+ * [Fixing Grafana Dashboards in
repository](#fixing-grafana-dashboards-in-repository)
+* [Backward Compatibility](#backward-compatibility)
+ * [Breaking changes](#breaking-changes)
+* [API Changes](#api-changes)
+* [Security](#security)
+* [Links](#links)
+<!-- TOC -->
+
+# Preface
+
+Roughly 11 months ago, I started working on solving the biggest issue with
Pulsar metrics: the lack of ability to monitor a pulsar broker with a large
topic count: 10k, 100k, and future support of 1M. This started by mapping the
existing functionality and then enumerating all the problems I saw (all
documented in this
[doc](https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing)).
+
+This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs) all the
current metric system's problems and provide the ability to monitor a broker
with a large topic count, which is currently lacking. As a parent PIP, it will
describe each problem and its solution at a high level, leaving fine-grained
details to the sub-PIPs. The parent PIP ensures all solutions align and do not
contradict each other.
+
+The basic building block for enabling monitoring of a large topic count is
aggregating internally (into topic groups) and adding fine-grained filtering.
We could have shoehorned this into the existing metric system, but we thought
adding it to a system already ingrained with so many problems would be wrong
and hard to do gradually, as so many things would break. This is why the
second-biggest design decision presented here is consolidating all existing
metric libraries into a single one [...]
+
+I made every effort to summarize this document so that it can be concise yet
clear. I understand it is an effort to read it and, more so, provide meaningful
feedback on such a large document; hence I’m very grateful for each individual
who does so.
+
+I think this design will help improve the user experience immensely, so it is
worth the time spent reading it.
+
+# TL;DR
+Working with Metrics today as a user or a developer is hard and has many
severe issues.
+
+From the user perspective:
+
+* One of Pulsar's strongest features is "cheap" topics, so you can easily have
10k - 100k topics per broker. Once you do that, you quickly learn that the
amount of metrics you export via "/metrics" (Prometheus-style endpoint) becomes
massive. The cost to store them becomes too high, queries time out, or even the
"/metrics" endpoint itself times out, due to the heavy CPU and memory cost of
processing so many metrics.
+* The only option Pulsar gives you today is all-or-nothing filtering and very
crude aggregation. You switch metrics from topic aggregation level to namespace
aggregation level. Also, you can turn off producer and consumer level metrics.
You end up doing all of that, leaving you "blind", looking at metrics at the
namespace level, which is too high level. You end up conjuring all kinds of
scripts on top of the topic stats endpoint to glue together some aggregated
metrics view for the topics you need.
+* Summaries (a metric type giving you quantiles like p95), which are used in
Pulsar, can't be aggregated across topics / brokers due to their inherent design.
+* Plugin authors spend too much time on defining and exposing metrics to
Pulsar, since the only interface Pulsar offers is writing your metrics yourself
as UTF-8 bytes in Prometheus text format to a byte-stream interface given to
you.
+* Pulsar histograms are exported in a way that is not conformant with
Prometheus, which means you can't get the p95 quantile on such histograms,
making them very hard to use in day to day life.
+* Too many metrics are rates, which delta-reset on every interval you
configure in Pulsar and on restart, instead of relying on cumulative
(ever-growing) counters and letting Prometheus use its rate function.
+* And many more issues
+
+From the developer perspective:
+
+* There are 4 different ways to define and record metrics in Pulsar: Pulsar's
own metrics library, Prometheus Java Client, the Bookkeeper metrics library and
plain native Java SDK objects (AtomicLong, ...). It's very confusing for the
developer and creates inconsistencies for the end user (Summary, for example,
is different in each).
+* Patching your metrics into "/metrics" Prometheus endpoint is confusing,
cumbersome and error-prone.
+* Many more
+
+This proposal offers several key changes to solve that:
+
+* Cardinality (supporting 10k-100k topics per broker) is solved by introducing
a new aggregation level for metrics called Topic Metric Group. Using
configuration, you specify for each topic its group (using wildcard/regex).
This allows you to "zoom out" to a granularity level, groups, that is still more
detailed than namespaces, while you control how many groups you'll have, hence
solving the cardinality issue without sacrificing the level of detail too much.
+* A fine-grained, dynamic filtering mechanism. You'll have rule-based dynamic
configuration, allowing you to specify per namespace/topic/group which metrics
you'd like to keep/drop. Rules allow you to set the default to a small
amount of metrics at group and namespace level only and drop the rest. When
needed, you can add an override rule to "open" up a certain group to have more
metrics at higher granularity (topic or even consumer/producer level). Since
it's dynamic, you "open" such [...]
+
+Aggregation and filtering combined solve the cardinality issue without
sacrificing the level of detail when needed, and most importantly, you
determine which topic/group/namespace it happens for.
+
+Since this change is so invasive, it requires a single metrics library to
implement all of it on top of; hence the third big change is
consolidating all four ways to define and record metrics into a single, new
one: OpenTelemetry Metrics (the Java SDK, and also Python and Go for the Pulsar
Function runners).
+Introducing OpenTelemetry (OTel) also solves the biggest pain point from the
developer perspective, since it's a superb metrics library offering everything
you need, and there is going to be a single way - only it. It also solves
robustness for plugin authors, who will use OpenTelemetry. It so happens that
it also solves all the numerous problems described in the doc itself.
+
+The solution will be introduced as another layer with feature toggles, so you
can work with the existing system and/or OTel, until the existing system is
gradually deprecated. Pulsar OTel metrics will support exporting as a Prometheus
HTTP endpoint (`/metrics`, but on a different port) for backward compatibility,
and also OTLP, so you can push the metrics to an OTel Collector and from there
ship them to any destination.
+
+It's a big breaking change for Pulsar users on many fronts: names, semantics,
configuration. Read the end of this doc to learn exactly what will change
for the user (at a high level).
+
+In my opinion, it will make the Pulsar user experience so much better that
users will want to migrate to it, despite the breaking change.
+
+This was a very short summary. You are most welcome to read the full design
document below and express feedback, so we can make it better.
+
+# Background
+* [What are metrics?](#what-are-metrics)
+* [Messaging Metrics](#messaging-metrics)
+* [OpenTelemetry Basic Concepts](#opentelemetry-basic-concepts)
+
+
+## What are metrics?
+
+Any software we know of in the world today exposes metrics to show, in an
aggregated way, what transpires in it. A metric is defined as:
+
+- A name - e.g. `pulsar_rate_in`
+- Attributes/labels - the context to which the following number applies.
For example `cluster=europe-cluster, topic=orders`
+- Timestamp - the time at which the following number was measured. Usually
presented in epoch seconds: `1679403600`, which means `Tuesday, March 21, 2023
1:00:00 PM`
+- Value - a numerical value. For example, 15,345. For `pulsar_rate_in` it
means 15,345 bytes received in the interval (e.g., 1min)
+
+Composing it all together looks like this:
+
+`pulsar_rate_in {cluster="europe-cluster", topic="orders"} 1679403600 15345`
+
+Metrics usually come in all kinds of types:
+
+- Counter - a number that keeps increasing. Most of the time used to measure a
rate.
+- Gauge - a number that can go up and down. Example: How many HTTP requests
are in-flight right now to the Admin API for a given broker?
+- Histogram - most of the time, it’s an explicit bucket histogram: you define
several buckets, and each bucket has a range of values. Each time you report a
value to the histogram, it finds the bucket in which the value falls inside
that range and increases its counter by 1.
+- Summary - You record values to it, for example the latency of an API call to
write to Bookkeeper. Once requested, it will give you the percentiles of those
values, over the last X minutes or until it is reset. For example, the 95th
percentile of the last two minutes is a number below which 95% of all values
recorded to the summary over the last two minutes fall.
+
+As opposed to logs which tell you a step-by-step story, metrics give you an
aggregated view of Pulsar behavior over time.
+
+## Messaging Metrics
+
+Pulsar's main feature is messaging: receiving messages from producers, and
dispatching messages to consumers via subscriptions.
+
+The metrics related to messaging are divided into a couple of aggregation
levels: broker, namespace, topic, subscription, consumer and producer.
+
+Some levels can have high cardinality, hence Pulsar offers several
configuration toggles to minimize the number of unique time series exported:
+
+- `exposeTopicLevelMetricsInPrometheus` - Setting this value to false will
cause metrics to be reported at namespace-level granularity only, while if
true, the metrics will be reported at topic-level granularity.
+- `exposeConsumerLevelMetricsInPrometheus` - Setting this value to false will
filter out any consumer-level metric, i.e. filtering out `pulsar_consumer_*`
metrics.
+- `exposeProducerLevelMetricsInPrometheus` - Setting this value to false will
filter out any producer-level metric, i.e. filtering out `pulsar_producer_*`
metrics.
+
+## OpenTelemetry Basic Concepts
+
+- Measurement - the number you record. It can be for example “5” ms in the
case of HTTP request latency histogram, or +2 in the case of an increment to a
counter
+- Instrument - the object through which you record measurements.
+- Instrument Types
+ - Counter
+ - A number that only increases
+ - UpDown Counter
+ - A number that can increase or decrease
+ - Gauge
+ - A number that can increase or decrease, but can’t be aggregated
across attributes. For example: temperature. If room 1 has 35c and room 2 has
40c, you can’t add them to get a meaningful number as opposed to number of
requests.
+ - Histogram
+ - Records numbers and when asked shows a statistical analysis on it.
Example: explicit bucket histogram, which shows count per buckets, where each
bucket represents a value range.
+- Attributes
+ - List of (name, value) pairs. Example: `cluster=eu-cluster, topic=orders`
+ - Usually when recording a value to an instrument, e.g. counter, you do it
in the context of an attribute set.
+- Meter
+ - A factory object through which you create instruments. All instruments
created through it belong to it.
+ - A Meter has a name and a version. Pulsar can have a “pulsar” meter with
its corresponding version. Plugins can have their own meter, with a matching
version.
+ - The name and version will be available via attributes when exported to
Prometheus, or any other time-series database.
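+
+To make these concepts concrete, here is a minimal sketch (not Pulsar code) using the OpenTelemetry Java API; the meter name, version and the instrument/attribute names are illustrative assumptions only:
+
+```java
+import io.opentelemetry.api.OpenTelemetry;
+import io.opentelemetry.api.common.Attributes;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.Meter;
+
+class MeterConceptsExample {
+    void example(OpenTelemetry openTelemetry) {
+        // Meter: the factory for instruments, identified by a name and version
+        Meter meter = openTelemetry.meterBuilder("pulsar")
+                .setInstrumentationVersion("0.0.0-example")
+                .build();
+
+        // Instrument: a Counter in this case
+        LongCounter messagesIn = meter.counterBuilder("pulsar.example.messages.in")
+                .setDescription("Number of messages received")
+                .build();
+
+        // Attributes: the context of the measurement
+        Attributes attributes = Attributes.builder()
+                .put("cluster", "eu-cluster")
+                .put("topic", "orders")
+                .build();
+
+        // Measurement: +2 recorded against that attribute set
+        messagesIn.add(2, attributes);
+    }
+}
+```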
+
+# Motivation
+
+* [Lack of a single way to define and record
metrics](#lack-of-a-single-way-to-define-and-record-metrics)
+* [High topic count (cardinality) is not
supported](#high-topic-count-cardinality-is-not-supported)
+* [Summary is widely used, but not aggregatable across brokers /
labels](#summary-is-widely-used-but-not-aggregatable-across-brokers--labels)
+* [Existing Prometheus Export is hard to
use](#existing-prometheus-export-is-hard-to-use)
+* [Integrating metrics as Plugin author is hard, labor-intensive and prevents
common
functionality](#integrating-metrics-as-plugin-author-is-hard-labor-intensive-and-prevents-common-functionality)
+* [Lack of ability to introduce another metrics
format](#lack-of-ability-to-introduce-another-metrics-format)
+* [Adding Rates is error-prone](#adding-rates-is-error-prone)
+* [Inline metrics documentation is
lacking](#inline-metrics-documentation-is-lacking)
+* [Histograms can’t be visualized or used since not Prometheus
conformant](#histograms-cant-be-visualized-or-used-since-not-prometheus-conformant)
+* [Some metrics are delta-reset, making it easy to lose data on
occasions](#some-metrics-are-delta-reset-making-it-easy-to-lose-data-on-occasions)
+* [Prometheus client metrics use static registry, making them susceptible to
flaky
tests](#prometheus-client-metrics-use-static-registry-making-them-susceptible-to-flaky-tests)
+* [Function custom metrics are both delta reset and limited in types to
summary](#function-custom-metrics-are-both-delta-reset-and-limited-in-types-to-summary)
+* [Inconsistent reporting of partitioned
topics](#inconsistent-reporting-of-partitioned-topics)
+* [System metrics manually scraped from function processes override each
other](#system-metrics-manually-scraped-from-function-processes-override-each-other)
+* [No consistent naming convention used throughout pulsar
metrics](#no-consistent-naming-convention-used-throughout-pulsar-metrics)
+
+The current metric system has several problems which act as the motivation for
this PIP. Each subsection below explains the background to the problem and the
actual problem. No prior knowledge is required.
+
+## Lack of a single way to define and record metrics
+
+In Pulsar there are multiple ways to define and record metrics - i.e. several
metric libraries:
+
+- Prometheus Client
+ - Prometheus has a client library in Java, providing objects to define and
record several types of metrics: Gauge, Histogram, Counter (a.k.a. Collectors)
+ - Most of the time, the static Collector registry is used. On some
occasions, dedicated collector registries are created.
+- Pulsar Metrics library
+ - Pulsar’s own metric library, providing objects to define and record
several types of metrics:
+ - Histogram: `StatsBuckets`
+ - Rates: `Rate`.
+ - Summary: `Summary` - An extension for Prometheus Client library
providing a more performant version of Summary.
+- Bookkeeper Metrics API Implementation
+ - Apache Bookkeeper (BK) has its own metrics library, divided into an API
and SDK (implementation).
+ - Pulsar has implemented the API, for several purposes
+ - Integrate BK client metrics into Pulsar metrics exporter
(**described below**).
+ - Use BK objects which use the BK Metrics API and integrate their metrics
into Pulsar metrics exporter. Examples: `OrderedExecutors`.
+ - The BK code used in Pulsar is BK Client and OrderedExecutors.
+ - Support Pulsar code which directly uses this API in Pulsar.
`PulsarZooKeeperClient` and several Pulsar plugins are the most prominent
examples
+- Native Java SDK
+ - Plain java objects: `LongAdder` to act as Counters, `AtomicLong` or
primitive long with atomic updater to act as Gauge.
+
+Having multiple metric libraries is a problem for several reasons:
+
+- Confusing
+ - Developers don’t really know which one to use
+- Completely different
+ - Each one of them is different from the others. Prometheus Client uses
labels to record values, while the Pulsar Metrics library and the native Java SDK
store the labels separately and stitch them together only upon exporting
+- Different implementations for same exported type
+ - `Summary` by the Pulsar Metrics library uses a fixed time window (1 min) to
reset and start accumulating metrics, while the Prometheus Client Summary uses a
moving time window of 10 minutes.
+ - `StatsBucket` by Pulsar Metrics library resets its bucket counters every
interval (1 min) while Prometheus Client `Histogram` does not.
+ - This creates confusion both for developers and users
+- Different usage
+ - With Pulsar Metrics library and Java SDK, you must remember to follow
certain conventions to reset the metrics explicitly and register them for
export explicitly. With BK Metrics implementation and Prometheus Client,
exporting is implicitly done for you.
+
+I would summarize it with one word: confusion.
+
+## High topic count (cardinality) is not supported
+
+Pulsar, as users know, is unique in allowing you to use a very high number of
topics in a cluster - up to 1M. It’s not uncommon to find a broker with 10k up
to 100k topics hosted on it.
+
+For each topic, Pulsar exposes roughly 80 - 100 unique time series (metrics).
A single broker with 100k topics will emit 10M unique time series (UTS). This
usually results in the following for the user:
+
+- A single Prometheus, even for a single broker, will not suffice. This forces
the user to switch to complicated distributed time series systems like Cortex,
M3, VictoriaMetrics as they can horizontally scale.
+- If the user works with an observability vendor like DataDog or
[Logz.io](http://Logz.io), the cost of 10M UTS per broker makes it too
expensive to monitor.
+- If the user is fortunate enough to have their own team dedicated to deploying
a time series database, the query will most probably time out due to the huge
amount of time series that must be read. For vendors, it will either time out or
make the query too expensive to use.
+- Heavy performance cost on Pulsar in terms of CPU and memory allocation to
handle the huge amount of topics, which translates into many attribute sets.
+
+Hence, the common user behavior is:
+
+- Toggle-off topic level metrics and below (consumer/producer/subscription),
leaving them with only namespace level monitoring
+- Develop their own scripts to call topics `stats` Admin API, to filter only
the metrics they need, and aggregate to a level with reasonable cardinality.
+- Ship it to a vendor and bear the high cost
+
+The filtering supported today is toggle-based (all or nothing) for certain
levels, hence very coarse. You can toggle between namespace and topic level. If
you choose topic level, you can toggle consumer- and producer-level metrics
separately.
+
+The aggregations provided today are:
+
+- Broker level (just merged this March 2023)
+- Namespace level or Topic level (normal topics and the partitions of a topic,
not partitioned topic level)
+
+## Summary is widely used, but not aggregatable across brokers / labels
+
+Summary, as explained in the Background section, is used to provide quantile
values like p95, p75, for certain measurements - mostly latencies, but
sometimes sizes. It is widely used in Pulsar: Ledger Offloader, Resource
Groups, Replicated Subscriptions, Schema Registry, Broker Operability,
Transaction Managements, Pulsar Functions and Topics, just to name a few.
+
+The biggest problem with Summaries is that quantiles are not aggregatable
mathematically. You can’t know what the p95 of the schema registry “get”
operation latency across the cluster is by knowing each p95 per broker. Any math
you’ll do on it will result in a large numerical error. This means that beyond
the scope of a broker or label set (i.e. topic/partition) it’s unusable.
+
+For example, the user is mostly interested in the topic publish latency, and
not the topic-partition publish latency. Without aggregating, it’s impossible
to know the publish latency for a topic with partitions, because we need to
aggregate the publish latency of each topic-partition, but we can’t aggregate
summaries as explained above.
+
+This is not the case for Explicit Bucket Histograms or the new type called
Exponential Bucket Histogram. They are aggregatable and produce quantiles
(by extrapolation) with a small margin of error.
+
+## Existing Prometheus Export is hard to use
+
+Most metrics frameworks offer you objects you create and then register to a
registry. The exporting is taken care of for you. In Prometheus, for example,
you create a Counter and register it to the static collector registry. The
exporter simply exports all collectors (the counter being one) registered to
the static registry.
+
+In Pulsar, since it has 4 different libraries you can use to define a metric,
the exporter had to be written by Pulsar, to patch all of them together into a
single response to the GET `/metrics` request.
+
+If you have used Prometheus Client, you’re all set, as that integration was
written for you. The problem is that most usages are not of that library, since
it has serious performance issues, especially with high cardinality metrics
(like topics).
+
+Using all other libraries, you’re basically required to write a function that
has the following signature: `void writeMetrics(SimpleTextOutputStream stream)`
for each class containing metrics. Then you add a call to that function in
`PrometheusMetricsGenerator`. The argument `stream` is basically a byte-array
you’re writing bytes into, which represents the response body for `/metrics`
that is about to be delivered. You need to be aware of that, and write the
current state of each metric [...]
+
+This presents multiple problems:
+
+- The logic of printing metrics to a stream in Prometheus format is copied and
pasted in many classes, as there are no types in this stream - it’s just a byte
array.
+- The Prometheus format dictates that all metric data points with the same metric
name be written one after another. The current “API”, which just writes text to
a stream (in Prometheus text format), collides with that since it does not enforce
it. It forced Pulsar into a complicated interim solution (see the
`PrometheusMetricStreams` class), which holds a stream per metric name.
+- Sometimes even the logic of flushing the stream was implemented more than
once (e.g. `FunctionMetricsResource` writing its own `Writer` using a
heap-based `ByteBuf` instead of direct memory)
+- There’s no single place to define shared labels. For example the `cluster`
label must be added manually by any function.
+- It’s error-prone - you can forget to follow all those steps to export
+- It’s confusing for developers
+- It’s a lot of work for developers, when adding metrics to their features
+
+## Integrating metrics as Plugin author is hard, labor-intensive and prevents
common functionality
+
+Plugins have their own metrics. Most plugins were written to run inside Pulsar
(you supply JARs or NARs loaded on Broker initialization). Pulsar doesn't
provide a single interface through which you create metric objects and register
them and integrate with Pulsar metrics reporting. Due to that, the following
happens:
+
+- Plugin authors choose all sorts of metric libraries: BK Metrics API and SDK,
Prometheus, and more.
+- If they chose Prometheus and use the static collector, they need to do
nothing: the metrics get emitted with Pulsar's metrics. This is not well known, nor
is it a typed way to define an interface between Pulsar and plugins.
+- If they chose other libraries, Pulsar provides plugin authors a way to
interface their metric library with Pulsar’s, using the following
interface: `PrometheusRawMetricsProvider`, which contains a single method: `void
generate(SimpleTextOutputStream stream)`. This basically means they need to
implement this function so it reads the metrics from the framework they
chose, and writes them in Prometheus exposition format, as bytes.
+
+Due to that, most plugin developers are forced to write their metric exporting
logic on their own, causing more work to be done for them.
+
+Since the interface is very low level, it creates several difficulties going
forward:
+
+- Making sure Prometheus metrics are printed according to its format (for each
name, all attributes are printed one after the other) is very difficult. You
can’t do that easily with the current interface.
+- If you want to introduce any common mechanism for filtering or any other
work on those metrics, you can’t since it forces you to decode them from the
text format, which will consume too much CPU and memory.
+
+## Lack of ability to introduce another metrics format
+
+Due to multiple libraries and the lack of a high-level metrics interface for
plugins, it’s basically impossible to add another export format in a performant
manner suitable for a latency-sensitive system such as Pulsar. The metrics system
today is coupled to the Prometheus format, thus preventing any addition of a new,
better format.
+
+Take, for example, OTLP, a new protocol for traces/logs/metrics. It’s more
optimized than Prometheus since, for example, for histograms it mentions the
attributes once for all buckets, rather than repeating them for each bucket as
the Prometheus format does.
+
+OTLP can’t be added to Pulsar as another exporting mechanism in the current
metric system.
+
+## Adding Rates is error-prone
+
+The way `Rate` is built forces the developer to both:
+
+1. Manually add printing of the newly created instance's value to the
function which writes the metrics in Prometheus format to
`SimpleTextOutputStream` (each class has such a method)
+2. Add a call to `reset()` on the `Rate` instance to a function which runs
periodically.
+
+Neither is something a developer can figure out on their own, and therefore it
is easy to forget to do, or even to call twice by mistake. It also wastes time
learning how to do it.
+
+## Inline metrics documentation is lacking
+
+Each metric, in Prometheus format, should contain a line starting with `# HELP`,
which allows Prometheus to parse that line and add a description to the
metric, which is later used by UIs like Grafana to better explain the
available metrics.
+
+Since there are 4 metric libraries, only Prometheus Client offers the typed
option of including a description. Most metrics are not using it, hence lack a
decent help line.
+
+## Histograms can’t be visualized or used since not Prometheus conformant
+
+The main histogram used in Pulsar is `StatsBucket`. It has two major problems:
+
+1. Bucket counters in it are reset to 0 every 1 min (configurable). This goes
against the Prometheus assumption that bucket counters only increase. As such, it
prevents using Prometheus functions on it like calculating quantiles
(`histogram_quantile`), which is the main reason to use histograms.
+2. When exported, the bucket label is encoded in the metric name, and not as
`le` label as Prometheus expects, hence makes it impossible to use
`histogram_quantile` and calculate quantiles on it.
+
+ For example: `pulsar_storage_write_latency_le_10` should have been
`pulsar_storage_write_latency{le="10"}`
+
+
+## Some metrics are delta-reset, making it easy to lose data on occasions
+
+`Rate`, `StatsBucket`, `Summary`, some exported JVM metrics and Pulsar
Function metrics are reset to 0 every 1 min (configurable). This means that if,
for some reason, Prometheus or any other agent fails to scrape the metrics
for several minutes, you lose visibility into Pulsar during those minutes.
When using counters / histograms which are only incremented, the rate is
calculated as a delta on the counter values, hence two measurements 5 minutes
apart will still give you a decent [...]
+
+## Prometheus client metrics use static registry, making them susceptible to
flaky tests
+
+Most usage of the Prometheus Client library is done using the static Collector
registry. This exposes it to flaky behavior across tests, as static variables
are shared across tests and not cleaned between them. When using a non-static
registry, it is inherently reset for every new test, but this is not the case
here.
+
+## Function custom metrics are both delta reset and limited in types to summary
+
+A Pulsar Function author has a single way to add metrics to their function:
using the method `void recordMetric(String metricName, double value)` on the
`BaseContext` interface.
+
+What it does is record this `value` in a Prometheus Summary named
`pulsar_function_user_metric_`, under the label `metric={metricName}`.
+
+This Summary metric is also being reset every 1min by the wrapper code running
the user function.
+
+It has the following problems:
+
+- The values are reset every 1 min, hence subject to data loss as presented
above.
+- The user is forced to use a Summary only, and is not offered the ability to use
types like Counter, Histogram or Gauge. Users find all sorts of hacks around
it to represent counters using the summary’s count and sum.
+
+## Inconsistent reporting of partitioned topics
+
+There is a configuration specifying whether a partitioned topic, composed of
several partitions (each of which is a topic), will be reported using the `topic`
label only (i.e. `topic=incoming-logs-partition-2`) or split into `topic` and
`partition` (i.e. `topic=incoming-logs, partition=2`).
+
+The problem is that this configuration is only applied to messaging related
metrics (namespace, topic, producer, consumer, subscription). It is not applied
to any other metric which contains the topic label, such as Transactions,
Ledger Offloader, etc. This creates inconsistency in reported metrics.
+
+## System metrics manually scraped from function processes override each other
+
+Pulsar Functions are launched by instances (processes) of Pulsar Function
Worker. It supports 3 types of runtimes (function launchers):
+
+1. Thread - run the function in a new thread inside the Function Worker process
+2. Process - launch a process, which will run wrapper code executing the
function (in Java, Python and Go).
+3. Kubernetes - launching a Pod, running the same wrapper code as Process
runtime.
+
+The metrics of the wrapper code, which also include the function's custom
metrics (metrics the function author adds), are exposed on a `/metrics` endpoint
by the wrapper code. In the case of Kubernetes, the pod is annotated such that
Prometheus operators will scrape those metrics directly. In the case of Thread
runtime, the metrics are integrated into the Function Worker metrics. In the
case of Process runtime, the Function Worker is the one scraping the `/metrics`
from each function proce [...]
+
+Process runtime also includes many JVM and system level metrics registered
using Prometheus Client built-in exporters. Because these exporters are not Pulsar
code, the metrics don't contain any special label identifying the function.
+
+When the Function Worker scrapes each `/metrics` endpoint, it simply concatenates
the responses, and since no unique label exists, the metrics override each other.
+For example, if a Function Worker launched 3 processes, one for each function,
then each will contain `jvm_memory_bytes_used{area="heap"} 2000000`, with a
different numeric value but no unique label. When concatenating the
responses from the three function processes, without any process/function label
we cannot know which process each metric arrived from, and they will override
each other.
+
+## No consistent naming convention used throughout pulsar metrics
+
+- Some domains have a metric prefix, like `pulsar_txn` for transactions
related metrics or `pulsar_schema` for Pulsar Schema metrics. Some don’t, like
metrics related to messaging (topic metrics) - for example `pulsar_bytes_in` or
`pulsar_entry_size_le_*`.
+- Some metrics start with `brk_` while others start with `pulsar_`. Some are
even replaced from `brk_` to `pulsar_` during metric export.
+
+This makes it very hard to:
+
+- Define filters. If you want to exclude messaging-related metrics, you
can’t, as `pulsar_*` will catch all of Pulsar’s other metrics.
+- Compose dashboards. It’s easier to type a domain prefix to zoom in on its
metrics, like `pulsar_ledgeroffloader_`, but it’s impossible for metrics such as
messaging metrics, which don’t really have a prefix.
+
+# Goals
+
+## In Scope
+
+- Allow monitoring Pulsar broker with very high topic count (10k - 1M),
without paying the price of high cardinality, by providing a mechanism which
aggregates topic-level metrics to an aggregation level called Topic Metric
Group which the operator controls dynamically.
+- Allow dictating (filtering) which metrics will be exported, at any
granularity level: namespace, topic metric group, topic, consumer,
producer, subscription.
+- Replace Summary with Explicit Bucket Histogram
+- Consolidate metrics usage (define, export) to a single library (i.e.
OpenTelemetry)
+- Provide a rich typed interface to hook into Pulsar metrics system for Plugin
authors
+- Make adding a Rate robust and error-free
+- Make histogram reporting conformant with Prometheus when exported to
Prometheus format
+- Stop using static metric registries
+- Provide the ability in the future to correlate metrics with logs and traces, by
sharing context
+- Provide pluggable metrics exporting, supporting a more efficient protocol
(i.e. OTLP)
+- Support the most efficient observability protocol, OTLP.
+- Stop using delta reset, everywhere, including function metrics
+- Provide rich typed interface to define metrics for Pulsar Functions authors
+- All Pulsar metrics are properly named following a well-defined convention,
adhering to [OTel Semantic
Conventions](https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/)
for instrument and attribute naming where possible
+- All changes listed above will be at least as good, performance-wise (CPU,
latency), as the current system
+- The new system should support the maximum number of topics supported in the
current system (i.e. 4k topics) without filtering
+
+## Out of Scope
+
+- Pulsar client metrics
+
+# High Level Design
+
+* [Consolidating to OpenTelemetry](#consolidating-to-opentelemetry)
+* [Aggregate and Filtering to solve cardinality
issues](#aggregate-and-filtering-to-solve-cardinality-issues)
+* [Changing the way we measure metrics](#changing-the-way-we-measure-metrics)
+* [Integrating Messaging metrics into Open
Telemetry](#integrating-messaging-metrics-into-open-telemetry)
+* [Switching Summary to Histograms, in namespace/broker level
only](#switching-summary-to-histograms-in-namespacebroker-level-only)
+* [Reporting topic and partition all the
time](#reporting-topic-and-partition-all-the-time)
+* [Metrics Exporting](#metrics-exporting)
+* [Avoiding static metric registry](#avoiding-static-metric-registry)
+* [Metrics documentation](#metrics-documentation)
+* [Integration with BK](#integration-with-bk)
+* [Integrating with Pulsar Plugins](#integrating-with-pulsar-plugins)
+* [Supporting Admin Rest API Statistics
endpoints](#supporting-admin-rest-api-statistics-endpoints)
+* [Fixing Rate](#fixing-rate)
+* [Function metrics](#function-metrics)
+
+
+## Consolidating to OpenTelemetry
+
+We will introduce a new metrics library that will be the only metrics library
used once this PIP implementation reaches its final phase. The chosen library
is the OpenTelemetry Java SDK. Full details on what OpenTelemetry is and why it was
chosen over other metric libraries are located in the [Detailed Design
section](#why-opentelemetry) below. This section focuses on describing it at a
high level.
+
+OpenTelemetry is a project that provides several components: API, SDK,
Collector and protocol. Its main purpose is defining a standard way and a
reference implementation to define, export and manipulate observability
signals: metrics, logs and traces. In this explanation we’ll focus on the
metrics signal. The API is a set of interfaces used to define and report
metrics. The SDK is the implementation of that API and also adds functionality
such as export of metrics and the ability to change h [...]
+
+The project’s API also has a part for reporting logs and traces. One of its core
features is the ability to share common context that can be used across
metrics, logs and traces reporting (think of a thread-local context holding an
attribute which is also used when reporting metrics, logs and traces, hence
creating a link between them).
+
+In this PIP scope we will only use the Metrics API and SDK. Specifically we
will use the Java SDK in Pulsar Broker, Proxy and Function Worker, but also the
Go and Python SDK for the wrapper code which executes functions written in Go
and Python.
+
+We will keep the current metric system as is, and add a new layer of metrics
using the OpenTelemetry Java SDK: all of Pulsar’s metrics will *also* be created
using OpenTelemetry. A feature flag will allow enabling OpenTelemetry metrics
(init, recording and exporting). All the features and changes described here
will be done only in the OpenTelemetry layer, allowing the old version to keep
working until you’re ready to switch to using the OTel (OpenTelemetry)
implementation. In the far future, o [...]
+There's no need to have an abstraction layer on both the current metric system
and the new one (OTel) as OpenTelemetry API *is* the abstraction layer, and
it's the industry standard.
+
+One very big breaking change (there are several described in the High Level
Design, and also summarized in the Backward Compatibility section) is the
naming. We are changing all metric names due to several reasons:
+
+1. Attributes names (a.k.a. Labels) will utilize the Semantic Conventions
[defined](https://opentelemetry.io/docs/concepts/semantic-conventions/) in OTel.
+ 1. OpenTelemetry has defined agreed-upon names for many
attributes in the industry.
+2. Histograms bucket names will be properly encoded using `le` attribute (when
exported to Prometheus format) and not inside the metric name.
+3. Each domain in Pulsar will be properly prefixed (messaging, transactions,
offloader, etc.)
+
+The move to OpenTelemetry is not without cost, as we both need to improve the
library and adjust the way we use metrics in Pulsar with it. The Detailed
Design contains a section detailing exactly what we need to improve in
OpenTelemetry to make sure it fits Apache Pulsar. It mostly consists of making
it performant (almost allocation-free), and small additions like removing the
attribute-set limit per instrument.
+
+The changes we need to make in Pulsar to use it are detailed in this high
level design section below. It’s composed of two big changes: changing all
histograms to be only at namespace level instead of topic level, and switching
from Summary to Histogram (Explicit Bucket).
+
+There will be a sub-PIP detailing exactly how OpenTelemetry will be integrated
into Pulsar, including how users will be able to configure it: their own views,
exporters, etc. We need to see how to support that, given we want to introduce
our own filtering predicate.
+
+Here's a very short *idea* of how the code will look using OpenTelemetry.
In reality, we will use a batch callback and a different design. The sub-PIP will
specify the exact details.
+
+```java
+class TopicInstruments {
+    // ...
+    TopicInstruments(Meter meter) {
+        // Asynchronous counter: the callback reads the current value from each
+        // topic on every collection cycle.
+        meter.counterBuilder("pulsar.messaging.topic.messages.received.size")
+            .setUnit("bytes")
+            .buildWithCallback(observableLongMeasurement -> {
+                for (PersistentTopic topic : getTopics()) {
+                    long size = topic.getMessagesReceivedSize();
+                    observableLongMeasurement.record(size, topic.getAttributes());
+                }
+            });
+    }
+    // ...
+}
+
+
+class PersistentTopic {
+    // ...
+    Attributes topicAttributes;
+    LongAdder messagesReceivedSize = new LongAdder();
+    // ...
+
+    void init(String topicName, String namespaceName) {
+        topicAttributes = Attributes.builder()
+            .put("topic", topicName)
+            .put("namespace", namespaceName)
+            .build();
+    }
+    // ...
+
+    Attributes getAttributes() {
+        return topicAttributes;
+    }
+
+    void messageReceived(long msgSize) {
+        // ...
+        messagesReceivedSize.add(msgSize);
+    }
+
+    long getMessagesReceivedSize() {
+        return messagesReceivedSize.sum();
+    }
+}
+```
+
+## Aggregate and Filtering to solve cardinality issues
+
+### Aggregation
+
+Cardinality issues come specifically from topics, since Pulsar supports up to 1M
topics cluster-wide, roughly translating to 100k topics per single broker. The
best way to solve this is to reduce the cardinality.
+
+We will introduce a new term called Topic Metric Group. Each topic will be
mapped to a single group. The groups are the tool we’ll use to reduce
cardinality to a decent level, roughly 10 - 10k: a cardinality
scale your time series database can handle properly.
+
+For each metric Pulsar currently has at topic level (topic is one of its
attributes), we will create another metric, bearing a different name, which will
have its attributes at the group level (`topicMetricGroup` will be the
attribute, without the `topic` attribute). Each time we record a value in a
topic metric, we’ll also record it in the equivalent Topic Metric Group metric.
For consistency, we will do the same for the namespace level.
+
+Example:
+
+If today you chose topic-level metrics, which means you don’t have namespace-level
metrics, you’ll have the following metric:
+
+```jsx
+pulsar_in_messages_total{topic="orders_company_foobar", namespace="finance",
...}
+```
+
+After the change (including naming changes discussed later), you’ll have:
+
+```jsx
+pulsar_messaging_**namespace**_in_messages_total{namespace="finance"}
+pulsar_messaging_**group**_in_messages_total{topicMetricGroup="orders",
namespace="finance"}
+pulsar_messaging_**topic**_in_messages_total{topic="orders_company_foobar"
topicMetricGroup="orders",
+ namespace="finance"}
+```
+
+There will be a dynamic configuration allowing you to specify the mapping from
topic to topic group in a convenient way. Since there can be many ways to
configure that, we’ll define a plugin interface which will provide that mapping
given a topic, and we’ll provide a default implementation for it. This will
allow advanced users to customize to their needs.
+
+The detailed design section contains a more detailed description of this
mapping configuration. There will be a sub-PIP detailing how the mapping will
be configured, which tools we need to add to allow knowing which topics are in
each group, and more.
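+
+To make the plugin idea concrete, the following is a minimal sketch of what such a mapping interface and a wildcard/regex-based default implementation could look like. The interface name, method signature and the catch-all group are illustrative assumptions; the actual API will be defined in the sub-PIP.
+
+```java
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+// Hypothetical plugin interface: maps a topic name to its Topic Metric Group.
+interface TopicMetricGroupMapper {
+    String groupOf(String topicName);
+}
+
+// Hypothetical default implementation backed by ordered regex rules,
+// e.g. "persistent://finance/orders/.*" -> "orders".
+class RegexTopicMetricGroupMapper implements TopicMetricGroupMapper {
+    private final Map<Pattern, String> rules = new LinkedHashMap<>();
+
+    RegexTopicMetricGroupMapper(Map<String, String> patternToGroup) {
+        patternToGroup.forEach((pattern, group) ->
+                rules.put(Pattern.compile(pattern), group));
+    }
+
+    @Override
+    public String groupOf(String topicName) {
+        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
+            if (rule.getKey().matcher(topicName).matches()) {
+                return rule.getValue();
+            }
+        }
+        // Topics with no matching rule fall into a catch-all group.
+        return "default";
+    }
+}
+```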
+
+### Filtering
+
+The aggregation above using both namespace and Topic Metric Group, reduces the
cardinality. The only thing left for us to do is add a great user experience
tool to control which metrics you’d like to see exported, in fine granularity.
A bit like logging levels, but with more fine-grained controls.
+
+We would like to have a dynamic configuration allowing us to toggle which
metric we wish to keep or drop. For example:
+
+- Keep namespace and topic metric group level metrics, for any namespace, any
group.
+- For any topic, keep only backlog size metric, and drop anything else under
“messaging” scope metrics.
+- For topic “incoming-logs”, keep all of its metrics.
+- For topic group “orders”, keep all metrics at topic level, but drop
producer/consumer level metrics.
+- For topic “billing”, keep all metrics (up to producer/consumer level).
+
+We will introduce a new configuration, containing Filtering Rules, evaluated
in order. Each rule will define a selector allowing you to select (metric,
attributes) pairs based on all sorts of criteria, and then defining actions
such as drop all, keep all, keep only, drop only, determining for each (metric,
attribute) pair if it should be filtered out or not.
+
+This configuration will be dynamic, hence allowing you to build namespace /
group level dashboards to monitor Pulsar. Once you see a group misbehaving, you
can dynamically “open” the filter to allow this group’s topic metric, and you
can decide to only allow the certain metric you suspect is problematic. Once
you find out the topic, you allow (stop dropping) all of its metrics, to enable
you to debug it. Once you’re done, you can roll back to group level metrics.
+
+We want to allow users to customize and build a filter matching their needs,
hence we’ll create a plugin interface, which can decide for a given metric
(name, attributes, unit, etc.) if they want to filter it or not.
+
+The detailed design section will include a more detailed description of the
plugin interface and the default implementation's configuration file. It will also
include an explanation of how we plan to implement this in stages: the 1st stage on
our own and the 2nd stage by introducing a push-down predicate into the OpenTelemetry
MetricReader.
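+
+As an illustration only (the real interface and rule syntax will be defined in the sub-PIP), such a filtering plugin could boil down to a predicate over a metric name and its attribute set; the interface and the metric name used below are hypothetical:
+
+```java
+import io.opentelemetry.api.common.AttributeKey;
+import io.opentelemetry.api.common.Attributes;
+
+// Hypothetical filtering plugin: decides per (metric name, attribute set)
+// whether a data point should be exported or dropped.
+interface MetricFilter {
+    boolean shouldExport(String metricName, Attributes attributes);
+}
+
+// Hypothetical rule: keep everything at namespace/group level, but for
+// topic-level data points keep only a backlog-size metric.
+class ExampleMetricFilter implements MetricFilter {
+    @Override
+    public boolean shouldExport(String metricName, Attributes attributes) {
+        boolean topicLevel = attributes.get(AttributeKey.stringKey("topic")) != null;
+        if (!topicLevel) {
+            return true;
+        }
+        return metricName.equals("pulsar.messaging.topic.backlog.size");
+    }
+}
+```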
+
+### Removing existing metric toggles
+
+The fine-grained filtering mechanism described above makes the configuration
toggles we have today obsolete, hence we will remove them. This includes
`exposeConsumerLevelMetricsInPrometheus`, `exposeTopicLevelMetricsInPrometheus`
and `exposeProducerLevelMetricsInPrometheus`.
+
+### Summary
+
+The aggregation to groups and namespaces, which are of reasonable cardinality,
coupled with the ability to decide, using the filters, which metrics and specific
attribute sets you wish to export, solves the cardinality issue.
Visibility into monitoring data is not sacrificed, since dynamic configuration
allows you to zoom in to get the finer details and later shut it off.
+
+## Changing the way we measure metrics
+
+### Moving topic-level Histograms to namespace and broker level only
+
+Histograms are primarily used to record latencies:
`pulsar_storage_write_latency_le_*`,
`pulsar_storage_ledger_write_latency_le_*`, and `pulsar_compaction_latency_*`
+
+We have several issues with them:
+
+- They cost a lot. Each histogram today translates to 12 unique time series,
which means it costs roughly 12x more than a counter.
+- A broker is mostly a multi-tenant process by design. It’s almost never a
single topic’s fault for latency, and even if it is, other topics will be
affected as well. Contention is mostly on broker CPU and on Bookkeeper,
both of which are shared across topics. Having it at topic level won’t help
diagnose a topic root cause, and may sometimes mislead you.
+
+The other problem we have is the topic move. Topics can move between brokers,
due to automatic load balancing, or an operator decision to unload a topic.
Today, when a topic is unloaded, the broker stops reporting the metrics for it.
+
+In OTel, the API doesn't support `remove()` for an attribute set on an
instrument (counter, histogram, etc.). Normally, if it was supported, we could
have called `remove` for each instrument which contains an attribute-set for
that topic.
+
+OTel has two categories of instruments: synchronous and asynchronous.
Synchronous instruments keep the value in memory and update it
upon recording a new value (behind the scenes, it adds for example +2 to an
`AtomicLong` it maintains for a Counter and the attribute set). Asynchronous
instruments, on the other hand, are defined using a callback. When a call is made
to collect the values for each instrument, a callback is invoked to retrieve
the (attributes, value) pairs. w [...]
+
+Since `remove()` doesn't exist for instruments, async instruments are the
closest thing we have in OTel. Thus Counter and UpDownCounters for topic
instruments can be asynchronous instruments, hence when a topic is unloaded,
the next callback will simply not record their values.
+
+Histograms are problematic in that sense, since they have yet to get an
asynchronous version; they are only synchronous. Hence, if we use them
for topic instruments, used attributes can never be cleared from memory
for them, thus we’ll have an “attributes” (topic) leak, as over time it will only
grow. This is why currently we can’t use them for topics or topic groups.
+
+We have opened an
[issue](https://github.com/open-telemetry/opentelemetry-specification/issues/3062)
and are making progress, but this is a long process.
+
+Putting it all together - cost, confusion, lack of clear use, and lack of
support from OTel - we will change those topic level histograms to be namespace
/ broker level.
+
+## Integrating Messaging metrics into Open Telemetry
+
+As explained before, instruments don’t support `remove()`. Topic, its
subscriptions, producers and consumers - each has its own set of metrics. Also,
each is ephemeral. Topics can move or be deleted, producers and consumers can
stop and start many times, and subscriptions move with topics.
+
+Hence, for topics, producers, consumers and subscriptions (and topic groups) we
will use asynchronous instruments. This means we’ll keep the state using our
own `LongAdder`, `AtomicLong` or primitive long using atomic updater. When
creating the asynchronous instrument, the callback will retrieve the value from
the variable. For example, when a producer is removed, in the next collection
cycle, the callback will not report metrics for it since it doesn't exist, and
thus it will disappear from [...]
+
+OTel has a special batch callback mechanism, allowing you to supply a single
callback for multiple instruments, making it more efficient, and we plan to use
it.
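+
+A minimal sketch of that pattern with the OTel Java API is shown below; the instrument names and the `TopicRegistry`/`TopicStats` helper types are illustrative assumptions, not actual Pulsar classes:
+
+```java
+import io.opentelemetry.api.common.Attributes;
+import io.opentelemetry.api.metrics.Meter;
+import io.opentelemetry.api.metrics.ObservableLongMeasurement;
+
+// Hypothetical helpers standing in for Pulsar's own state (LongAdders etc.).
+interface TopicStats {
+    long messagesIn();
+    long bytesIn();
+    Attributes attributes();
+}
+
+interface TopicRegistry {
+    Iterable<TopicStats> currentTopics();
+}
+
+class TopicBatchInstruments {
+    TopicBatchInstruments(Meter meter, TopicRegistry topics) {
+        ObservableLongMeasurement messagesIn = meter
+                .counterBuilder("pulsar.messaging.topic.messages.in")
+                .buildObserver();
+        ObservableLongMeasurement bytesIn = meter
+                .counterBuilder("pulsar.messaging.topic.bytes.in")
+                .setUnit("bytes")
+                .buildObserver();
+
+        // One callback reports all topic-level values; a topic that no longer
+        // exists is simply not reported, so its attribute set disappears.
+        meter.batchCallback(() -> {
+            for (TopicStats topic : topics.currentTopics()) {
+                messagesIn.record(topic.messagesIn(), topic.attributes());
+                bytesIn.record(topic.bytesIn(), topic.attributes());
+            }
+        }, messagesIn, bytesIn);
+    }
+}
+```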
+
+As explained before, we’ll have instruments per aggregation level: broker,
namespace, topic group, topic, subscription, producer and consumer.
+
+Broker and namespace level will use synchronous instruments, since the number
of namespaces is not expected to have high cardinality, hence not removing them
is not a big attribute leak issue. Asynchronous instruments are also needed as
explained in the “Supporting Admin Rest API Statistics endpoints” section below.
+
+## Switching Summary to Histograms, in namespace/broker level only
+
+Summaries by design can’t be aggregated across topics or hosts (see Background
and Motivation). That is the reason OTel doesn’t support them. I opened an
[issue](https://github.com/open-telemetry/opentelemetry-specification/issues/2704)
for that, but it doesn’t seem like something that can be added to OTel.
+
+Another consideration is CPU. In benchmarks done, it seems that updating the
summaries that are based on Apache DataSketches costs 5% of CPU time, which is
a lot compared with a simple +1 to a counter in a bucket of an explicit bucket
histogram.
+
+Due to those reasons, it makes sense to switch all summaries to histograms
(explicit bucket type).
+
+For the same reasons as histograms, they will be broker/namespace level only
(not topic).
+
+Most summaries are used for latency reporting. For the same multi-tenancy
reasons, and after careful inspection of the existing summaries, we’ve concluded
we can convert them to namespace level. The domains affected by it are:
+
+- LedgerOffloader stats
+- Replication of subscription snapshot
+- Transactions
+ - Transaction Buffer client
+ - Pending Acks
+
+The complete list of which summaries are affected is in the Detailed Design
section.
+
+Pulsar Functions metrics use a summary for user-defined metrics, but we assume
the quantiles are actually meaningless, since some users record the value
“1” just to obtain a count, and some record values to obtain a sum. We will
convert them to a Histogram without buckets, just providing sum and count.
+
+- Each custom metric is actually an attribute set bearing the attribute
`metric={user-defined-name}`
+- We will define that in the init based on instrument name using views.
+
+### Specifying units for Histograms
+
+We’ve inspected our summaries and histograms, and it seems the bucket ranges
are common per unit: ms, seconds and bytes.
+
+OTel has the notion of views. They are created upon init of OTel and can be
used to specify the buckets for a set of instruments based on an instrument
selector (name wildcard, …). We have opened an
[issue](https://github.com/open-telemetry/opentelemetry-specification/issues/3101)
for OTel SDK Specifications, which was merged to allow specifying a unit in an
instrument selector. We only need to implement it in OTel Java SDK (See
[issue](https://github.com/open-telemetry/opentelemetry-java/i [...]
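+
+Until unit-based selection lands in the Java SDK, a sketch of the same idea using name-based selection looks roughly as follows; the instrument name patterns and bucket boundaries are illustrative assumptions:
+
+```java
+import java.util.List;
+
+import io.opentelemetry.sdk.metrics.Aggregation;
+import io.opentelemetry.sdk.metrics.InstrumentSelector;
+import io.opentelemetry.sdk.metrics.SdkMeterProvider;
+import io.opentelemetry.sdk.metrics.View;
+
+class HistogramViewsExample {
+    static SdkMeterProvider buildMeterProvider() {
+        return SdkMeterProvider.builder()
+                // Latency histograms share millisecond-oriented buckets.
+                .registerView(
+                        InstrumentSelector.builder().setName("pulsar.*.latency").build(),
+                        View.builder()
+                                .setAggregation(Aggregation.explicitBucketHistogram(
+                                        List.of(1.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 1000.0)))
+                                .build())
+                // Function user metrics: histogram with no bucket boundaries,
+                // i.e. effectively sum + count only.
+                .registerView(
+                        InstrumentSelector.builder().setName("pulsar.function.user.metric").build(),
+                        View.builder()
+                                .setAggregation(Aggregation.explicitBucketHistogram(List.of()))
+                                .build())
+                .build();
+    }
+}
+```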
+
+### Removing Delta Reset
+
+As described in the motivation and background section, Pulsar resets certain types of metrics at a configurable interval (e.g. 1 min): rates, explicit bucket histograms and summaries.
+
+As also explained, this makes histograms hard to use, and is redundant and less accurate for rates. Most time-series databases can easily calculate rates based on ever-increasing counters.
+
+Hence, in OpenTelemetry we won’t report rates; we’ll switch to counters. That of course means name changes, but as described in this document, all names will be changed anyhow.
+
+For histograms, we will simply never reset them.
+
+Summaries are converted to histograms, so they too are never reset anymore.
+
+We will keep Rates around primarily for the statistics returned through the Admin API (e.g. Topic Stats, …), but they will never be exposed through OpenTelemetry.
+
+It is worth noting that OpenTelemetry supports the notion of Aggregation Temporality. In short, it allows you to define, for a given instrument, whether you want it exported as delta or cumulative. Some exporters, like Prometheus, only support cumulative, and will override it all to be such. OTLP supports delta. Currently, we’re not explicitly supporting configuring views / readers to allow that, but it’s something that can very easily be added in the future by the community. It will be perfect for p [...]
+
+## Reporting topic and partition all the time
+
+Today a topic is reported using the attribute `topic={topicName}`. If the topic is actually a partition of a partitioned topic, it will look like `topic={partitionedTopicName}-partition-{partitionNum}`. There is a configuration named `splitTopicAndPartitionLabelInPrometheus` which makes it be reported instead as `topic={partitionedTopicName}, partition={partitionNum}`.
+
+This is not consistent, in that not all metrics using the `topic` attribute applied the split accordingly.
+
+In OTel metrics we will ignore that flag and always report it as
`topic={partitionedTopicName}, partition={partitionNum}`. We will make it
consistent across any `topic` attribute usage. Eventually this flag will be
removed (probably in the next major version of Pulsar).
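+
+For illustration only (the exact attribute keys will be finalized in a sub-PIP), building such an attribute set could look like this:
+
+```java
+import io.opentelemetry.api.common.AttributeKey;
+import io.opentelemetry.api.common.Attributes;
+
+class TopicAttributesSketch {
+    // Illustrative keys: the partition is always a separate attribute, regardless of
+    // the legacy splitTopicAndPartitionLabelInPrometheus flag.
+    static final AttributeKey<String> TOPIC = AttributeKey.stringKey("topic");
+    static final AttributeKey<Long> PARTITION = AttributeKey.longKey("partition");
+
+    static Attributes of(String partitionedTopicName, int partitionNum) {
+        return Attributes.of(TOPIC, partitionedTopicName, PARTITION, (long) partitionNum);
+    }
+}
+```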
+
+## Metrics Exporting
+
+OTel has a built-in Prometheus `MetricReader` (and exporter) which exposes
`/metrics` endpoint, which we will use.
+
+Pulsar’s current metric system has a caching mechanism for `/metrics` responses. This was developed because creating the response for a high topic count was a CPU hog and in some cases a memory hog. In our case we plan to use filtering and aggregation (Topic Metric Group) to drive the response down to a reasonable size, hence we won’t need to implement that caching in OTel metrics.
+
+OTel also has a built-in OTLP exporter. OTLP is OTel’s efficient protocol, which the OTel Collector supports, as do some vendors. We wish to use it, yet it seems that it is very heavy on memory allocation. Hence, we will need to improve it to make it as allocation-free as possible.
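+
+A hedged sketch of wiring both readers with the OTel Java SDK (the port, endpoint and interval values are illustrative):
+
+```java
+import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
+import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
+import io.opentelemetry.sdk.metrics.SdkMeterProvider;
+import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
+import java.time.Duration;
+
+class MetricsExportSketch {
+    static SdkMeterProvider build() {
+        return SdkMeterProvider.builder()
+            // Prometheus pull: exposes /metrics on the given port.
+            .registerMetricReader(PrometheusHttpServer.builder().setPort(9464).build())
+            // OTLP push: periodically exports to an OTel Collector.
+            .registerMetricReader(
+                PeriodicMetricReader.builder(
+                        OtlpGrpcMetricExporter.builder()
+                            .setEndpoint("http://otel-collector:4317")
+                            .build())
+                    .setInterval(Duration.ofMinutes(1))
+                    .build())
+            .build();
+    }
+}
+```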
+
+## Avoiding static metric registry
+
+OTel supports a static (global) OTel instance, but we will refrain from using it, to make sure test data doesn't leak between tests.
+
+In OTel we will create an instance of `MeterProvider` during Pulsar init. This
object is the factory for `Meter` which by itself is a factory for instruments
(Counter, Histogram, etc.). We will pass along a Pulsar `Meter` instance to
each class that needs to create instruments. The exact details will be detailed
in a sub-PIP of adding OpenTelemetry to Pulsar.
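+
+A minimal sketch of that wiring (class and instrument names are illustrative); the SDK instance is built and passed around explicitly, never registered as the global instance:
+
+```java
+import io.opentelemetry.api.OpenTelemetry;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.Meter;
+import io.opentelemetry.sdk.OpenTelemetrySdk;
+import io.opentelemetry.sdk.metrics.SdkMeterProvider;
+
+class PulsarOtelInitSketch {
+    static OpenTelemetry create(SdkMeterProvider meterProvider) {
+        // Deliberately not calling buildAndRegisterGlobal(): no static/global state.
+        return OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
+    }
+}
+
+// A broker component receives its Meter via the constructor instead of a static registry.
+class ExampleBrokerComponent {
+    private final LongCounter exampleCounter;
+
+    ExampleBrokerComponent(Meter meter) {
+        exampleCounter = meter.counterBuilder("pulsar.broker.example.operations").build(); // illustrative
+    }
+
+    void onOperation() {
+        exampleCounter.add(1);
+    }
+}
+```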
+
+## Metrics documentation
+
+OTel doesn't force you to supply documentation.
+
+We will create a static code analysis rule that fails the build if it finds an instrument created without a description.
+
+We will optionally try to create an automated way to export all metrics metadata - instrument name and description - in an easy-to-read format, to be used for documentation purposes on the Pulsar website.
+
+## Integration with BK
+
+BookKeeper (client and server) has its own custom metrics library. It’s built upon a set of interfaces, and Pulsar’s metric system has an implementation of them.
+
+We will create another implementation to bridge it to OTel.
+
+## Integrating with Pulsar Plugins
+
+We will modify all popular plugin interfaces Pulsar has, such that they will accept an `OpenTelemetry` instance. Plugins can use it to grab the `MeterProvider`, create their own `Meter` with the plugin name and version, and use that to create their own instruments.
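+
+For illustration, a hypothetical plugin could integrate as follows; the scope name and version match the example given in the Detailed Design section, and the instrument name is made up:
+
+```java
+import io.opentelemetry.api.OpenTelemetry;
+import io.opentelemetry.api.metrics.LongCounter;
+import io.opentelemetry.api.metrics.Meter;
+
+class S3OffloaderPluginMetricsSketch {
+    private final LongCounter offloadedLedgers;
+
+    S3OffloaderPluginMetricsSketch(OpenTelemetry openTelemetry) {
+        // The plugin creates its own Meter, named and versioned after itself, so its
+        // instruments are emitted with otel_scope_name / otel_scope_version attributes.
+        Meter meter = openTelemetry.meterBuilder("s3_offloader_plugin")
+            .setInstrumentationVersion("1.2")
+            .build();
+        offloadedLedgers = meter.counterBuilder("pulsar.offloader.ledgers.offloaded").build();
+    }
+
+    void recordOffload() {
+        offloadedLedgers.add(1);
+    }
+}
+```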
+
+A user who decides to turn on OTel metrics will have to verify that all the Pulsar plugins they use have been upgraded to use the modified interface, otherwise those plugins’ metrics will not be exported.
+
+Plugin authors will need to release a new version of their plugin which
implements the new interface and registers the metrics using `OpenTelemetry`.
+
+One big advantage is that using OTel supports plugins running in stand-alone mode. Some plugins have the option to run some of their code outside Pulsar. By using the OTel API, they can integrate either via their own SDK or Pulsar’s SDK (via the `OpenTelemetry` instance).
+
+A detailed list is in the Detailed Design section.
+
+A sub-PIP will be dedicated to integrating with plugins.
+
+## Supporting Admin Rest API Statistics endpoints
+
+Pulsar has several REST API endpoints for retrieving detailed metrics. They expose rates, up-down counters and counters.
+
+OTel instruments don't have methods to retrieve the current value. The only facility exposing that is the `MetricReader`, which reads the entire metric set. Thus, any metric that is also exposed through the Admin REST API will have to have its state maintained by Pulsar, either using `LongAdder`, `AtomicLong` or a primitive long with an atomic updater. The matching OTel instrument will be defined as an asynchronous instrument, meaning it is defined by supplying a callback function that will be executed t [...]
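+
+As an illustrative sketch (names are made up), the same Pulsar-owned counter can back both the Admin REST stats response and the asynchronous OTel instrument:
+
+```java
+import io.opentelemetry.api.common.Attributes;
+import io.opentelemetry.api.metrics.Meter;
+import java.util.concurrent.atomic.LongAdder;
+
+class SubscriptionStatsSketch {
+    private final LongAdder msgBacklog = new LongAdder();
+
+    SubscriptionStatsSketch(Meter meter, Attributes attrs) {
+        // Asynchronous instrument: the callback reads the Pulsar-owned state at collection time.
+        meter.upDownCounterBuilder("pulsar.messaging.subscription.backlog") // illustrative name
+            .buildWithCallback(measurement -> measurement.record(msgBacklog.sum(), attrs));
+    }
+
+    void incrementBacklog() {
+        msgBacklog.increment();
+    }
+
+    long getMsgBacklog() { // read directly for the Admin REST API stats response
+        return msgBacklog.sum();
+    }
+}
+```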
+
+## Fixing Rate
+
+We will change `Rate` in such a way that the user creating it won’t be required to call `reset()` periodically and manually. All rates will be created via a manager class of some sort, and it will be the one responsible for scheduling resets. Rates will always expose a sum counter and a count counter, to save on multiple variables. A sub-PIP will explain in detail how this will be achieved (a rough sketch follows).
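+
+A very rough, hypothetical sketch of such a manager; `ResettableRate` stands in for Pulsar’s `Rate`, and the real design will be specified in the sub-PIP:
+
+```java
+import java.util.List;
+import java.util.concurrent.CopyOnWriteArrayList;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+
+// Stand-in for Pulsar's Rate: only the part the manager needs is shown.
+interface ResettableRate {
+    void reset();
+}
+
+class RateManagerSketch implements AutoCloseable {
+    private final List<ResettableRate> rates = new CopyOnWriteArrayList<>();
+    private final ScheduledExecutorService scheduler =
+        Executors.newSingleThreadScheduledExecutor();
+
+    RateManagerSketch(long resetIntervalSeconds) {
+        // The manager owns the single reset schedule; callers never call reset() themselves.
+        scheduler.scheduleAtFixedRate(() -> rates.forEach(ResettableRate::reset),
+            resetIntervalSeconds, resetIntervalSeconds, TimeUnit.SECONDS);
+    }
+
+    <T extends ResettableRate> T register(T rate) {
+        rates.add(rate);
+        return rate;
+    }
+
+    @Override
+    public void close() {
+        scheduler.shutdownNow();
+    }
+}
+```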
+
+## Function metrics
+
+### Background
+
+Pulsar supports the notion of Pulsar Functions. These are user-supplied functions, written in Go, Java or Python, which can read and write messages. You can use them to read all messages from a topic and write them to an external system like S3 (sink), or the other way around: read from an external system and write into a topic (e.g. a DB change log into a Pulsar topic - source). The other option is simply transforming the received message and writing it to another topic.
+
+A user can submit a function, and can also configure the number of instances it will have.
+
+The code responsible for coordinating the execution of those functions is located in a component called the Function Worker, which can run as a stand-alone process or as part of the Pulsar process. You can run many Function Workers, yet only one acts as the leader.
+
+The leader takes care of splitting the work of executing the function instances among the different Function Workers (for load balancing purposes).
+
+The Function worker has three runtimes, as in, three options to execute the
functions it is in charge of:
+
+1. Thread: Creating a new thread and running the function in it.
+2. Process: Creating a new process and running the function in it.
+3. Kubernetes: Creating a Deployment for each function.
+
+Each Function Worker has its own metrics it exposes.
+
+Each Function has its own metrics it exposes.
+
+In the Thread runtime, all metrics are funneled into Pulsar metrics (exposing
a method which writes them into the `SimpleTextOutputFormat`).
+
+In the Process runtime, the function is executed in its own process, hence there is a wrapper main() function executing the user-supplied function in the process. This main() function has general function execution metrics (e.g. how many messages were received, etc.), and also the function metrics (user custom metrics). All the metrics are defined using a single library: the Prometheus client. The metrics are exposed using the client’s HTTP server exposing a `/metrics` endpoint exposing the two cat [...]
+
+In the Kubernetes runtime, the pod is annotated with the Prometheus Operator annotation, and it is expected that the operator is installed, hence Prometheus will scrape those metrics directly from the process running in the pod.
+
+In the Process runtime, the Function Worker iterates over all processes it launched, and for each it issues a GET request to `/metrics`. The responses are concatenated together and printed to `SimpleTextOutputStream`.
+
+The general function execution metrics come in two forms: cumulative and 1-min ones. The latter’s names end with `_1min`, and they get reset every minute (e.g. `*_received_1min`).
+
+Each process launched also launches a gRPC server supporting commands. Two of those are related to metrics: `resetMetrics` and `getAndResetMetrics`. They reset all metrics - both the general framework ones and the custom user ones.
+
+The Prometheus client was also configured to emit several metrics using built-in exporters: memory pools, JMX metrics, etc.
+
+### Solving it in Open Telemetry
+
+In phase 1, we’ll keep the reporting of user-defined metrics for function authors as is, and mainly focus on the other issues, which are: metrics scraping for each runtime, and the 1-min metrics. In phase 2, we’ll also add the option for Pulsar Function authors to define metrics via OTel.
+
+### Collecting metrics
+
+**Thread Runtime**
+
+The framework will use the Pulsar OTel SDK, or its stand-alone Function Worker SDK; thus, whatever export method that SDK uses will be used here as well (Prometheus, OTLP, …).
+
+**Kubernetes Runtime**
+
+OTel supports exporting metrics via a `/metrics` endpoint using the Prometheus format. We’ll support the same as is done today with the Prometheus client.
+
+We’ll also support configuring the pod so it can send metrics via OTLP to a defined destination.
+
+**Process Runtime**
+
+The existing `/metrics` scraping solution was not good:
+
+- It violated the Prometheus format, since lines with the same name but different attributes must appear one after another, while in reality the responses were concatenated as-is.
+- The Prometheus exporter metrics, like memory pools, didn’t have a unique attribute per process, so when concatenated they would have the same name and same attributes but different values from the different processes, and hence be lost.
+
+OTel supports both exporting metrics as Prometheus and pushing them using OTLP. Making the Function Worker pull the metrics - in effect, be the hub - is super complicated. It’s much easier to simply let the processes be scraped or push OTLP metrics, configured the same way Pulsar is.
+
+At phase 1 we won’t support Process Runtime.
+
+At phase 2, we can use [Prometheus HTTP Service Discovery](https://prometheus.io/docs/prometheus/latest/http_sd/), and expose such an endpoint in the Function Worker leader. Via the health pings it gets from each worker, workers can also report each process’s metrics port, thus allowing Prometheus to scrape the metrics directly from each process. We’ll garner feedback from the community to see how important the Process runtime is, as we have the K8s runtime which is much more robust.
+
+### Removing 1min metrics
+
+We’ll not define any 1-min metrics. Any TSDB can calculate such rates from the cumulative counters.
+
+### Supporting `getMetrics` RPC
+
+We can define our own `MetricReader`, whose output we can filter to return the same metrics as we return today: general function metrics and user-defined metrics (see the sketch below).
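+
+A hedged sketch of such a reader, assuming a recent OTel Java SDK in which `CollectionRegistration` exposes `collectAllMetrics()`; the name prefix used for filtering is illustrative:
+
+```java
+import io.opentelemetry.sdk.common.CompletableResultCode;
+import io.opentelemetry.sdk.metrics.InstrumentType;
+import io.opentelemetry.sdk.metrics.data.AggregationTemporality;
+import io.opentelemetry.sdk.metrics.data.MetricData;
+import io.opentelemetry.sdk.metrics.export.CollectionRegistration;
+import io.opentelemetry.sdk.metrics.export.MetricReader;
+import java.util.List;
+import java.util.stream.Collectors;
+
+// On-demand reader: collects from the SDK and keeps only the metrics the RPC should return.
+class FunctionMetricsReaderSketch implements MetricReader {
+    private volatile CollectionRegistration registration = CollectionRegistration.noop();
+
+    @Override
+    public void register(CollectionRegistration registration) {
+        this.registration = registration;
+    }
+
+    List<MetricData> collectFunctionMetrics() {
+        return registration.collectAllMetrics().stream()
+            .filter(metric -> metric.getName().startsWith("pulsar.function."))
+            .collect(Collectors.toList());
+    }
+
+    @Override
+    public AggregationTemporality getAggregationTemporality(InstrumentType instrumentType) {
+        return AggregationTemporality.CUMULATIVE;
+    }
+
+    @Override
+    public CompletableResultCode forceFlush() {
+        return CompletableResultCode.ofSuccess();
+    }
+
+    @Override
+    public CompletableResultCode shutdown() {
+        return CompletableResultCode.ofSuccess();
+    }
+}
+```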
+
+### Removing `resetMetrics` RPC
+
+OTel doesn't support metric resets, and resetting also violates Prometheus conventions, which expect metrics to be cumulative. Thus, we will remove that method.
+
+### Supporting Python and Go Functions
+
+OTel has SDKs for Python and Go, thus we’ll use them to export metrics.
+
+### Summary
+
+A sub-PIP will be created for Function Metrics, which will include a detailed design.
+
+# Detailed Design
+
+* [Topic Metric Group configuration](#topic-metric-group-configuration)
+* [Integration with Pulsar Plugins](#integration-with-pulsar-plugins)
+* [Why OpenTelemetry?](#why-opentelemetry)
+* [What we need to fix in OpenTelemetry](#what-we-need-to-fix-in-opentelemetry)
+* [Specifying units for histograms](#specifying-units-for-histograms-1)
+* [Filtering Configuration](#filtering-configuration)
+* [Which summaries / histograms are affected?](#which-summaries--histograms-are-affected)
+* [Fixing Grafana Dashboards in
repository](#fixing-grafana-dashboards-in-repository)
+
+## Topic Metric Group configuration
+
+As mentioned in the high level design section, we’ll have a plugin interface, allowing multiple implementations and customizing the way a topic is mapped to a group.
+
+The default implementation this PIP will provide will be rule-based, as described below.
+
+We’ll have a configuration that can look something like this:
+
+```hocon
+bi-data {               // group name
+  namespace = "bi-ns"   // condition of the form: attribute name = expression
+  topic = "bi-*"        // condition of the form: attribute name = expression
+}
+
+incoming-logs {
+  namespace = "*"
+  topic = "incoming-logs-*"
+}
+```
+
+The configuration will contain a list of rules. Each rule begins with a group name, followed by a list of matchers: one for namespace and one for topic. The rules will be evaluated in order, and once a topic is matched, we stop iterating over the rules (see the sketch below).
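+
+A hypothetical sketch of the mapper plugin interface and its rule-based default implementation (wildcards are assumed to be pre-compiled into regular expressions):
+
+```java
+import java.util.List;
+import java.util.Optional;
+import java.util.regex.Pattern;
+
+// Hypothetical shape of the pluggable mapper.
+interface TopicMetricGroupMapper {
+    Optional<String> groupOf(String namespace, String topic);
+}
+
+class RuleBasedMapperSketch implements TopicMetricGroupMapper {
+    record Rule(String group, Pattern namespace, Pattern topic) {}
+
+    private final List<Rule> rules; // kept in configuration order
+
+    RuleBasedMapperSketch(List<Rule> rules) {
+        this.rules = rules;
+    }
+
+    @Override
+    public Optional<String> groupOf(String namespace, String topic) {
+        // First matching rule wins; later rules are not evaluated.
+        return rules.stream()
+            .filter(rule -> rule.namespace().matcher(namespace).matches()
+                         && rule.topic().matcher(topic).matches())
+            .map(Rule::group)
+            .findFirst();
+    }
+}
+```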
+
+There will be a sub-PIP detailing the plugin and default implementation in
fine-grained detail: where it will be stored, how it will support changing it
dynamically, performance, etc.
+
+## Integration with Pulsar Plugins
+
+This is the list of plugins Pulsar currently uses:
+
+- `AdditionalServlet`
+- `AdditionalServletWithPulsarService` - we can use `PulsarService` to
integrate
+- `EntryFilter` - need to add `init` method to supply OTel
+- `DelayedDeliveryTrackerFactory` - accepts PulsarService
+- `TopicFactory` - accepts BrokerService
+- `ResourceUsageTransportManager` - need to add `init` method
+- `AuthenticationProvider` - need to add parameter to `initialize` method
+- `ModularLoadManager` - accepts `PulsarService`
+- `SchemaStorageFactory` - accepts `PulsarService`
+- `JvmGCMetricsLogger` - need to add init().
+- `TransactionMetadataStoreProvider` - need to add init
+- `TransactionBufferProvider` - need to add init
+- `TransactionPendingAckStoreProvider` - need to add init
+- `LedgerOffloader` - need to add init()
+- `AuthorizationProvider` - need to add parameter to `initialize` method
+- `WorkerService` - need to modify init methods
+- `PackageStorageProvider`
+- `ProtocolHandler` - need to modify init method
+- `BrokerInterceptor` - accepts PulsarService
+- `BrokerEntryMetadataInterceptor`
+
+In a sub-PIP we will consider how to update those interfaces to effectively allow passing an `OpenTelemetry` instance, so they can create their own Meter, or have access to an auxiliary class and supply only a Pulsar meter. Note that each Meter has its own name and version, and those are emitted as two additional attributes - e.g. `{..., otel_scope_name="s3_offloader_plugin" otel_scope_version="1.2", ...}` - to avoid any metric name collision.
+
+## Why OpenTelemetry?
+
+### What’s good about OTel?
+
+- It’s the new emerging industry-wide standard for observability, and specifically metrics, as opposed to just a library or a standard adopted and promoted by a single entity/company.
+- It’s much more sophisticated than the other libraries
+ - OTel has the ability to change instruments by overriding their initial
definition. For example, a Pulsar operator can change buckets of a histogram
for a given instrument, reduce attributes if needed, or even copy-paste an
instrument, changing its bucket while maintaining the original one if needed.
This feature is called a View.
+ - Its API is very clear. For example, a gauge can not be aggregated (i.e.,
CPU Usage), while UpDownCounter can (number of jobs in a queue).
+ - Using OpenTelemetry Logs and Traces will allow sharing of context
between them, making using Pulsar telemetry more powerful and helpful (Out of
scope for this PIP, but possible)
+ - Using an industry-standard API means when in the future libraries will
accept `OpenTelemetry` interface for reporting traces/metrics/logs, the
integration of it will not require any special development efforts.
+ - Industry-standard also means when new developers onboard, they don’t
need to learn something new
+ - The SDK is still in the adoption/building phase, so they are more
receptive to accepting changes from the community relative to other libraries
(This was quite evident from issues I’ve opened that got fixed, community
meetings attended, and brainstorming sessions held with maintainers)
+ - Its design is the most elegant and correct compared to all other
libraries (IMO). The idea of each instrument having an interchangeable
aggregation, which is also how they implemented it, is smart. The same goes for
Reader and Exporter separation and Views.
+ - It has support to decide if one metric or a family of them will be delta
or cumulative. For Elasticsearch/OpenSearch users, it’s super powerful, as it
allows them to create the same metrics with different names containing delta
values and then feed only them to Elastic using the OTel Collector
+ - Its protocol is much more efficient than other protocols (i.e.,
Prometheus text exposition format)
+ - The library allows exporting the metrics as Prometheus and OTLP (push to
OTel Collector) and it’s extendable by design
+ - It has same API and implementation design for Python and Go, which we
also need to support for the wrapper code running Pulsar Functions.
+
+ ### Why not other libraries?
+
+ Below I will list the libraries I found and why I don’t think they are
suitable.
+
+ - Micrometer
+ - Micrometer had the vision of becoming the industry standard API like
SLF4J is for logging in the Java ecosystem. In reality, it didn't catch on, as
can be seen in the Maven Central statistics: It’s used by ~1000 artifacts,
compared to `slf4j-api`, which is used by 60k artifacts; as such, picking it as
the standard for today, seems like “betting” on the wrong project.
+ - Micrometer architecture relies heavily on the library to implement
all target systems like Datadog, Prometheus, Graphite, OTLP, and more. OTel
relies on the collector to implement that as it has more power and can contain
the state if one of those systems goes down for some time. I think it’s a
smarter choice, and more vendors will likely appear and maintain their exporter
in OTel collector as we advance. This makes it easier for operators to have one
exporter code base (say to [...]
+ - OTel was built with instrumentation scope in mind, which gives a
sort of namespace per library or section of the code (Called Meter in the API).
For Pulsar, it can be used to have one per plugin. Micrometer doesn't have that
notion. It’s great especially if Pulsar and another plugin are using same
library (e.g. Caffeine for caching), thus in Prometheus or other libraries the
metrics will override each other, but in OTel the meter provides an attribute
for name and version, thus [...]
+ - OTel by design has an instrument that you report measurements for a
given attribute set, meaning it has that design of `instrument =
map(attributes→values)`. In Micrometer, it’s designed in a way that each
`(instrument, attributes)` is a metric on its own. Less elegant and more
confusing.
+ - Most innovations are likely to happen in the “new kids on the
block,” which is OTel.
+ - Dropwizard Metrics (previously Codahale)
+ - Doesn't support different attributes per instrument (no tag
support). It was slated for Dropwizard 5.x, but there is no maintainer available to work on it, which is a problem on its own.
+ - Prometheus Client
+ - Currently, prometheus allocates all needed memory per collection.
For a large amount of topics, this is a substantial performance issue. We tried
conversing with them and pitched an observer pattern. They objected to the idea
and wanted benchmark proof. The maintainer thinks it has added complexity. See
[here](https://github.com/prometheus/client_java/pull/788#issuecomment-1179611397).
In OTel they were happy to brainstorm the problem via GitHub issue, their
weekly calls and pr [...]
+ - Only the Prometheus format is supported for export. OTLP is a more
compact protocol since it packs all the buckets as a map of bucket numbers to
their value instead of carrying all the labels for each bucket as the
Prometheus client.
+ - The library doesn't have the notion of different exporters as it was
geared to export only to Prometheus.
+ - No integration with Logs or Traces which will be needed in the
future.
+
+## What we need to fix in OpenTelemetry
+
+- Performance
+ - Once we have 1M topics per broker, each topic producing ~70 metric data
points (that’s in a super relaxed assumption: we have one producer and one
consumer), we’re talking about 70M metric data points.
+ - The `MetricReader` interface and the `MetricsExporter` interfaces
were designed to receive the metrics collected from memory by the SDK using a
list, and for each collection cycle, an allocation of at least 70M objects
(Metric data points).
+ - The OTLP exporter specifically serializes the data points to
protobuf by creating a Marshaller object per each piece of data in the data
point, so 10-30 times 70M metric data points, which are objects to be garbage
collected.
+ - I have opened an issue starting discussion on trying to solve that:
https://github.com/open-telemetry/opentelemetry-java/issues/5105
+ - After discussion, the maintainer suggested object re-use per
attribute set. He already implemented over 60% of the code needed to support
it. We need to help the project finish it, as detailed in the issue description (mainly the collection path should be allocation-free, including exporters).
+- There is a non-configurable hard limit of 2000 attributes per instrument
+ - There is a
[PR](https://github.com/open-telemetry/opentelemetry-specification/pull/2960)
to the specifications to allow configuring that limit
+ - Once the spec is approved, OTel Java SDK must also be amended.
+- Supporting push-down predicates to filter (instrument, attribute) pairs in `MetricsProducer`
+ - https://github.com/open-telemetry/opentelemetry-specification/issues/3324
+ - This is needed for us to have performant filtering.
+- Fix bug: https://github.com/open-telemetry/opentelemetry-java/issues/4901
+
+**Nice to have**
+
+- Ability to remove attributes from an instrument
+ - Issue we opened:
https://github.com/open-telemetry/opentelemetry-specification/issues/3062
+ - This will allow us to use histograms if needed on dimensions such as
topic and topic group.
+
+**Issues completed while writing the design**
+
+- Add ability to specify histogram buckets while creating an instrument
+ -
[https://github.com/open-telemetry/opentelemetry-specification/issues/2229](https://github.com/open-telemetry/opentelemetry-specification/issues/2229)
+- Add ability to specify units as instrument selector in a view
+ - Issue we have added to add it in spec:
https://github.com/open-telemetry/opentelemetry-specification/issues/3101
+ - This was implemented by the maintainers in the Java SDK
+
+## Specifying units for histograms
+
+We have two ways to do that:
+
+1. Since all latency histograms with the same unit (milliseconds) share the same buckets, we can use a view, select all instruments in the Pulsar Meter which have a unit of milliseconds, and specify the buckets there (see the sketch after this list).
+2. Specify buckets using the newly added hints, at instrument creation. This requires creating a constant and re-using it across all histograms.
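+
+The following is a hedged sketch of option 1, assuming an SDK version where `InstrumentSelector` can select by unit (as tracked in the issues above); the name pattern and bucket boundaries are illustrative:
+
+```java
+import io.opentelemetry.sdk.metrics.Aggregation;
+import io.opentelemetry.sdk.metrics.InstrumentSelector;
+import io.opentelemetry.sdk.metrics.SdkMeterProvider;
+import io.opentelemetry.sdk.metrics.View;
+import java.util.List;
+
+class LatencyBucketsViewSketch {
+    // Selects every pulsar.* instrument whose unit is "ms" and applies shared buckets.
+    static SdkMeterProvider build() {
+        return SdkMeterProvider.builder()
+            .registerView(
+                InstrumentSelector.builder().setName("pulsar.*").setUnit("ms").build(),
+                View.builder()
+                    .setAggregation(Aggregation.explicitBucketHistogram(
+                        List.of(0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0, 1_000.0)))
+                    .build())
+            .build();
+    }
+}
+```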
+
+## Filtering Configuration
+
+As mentioned in the high level design, we will define an interface allowing multiple built-in and also custom implementations of filtering. The interface will determine, for each metric data point, whether it will be filtered or not. The data point is composed of: instrument name, attributes, unit and type (a sketch of such an interface follows).
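+
+A hypothetical shape of that interface (the name and method signature are illustrative):
+
+```java
+import io.opentelemetry.api.common.Attributes;
+import io.opentelemetry.sdk.metrics.InstrumentType;
+
+// Hypothetical: decides, per metric data point, whether it is kept or dropped.
+public interface MetricDataPointFilter {
+    boolean shouldKeep(String instrumentName, Attributes attributes, String unit, InstrumentType type);
+}
+```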
+
+Our default implementation will be rule-based, and will have the following configuration. I used [HOCON](https://github.com/lightbend/config/blob/main/HOCON.md) here to make it less verbose; the exact syntax will be determined in a sub-PIP. The configuration will be dynamic, allowing operators to change it at runtime. The exact mechanisms will be detailed in the sub-PIP.
+
+```hocon
+rules {
+ // All instruments starting with pulsar_, with a topic attribute
+ // will be dropped by default. This will keep only topicMetricGroup and
namespace level.
+ default {
+ instrumentSelect = "pulsar_*"
+ attrSelect {
+ topic = "*"
+ }
+ filterInstruments {
+ dropAll = true
+ }
+ }
+
+ // single topic, highest granularity
+ bi-data {
+ instrumentSelect = "pulsar_*"
+ attrSelect {
+ topicGroup = "bi-data"
+ }
+ filterInstruments {
+ keepAll = true
+ }
+ }
+
+ // multiple topics, highest granularity, only metrics I need
+ receipts {
+ instrumentSelect = "pulsar_*"
+ attrSelect {
+ topic = "receipts-us-*"
+ }
+ filterInstruments {
+ keepOnly = ["pulsar_rate_*"]
+ }
+ }
+
+ // single topic, don't want subscriptions and consumer level
+ logs {
+ instrumentSelect = "pulsar_*"
+ attrSelect {
+ topic = "logs"
+ }
+ filterInstruments {
+ dropOnly = ["pulsar_subscription_*", "pulsar_consumer_*"]
+ }
+ }
+}
+```
+
+The configuration is made up of a list of “filtering rules”. Each rule
stipulates the following:
+
+- A name, for documentation purposes.
+- `instrumentSelect` - ability to select one or more instruments to apply this filtering rule to.
+ - Example: `pulsar_*`
+- `attrSelect` - ability to select a group of attributes to apply this rule
on, within the selected instruments.
+ - Example `topic=receipt-us-*`
+- `filterInstruments` - For each instrument matched, we allow setting whether we wish to drop the attributes selected for certain instruments or keep them.
+ - Either:
+ - `dropOnly` - list the instruments we wish to drop selected
attributes for. Example: if `instrumentSelect` is `pulsar_*` , `attrSelect` is
`topic="incoming-logs"` and drop is `pulsar_subscription_*` and
`pulsar_consumer_*` this means all messaging instruments in the topic level
will remain and subscription and consumer level metrics will be dropped, **only
for “incoming-logs”** topic.
+ - `keepOnly` - The opposite of drop.
+ - `dropAll`
+ - `keepAll`
+
+Order matters for the list of filtering rules. This allows us to set a default which applies to a wide range of instruments and then override it for certain instruments.
+
+We will supply a default filtering rules configuration which should make sense.
+
+In certain cases the number of rules can reach 10k or 20k. Since each rule is essentially a regular expression, we will need to cache the resolution of (instrument, attributes) to true/false. For 10k rules, meaning 10k regular expressions, this might be too much, so we can use the [Aho-Corasick](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) algorithm to match only a select few regular expressions (see the caching sketch below).
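+
+For illustration, the caching can be a simple memoization around the hypothetical filter interface sketched earlier, so the rule engine (regular expressions, optionally pre-filtered with Aho-Corasick) runs only once per distinct pair:
+
+```java
+import io.opentelemetry.api.common.Attributes;
+import io.opentelemetry.sdk.metrics.InstrumentType;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+class CachingMetricFilterSketch {
+    private final MetricDataPointFilter delegate; // the rule-based filter
+    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
+
+    CachingMetricFilterSketch(MetricDataPointFilter delegate) {
+        this.delegate = delegate;
+    }
+
+    boolean shouldKeep(String instrument, Attributes attributes, String unit, InstrumentType type) {
+        // Cache key: the (instrument, attributes) pair; this is a sketch, a real
+        // implementation would use a proper composite key and bound the cache size.
+        return cache.computeIfAbsent(instrument + "|" + attributes,
+            key -> delegate.shouldKeep(instrument, attributes, unit, type));
+    }
+}
+```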
+
+We have proposed an issue in the OpenTelemetry Java SDK to have a push-down predicate, allowing us to do the filtering while iterating over instruments and their attributes. It means the OTel SDK will not allocate data point objects if a certain (instrument, attributes) pair is filtered out. See the [section above](#what-we-need-to-fix-in-opentelemetry) on what we wish to fix in OTel for the issue link.
+
+There will be a sub-PIP detailing the exact configuration syntax, storage, how it will be made dynamic, plugin details and more. We will take into account the user experience of defining and updating that configuration, especially once it reaches a large size. Also, we want the experience of adding a rule to be smooth, so the CLI should offer auto-complete of metric names if possible, and optionally there will be a UI dedicated to it. We will also try to protect and validate the filtering rules [...]
+
+
+## Which summaries / histograms are affected?
+
+As mentioned in the design, summaries will be converted into histograms, and
those and existing histograms will be modified to be at the granularity level
of namespace.
+
+### Summaries
+
+Prometheus Client
+
+- `LedgerOffloaderStatsImpl`
+ - `brk_ledgeroffloader_read_offload_index_latency`
+ - `brk_ledgeroffloader_read_offload_data_latency`
+ - `brk_ledgeroffloader_read_ledger_latency`
+ - We need to change to be reported per namespace, not per topic
+- `ResourceGroupService`
+ - `pulsar_resource_group_aggregate_usage_secs`
+ - `Time required to aggregate usage of all resource groups, in
seconds.`
+ - `pulsar_resource_group_calculate_quota_secs`
+ - `Time required to calculate quota of all resource groups, in seconds`
+ - Not high cardinality, no need to modify aggregation level.
+- `ReplicatedSubscriptionsSnapshotBuilder`
+ - `pulsar_replicated_subscriptions_snapshot_ms`
+ - `Time taken to create a consistent snapshot across clusters`
+ - Can be NS level
+- `SchemaRegistryStats`
+ - `pulsar_schema_del_ops_latency`
+ - `pulsar_schema_get_ops_latency`
+ - `pulsar_schema_put_ops_latency`
+- `BrokerOperabilityMetrics`
+ - `topic_load_times`
+ - in milliseconds
+- `TransactionBufferClientStatsImpl`
+ - `pulsar_txn_tb_client_abort_latency`
+ - `pulsar_txn_tb_client_commit_latency`
+ - Change it to be NS level, and not topic level.
+- `PendingAckHandleStatsImpl`
+ - `pulsar_txn_tp_commit_latency`
+- `ContextImpl`
+ - `pulsar_function_user_metric_`*
+ - not high cardinality, and each process runs only one function , so
we can use histograms freely.
+- `FunctionStatsManager`
+  - `pulsar_function_process_latency_ms`
+  - `pulsar_function_process_latency_ms_1min`
+ - we’ll remove the 1min ones
+ - not high cardinality, and each process runs only one function , so
we can use histograms freely
+- `WorkerStatsManager`
+ - `pulsar_function_worker_start_up_time_ms`
+ - Shouldn’t be a summary in the first place as it is init once
+ - `schedule_execution_time_total_ms`
+ - 6 more of those
+
+Our Summary
+
+- `ModularLoadManagerImpl`
+ - `pulsar_broker_load_manager_bundle_assigment`
+ - `pulsar_broker_lookup`
+- `AbstractTopic`
+ - `pulsar_broker_publish_latency`
+ - broker level
+
+OpsStatLogger (uses DataSketches)
+
+- `PulsarZooKeeperClient`
+ - one instance per action running against ZK
+- Bookkeeper client metrics
+ - About 10 operation latencies
+
+### Histograms
+
+StatsBucket (Our version of Explicit Bucket Histogram)
+
+- `ManagedLedgerMBeanImpl`
+ - `pulsar_storage_write_latency_le_*`
+ - `pulsar_storage_ledger_write_latency_le_*`
+ - `pulsar_entry_size_le_*`
+ - all above are topic level
+- `CompactionRecord`
+ - `pulsar_compaction_latency_*`
+ - topic level
+- `TransactionMetadataStoreStats`
+ - `pulsar_txn_execution_latency_le_`
+ - labels: cluster, coordinatorId
+
+## Fixing Grafana Dashboards in repository
+
+The Pulsar repository contains multiple dashboards. We will create the same dashboards using new names, alongside the existing ones. We’ll add dashboards for different granularity levels: namespace, group, and a zoom-in on a specific group/topic.
+The dashboards will look the same as the existing ones, since the main change is to each panel’s query (the metric names have changed, not the semantics).
+
+This will be specified in a sub-PIP.
+
+# Backward Compatibility
+
+## Breaking changes
+
+All changes reported here apply to the newly added OTel metrics layer. At first, as mentioned in the document, we’ll be able to use both the existing metric system and the OTel metric system - you can toggle each. Once everything is stabilized, we’ll deprecate the current metric system.
+
+- Names
+ - Attribute names will use OTel semantic conventions as much as possible
and also [Attribute Naming
guide](https://opentelemetry.io/docs/reference/specification/common/attribute-naming/).
+ - Instrument names will follow guidelines mentioned in [OTel Metrics
Semantic
Conventions](https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/)
+ - Histograms will not encode the bucket range in the instrument name
+ - Each domain will have a proper prefix in the instrument name. Biggest
example is messaging related metrics which today are prefixed with `pulsar_`
but should be `pulsar_messaging_`.
+- Summary metrics are changed to be histogram metrics
+- Histograms are changed from topic level to namespace level.
+- The following configuration flags will be deprecated and eventually removed:
`exposeConsumerLevelMetricsInPrometheus`, `exposeTopicLevelMetricsInPrometheus`
and `exposeProducerLevelMetricsInPrometheus`.
+- Counters / Histograms / Gauges will no longer be reset, but will be cumulative (the user will have the option to modify them to delta temporality per their needs, for backends which support it, via OpenTelemetry views).
+- `topic` will not contain the partition number; it will be reported in a separate `partition` attribute.
+- Configuration `splitTopicAndPartitionLabelInPrometheus` will be deprecated
and eventually removed.
+- Most Pulsar plugins will be modified to allow reporting metrics to OTel via a special object Pulsar will supply. Any plugin not using it will not have its metrics reported to OTel.
+- At phase 1 we won’t support the Process Runtime in OTel metrics, only Thread and Kubernetes. If the community asks for it and the discussion yields it as a must, we’ll add it.
+- All `_1min` metrics are removed from Pulsar Function metrics
+- `resetMetrics` RPC operation in processes running Pulsar Function will be
removed
+- User defined metrics in Pulsar Functions will be reported as 0-bucket
histogram (offering count and sum), instead of Summary.
+
+# API Changes
+
+The sub-PIPs will specify the exact API changes relevant to their constrained
scope, since it will be much easier to review as such.
+
+# Security
+
+The sub-PIPs will specify the exact security concerns relevant to their
constrained scope, since it will be much easier to review as such.
+
+# Links
+
+* Mailing List discussion thread:
https://lists.apache.org/thread/83g3l8doy3hj4ytm36k63z9xv8nj039x
+* Mailing List voting thread:
https://lists.apache.org/thread/m5k8hj874nkjx1vh0s6lwvhs7q7rgj6x