[GitHub] [pulsar] michaeljmarshall opened a new pull request, #17419: [fix][broker] Remove timestamp from Promtheus metrics

GitBox Thu, 01 Sep 2022 20:58:48 -0700


michaeljmarshall opened a new pull request, #17419:
URL: https://github.com/apache/pulsar/pull/17419

### Motivation

When a Pulsar topic is unloaded from a broker, certain metrics related to
that topic will appear to remain active for the broker for 5 minutes. This is
confusing for troubleshooting because it makes the topic appear to be owned by
multiple brokers for a short period of time. See below for a way to reproduce
this behavior.

In order to solve this "zombie" metric problem, I propose we remove the
timestamps that get exported with each Prometheus metric served by the broker.

### Analysis

Since we introduced Prometheus metrics in #294, we have exported a timestamp
along with most metrics. This is an optional, valid part of the spec defined
[here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information).
However, after our adoption of Prometheus metrics, the Prometheus project
released version 2.0 with a significant improvement to its concept of
staleness. In short, before 2.0, a metric that was in the last scrape but not
the next one (this often happens for topics that are unloaded) will essentially
inherit the most recent value for the last 5 minute window. If there isn't one
in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting
in 2.0, there was new logic to consider a value stale the very first time that
it is not reported in a scrape. Importantly, this new behavior is only
available if you do not export timestamps with metrics, as documented here:
https://prometheus.io/docs/prometheus/latest/
querying/basics/#staleness. We want to use the new behavior because it gives
better insight into all topic metrics, which are subject to move between
brokers at any time.

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck
https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document
the feature in detail. This blog post was also helpful:
https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one
https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use
case. Most of the users who think the need it should actually not do
it.
>
> The main usecases within that tiny niche are federation and mirroring
the data from another monitoring system.

As such, I think we are not a niche use case, and we should not add
timestamps to our metrics.

### Reproducing the problem

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the
following scrape config:
```yaml
- job_name: broker
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
follow_redirects: true
static_configs:
- targets: ["localhost:8080"]
```
2. Start pulsar standalone on the same machine. I used a recently compiled
version of master.
3. Publish messages to a topic.
4. Observe `pulsar_in_messages_total` metric for the topic in the prometheus
UI (localhost:9090)
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s
`pulsar_in_messages_total` metric is no longer reported.
8. Watch the metrics get reported in prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported
right after the topic is unloaded, which is the desired behavior.

### Modifications

* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy
and the function worker, but no code was changed for those modules)

### Verifying this change

This change is accompanied by updated tests.

### Does this pull request potentially affect one of the following parts:

This is technically a breaking change to the metrics, though I would
consider it a bug fix at this point. I will discuss it on the mailing list to
ensure it gets proper visibility.

Given how frequently Pulsar changes which metrics are exposed between each
scrape, I think this is an important fix that should be cherry picked to older
release branches. Technically, we can avoid cherry picking this change if we
advise users to set `honor_timestamps: false`. However, I think it is better to
just remove them.

### Documentation
- [x] `doc-not-needed`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [pulsar] michaeljmarshall opened a new pull request, #17419: [fix][broker] Remove timestamp from Promtheus metrics

Reply via email to