[
https://issues.apache.org/jira/browse/HDDS-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089396#comment-18089396
]
Roland Elek commented on HDDS-15552:
------------------------------------
Besides the syntax implications, Prometheus creates a new time series for each
distinct combination of label values. This is important, as its resource
footprint scales primarily with the number of active time series. A label
whose value changes frequently over an unbounded domain (like a timestamp, a
container ID, or a list of recent state machine events as arbitrary strings,
with or without these fields) is therefore very expensive.
With 3 OMs, 3 SCMs, 15 days of retention, and a 15-second scrape interval, a
label value that changes on every scrape would produce about 500k time series -
manageable on their own, but decidedly significant for a single Prometheus
instance.
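
The 500k figure can be reproduced with a quick back-of-the-envelope
calculation. The sketch below assumes the worst case, which the comment does
not spell out: every process emits a new label value on every scrape, so each
scrape leaves behind one new time series. The function name is illustrative
only.

```python
# Worst-case series count for a label whose value changes on every scrape.
# Assumption: one new label value (hence one new series) per scrape per process.

SECONDS_PER_DAY = 24 * 60 * 60


def worst_case_series(processes: int, retention_days: int,
                      scrape_interval_s: int) -> int:
    """Series accumulated over the retention window, one per scrape per process."""
    scrapes_per_process = retention_days * SECONDS_PER_DAY // scrape_interval_s
    return processes * scrapes_per_process


if __name__ == "__main__":
    # 3 OMs + 3 SCMs, 15 days of retention, 15-second scrape interval.
    print(worst_case_series(processes=3 + 3, retention_days=15,
                            scrape_interval_s=15))  # 518400
```

That is 86,400 scrapes per process over 15 days, times 6 processes.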
> Ratis events should not be published as metrics
> -----------------------------------------------
>
> Key: HDDS-15552
> URL: https://issues.apache.org/jira/browse/HDDS-15552
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ethan Rose
> Assignee: Wei-Chiu Chuang
> Priority: Major
> Labels: pull-request-available
>
> HDDS-13133 started tracking Ratis events (arbitrary strings) as metrics.
> These then get exposed over JMX and Prometheus. This completely prevents
> Prometheus from scraping these endpoints because it fails when any of the
> messages have invalid characters like " or \n. We can keep the list of events
> in memory and maintain the web UI functionality without exposing it as a
> metric.
> Additionally, to verify this change, we should add an acceptance test call to
> {{GET http://<prometheus-host>:9090/api/v1/targets}} and ensure that
> {{health=up}} for each component to prevent future regressions like this.
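
The check proposed at the end of the description above could be sketched
roughly as follows. This is only an illustration in Python, not how the Ozone
acceptance suite would necessarily implement it. It relies on the documented
shape of the Prometheus /api/v1/targets response (data.activeTargets[] with
health, labels, scrapeUrl and lastError); the helper names, the job names and
the sample payload are hypothetical.

```python
import json
import urllib.request


def fetch_targets(base_url: str) -> dict:
    """GET <base_url>/api/v1/targets and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def unhealthy_targets(targets_response: dict) -> list:
    """Return (job, scrape URL, last error) for each active target not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


if __name__ == "__main__":
    # Hypothetical sample payload; a real test would call fetch_targets()
    # against the Prometheus host instead.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"job": "om"}, "scrapeUrl": "http://om:9874/prom",
                 "health": "up", "lastError": ""},
                {"labels": {"job": "scm"}, "scrapeUrl": "http://scm:9876/prom",
                 "health": "down", "lastError": "parse error"},
            ],
            "droppedTargets": [],
        },
    }
    failures = unhealthy_targets(sample)
    print(failures)  # the test would fail whenever this list is non-empty
```

An acceptance test would assert that the returned list is empty, and print
the lastError values when it is not, since a scrape that fails on an invalid
character is reported there.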