Thanks for the reply.
 

> Are you saying there are 3 million time series for the 
> http_server_requests_seconds_bucket metric, or in total for the server?
>
 
Yes, 3 million time series for the http_server_requests_seconds_bucket 
metric. It's still a test scenario, but not too far from our expected 
production load.

> Looking at your query the uri label looks very problematic - if that is the 
> URI called by an external Internet user that has infinite cardinality, as 
> they could just make up things. That could completely break your server. If 
> you do want to have some indication of the page requested I'd suggest some 
> sort of processing of the raw value. Remove as much as you can (query 
> parameters, final piece of the path?) and ideally match against an allow 
> list (with "other" for anything that is rejected). 
>

We are aware of the cardinality problem and use URI templates, e.g. 
/users/{id}, as well as status groups (although status grouping doesn't make 
a big difference here).
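
For what it's worth, the kind of normalization we do (and what you suggest 
with the allow list) can be sketched roughly like this - template names, 
rules, and the allow list below are made up for illustration:

```python
import re

# Hypothetical allow list of URI templates we are willing to expose as
# label values; anything unknown collapses to "other" so an attacker
# can't inflate cardinality by making up paths.
ALLOWED_TEMPLATES = {"/users/{id}", "/users/{id}/orders", "/health"}

def normalize_uri(raw: str) -> str:
    # Drop query parameters entirely.
    path = raw.split("?", 1)[0]
    # Replace purely numeric path segments with a {id} placeholder,
    # e.g. /users/42 -> /users/{id}.
    templated = re.sub(r"/\d+(?=/|$)", "/{id}", path)
    # Only emit values from the allow list; everything else is "other".
    return templated if templated in ALLOWED_TEMPLATES else "other"
```

With that, /users/42?page=3 becomes /users/{id}, while /made/up/junk 
becomes "other" instead of a fresh time series.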

We plan to build a consistent golden-signals/RED dashboard for HTTP 
workloads in a service-oriented architecture, ideally with per-endpoint 
drill-down for debugging and SLOs (not billing, just signals).
Think in the range of 50-100 services: a few with lots of endpoints, many 
with just a handful.
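
To make that concrete, the dashboard queries we have in mind look roughly 
like the following (label names are from our setup and may differ in 
yours; we use Micrometer-style http_server_requests_seconds metrics):

```promql
# Rate: requests per second, per service
sum by (service) (rate(http_server_requests_seconds_count[5m]))

# Errors: fraction of 5xx responses
  sum by (service) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/ sum by (service) (rate(http_server_requests_seconds_count[5m]))

# Duration: p99 latency estimated from the histogram buckets
histogram_quantile(0.99,
  sum by (service, le) (rate(http_server_requests_seconds_bucket[5m])))
```

The endpoint drill-down is the same set of queries with uri added to the 
grouping, which is exactly where the bucket cardinality bites.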

At some point, a single, vertically scaled Prometheus instance per 
environment won't be enough; we are currently trying to find out when.
With the current test scenario (3 million time series, 30d retention, 
running for 20d so far), we "only" see iteration misses in the rule group 
evaluation. Otherwise, the instance handles the load pretty well.
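
In case it's useful to others: we track the misses via Prometheus's own 
self-monitoring metrics, roughly like this:

```promql
# Rule group iterations skipped over the last hour
increase(prometheus_rule_group_iterations_missed_total[1h]) > 0

# How close each group's evaluation time is to its interval
# (values approaching 1 mean the group can't keep up)
  prometheus_rule_group_last_duration_seconds
/ prometheus_rule_group_interval_seconds
```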

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/26eef189-96a6-4e84-8dd9-77c0f250da75n%40googlegroups.com.