On 01/03/2021 06:44, Manjula Amunugama wrote:
Hi all,
In our environment we use Prometheus & Grafana to monitor about 200
micro-services. From one application to another, developers used
different strings as the namespace component.
For example, we have used Prometheus keys like
"booking_engine_driver_eta_location_service_outboundcall_latency_microseconds_count"
to count the latency from "BookingEngine.Driver-ETA" to
"Location-Service". Here "Booking Engine" is the service group,
"Driver-ETA" is the service, and "Location-Service" is the outbound
service.
For API-based requests it is essential to monitor "Inbound Request
Rates by Endpoint", "Inbound Request Error Rates by Endpoint",
"Processing Latency by Endpoint", "Outbound Request Rates by Endpoint"
and "Outbound Request Error Rates by Endpoint".
We can monitor all the services with about three dashboards ("Inbound
Service Monitor Rates", "Outbound Service Monitor Rates", "Processing
Latencies"), provided we know the Prometheus keys used.
So we wanted to standardize the Prometheus keys as follows:
- The namespace identifies the "Development Team"
- The application name becomes a label on the key, named "app"
- The endpoint also becomes a label on the key
- The error becomes a label on the key
So the previous key, with labels, becomes
"outboundcall_latency_microseconds_count{app="booking_engine_driver_eta_location_service"}"
Doing this, we can automate most of the dashboarding and alerting.
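With a single metric name, each dashboard panel can be driven by one
templated query instead of one query per service. A sketch in PromQL,
assuming the proposed "app" and "endpoint" labels are in place (they
are part of the proposal above, not the existing keys):

```promql
# Outbound call rate per application and endpoint
sum by (app, endpoint) (rate(outboundcall_latency_microseconds_count[5m]))
```

In Grafana the "app" label can then back a template variable, so one
dashboard covers every application.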
By doing this, about 200 time series would be grouped into about 4
groups, and hence roughly 200 time series become 4.
Doing so, will there be a big hit to Prometheus performance?
A time series is different from a metric.
A metric has a name and an optional set of labels.
A time series is one specific combination of metric name and label values.
So, for example, a metric could be called "requests_count", while two
distinct time series could be "requests_count{response_code='200'}" and
"requests_count{system='frontend',authenticated='false'}".
As a result, in terms of the number of time series there is no
difference between 100 metrics with no labels and a single metric with
a label that takes 100 values.
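This equivalence can be checked directly in PromQL; a sketch, using the
"requests_count" metric from above and the key prefix from the original
question:

```promql
# Series behind one labelled metric
count(requests_count)

# Series across many unlabelled metrics sharing a prefix
count({__name__=~"booking_engine_.*_count"})
```

Both return the number of underlying time series, regardless of how
those series are split between metric names and labels.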
How the difference affects performance will depend on how things are
being used. There is likely to be little difference in performance
during scraping, but query patterns could make a bigger difference. A
metric with labels is expected to be aggregatable, so it only makes
sense to arrange the data that way if that holds. If you were to sum
together all the different label combinations of a particular metric,
would the result make sense? For example, a metric which counts
requests and has a label for the error code would still make sense if
you summed everything together (rather than requests per code you would
have the total number of requests).
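In PromQL terms, using the "requests_count" example with its
"response_code" label:

```promql
# Request rate per response code
sum by (response_code) (rate(requests_count[5m]))

# Total request rate - summing away the label still makes sense
sum without (response_code) (rate(requests_count[5m]))
```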
Would it make sense in your case to use labels within a single metric?
If the different systems are completely unrelated it might not - a sum
wouldn't mean anything, and an average would be equally useless, as the
systems do totally different kinds of work. However, if you are looking
at latencies end-to-end across multiple systems in a flow, or have
multiple instances of a system, then labels would make more sense - a
sum would give you the overall end-to-end latency, or you could produce
averages for a particular system across instances.
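As a sketch of that last case, assuming the latency metric is a summary
that also exposes a "_sum" counter alongside the "_count" shown
earlier, and carries the proposed "app" label:

```promql
# Average outbound call latency per application, across all instances
  sum by (app) (rate(outboundcall_latency_microseconds_sum[5m]))
/
  sum by (app) (rate(outboundcall_latency_microseconds_count[5m]))
```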
--
Stuart Clark
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/10c3f2ab-170c-eec5-5449-56ba6c84e340%40Jahingo.com.