On 01/03/2021 06:44, Manjula Amunugama wrote:
Hi all,

In our environment for monitoring about 200 micro-services, we use Prometheus & Grafana.

From one application to another, developers have used different strings as the namespace component. For example, we have used Prometheus keys like "booking_engine_driver_eta_location_service_outboundcall_latency_microseconds_count" to count the latency from "BookingEngine.Driver-ETA" to "Location-Service". Here "Booking Engine" is the "Service Group", "Driver-ETA" is the service, and "Location-Service" is the outbound service.

In monitoring, it is a must to track "Inbound Request Rates by Endpoint", "Inbound Request Error Rates by Endpoint", "Processing Latency by Endpoint", "Outbound Request Rates by Endpoint", and "Outbound Request Error Rates by Endpoint" for API-based requests.

We can monitor all the services with about 3 dashboards ("Inbound Service Monitor Rates", "Outbound Service Monitor Rates", "Processing Latencies") as long as we know the Prometheus keys used.
So we want to standardize the Prometheus keys as follows:
- We use namespace to define the "Development Team"
- Application Name will be a label in the key - i.e. label will be "app"
- Endpoint also will be a label in the key
- Error will be a label in the key

So the previous key with labels will be changed to "outboundcall_latency_microseconds_count{app="booking_engine_driver_eta_location_service"}"
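As a rough illustration of the renaming described above, the following Python sketch splits a legacy key into a shared metric name and an "app" label. The suffix list and both function names are hypothetical; only the first suffix and the example key come from this message.

```python
# Hypothetical list of metric-name suffixes shared by all services.
# Only the first entry appears in the message; the second is an assumption.
KNOWN_SUFFIXES = [
    "outboundcall_latency_microseconds_count",
    "inboundcall_latency_microseconds_count",
]


def split_legacy_key(key):
    """Split a legacy key into (metric_name, labels) using the known suffixes."""
    for suffix in KNOWN_SUFFIXES:
        if key.endswith("_" + suffix):
            # Everything before the suffix becomes the value of the "app" label.
            app = key[: -(len(suffix) + 1)]
            return suffix, {"app": app}
    raise ValueError(f"no known suffix in key: {key}")


def format_series(name, labels):
    """Render a metric name and its labels in Prometheus selector syntax."""
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}"


legacy = (
    "booking_engine_driver_eta_location_service_"
    "outboundcall_latency_microseconds_count"
)
name, labels = split_legacy_key(legacy)
print(format_series(name, labels))
```

A mapping like this could also drive relabelling during a migration, but that is beyond what this sketch shows.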

Doing this, we can automate most of the things related to dashboarding and alerting.

By doing this, about 200 time series will be grouped into about 4 groups, and hence 200 time series will become 4 time series.

Will doing so cause a big hit to Prometheus performance?

A time series is different to a metric.

A metric has a name and an optional selection of labels.

A time series is one specific metric & label combination.

So, for example, a metric could be called "requests_count", but two time series could be "requests_count{response_code='200'}" or "requests_count{system='frontend',authenticated='false'}".

As a result, in terms of the number of time series there is no difference between 100 metrics with no labels and a single metric with a label with 100 values.
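To make the counting argument concrete, here is a minimal Python sketch that models a time series as a metric name plus its label set. This is only a model for illustration, not how Prometheus stores data internally, and the metric and label names are made up.

```python
def series_id(name, labels=None):
    """Identify a time series by its metric name plus its full label set."""
    return (name, frozenset((labels or {}).items()))


# Layout A: 100 differently named metrics with no labels (names are made up).
unlabelled = {series_id(f"service_{i}_requests_count") for i in range(100)}

# Layout B: one metric name with an "app" label taking 100 different values.
labelled = {
    series_id("requests_count", {"app": f"service_{i}"}) for i in range(100)
}

# Both layouts contain the same number of distinct time series.
print(len(unlabelled), len(labelled))
```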

How the difference affects performance will depend on how things are being used. There is likely to be little difference in performance during scraping, but query usage could make a bigger difference. A metric with labels is expected to be aggregatable, so it would make sense to arrange the data in that way if that would be true. If you were to sum together all the different label combinations of a particular metric, would the result make sense? For example, a metric which counts requests and has labels for error code would still make sense if you summed everything together (rather than requests per code you would have the total number of requests).
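The "does the sum make sense?" test can be tried on paper with a small Python sketch. The sample values, the label names "app" and "code", and the helper sum_by are all assumptions for illustration; the helper only mimics the idea of summing over label combinations while keeping chosen labels.

```python
from collections import defaultdict

# Made-up samples for a single metric, requests_count: (labels, value).
samples = [
    ({"app": "frontend", "code": "200"}, 940),
    ({"app": "frontend", "code": "500"}, 12),
    ({"app": "backend", "code": "200"}, 480),
    ({"app": "backend", "code": "500"}, 3),
]


def sum_by(samples, keep):
    """Sum values across series, keeping only the label names in `keep`."""
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


# Summing over every label gives the total number of requests.
total = sum_by(samples, [])

# Keeping "app" gives requests per application, regardless of response code.
per_app = sum_by(samples, ["app"])

print(total)
print(per_app)
```

If the summed number has no sensible meaning for your data, that is a hint the series probably do not belong under one metric name.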

Would it make sense in your case to use labels within a single metric? If the different systems are completely unrelated, that might not be the case: a sum wouldn't mean anything, and an average would be equally useless, as the different systems do totally different kinds of work. However, if you are looking at latencies end-to-end across multiple systems in a flow, or have multiple instances of a system, then it does sound like the use of labels would make more sense: a sum would give you the overall end-to-end latency, or you could produce averages for a particular system across instances.

--
Stuart Clark
