On 25/02/2021 20:29, Saurabh Vartak wrote:
Hi Stuart,

Thanks again for your help and continued guidance. If I may summarise your suggestions in a nutshell:
1. If there is a requirement to have aggregated metrics in place, Prometheus federation would be the way to go.
2. If there is a requirement for long-term retention (either for a single Prometheus server or a group of Prometheus servers), an external storage solution like Cortex or Thanos can be used.

I hope I am correct with the above two points.

Also, I needed your help on the below 2 questions to wrap this thread:
1. When we use Prometheus federation, the metrics pulled from a Prometheus server by a centralized Prometheus server get stored in the TSDB of the centralized Prometheus server. Is this understanding correct?

That is correct. The central server sees the federation with the other server in exactly the same way as any other scrape target.

So whatever storage retention and any remote-write configuration the central server has would apply, in the same way as for any other targets the central server scrapes.
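For reference, on the central server federation is configured as an ordinary scrape job against the /federate endpoint. A minimal sketch (the job name, target address, and match[] selector below are placeholders, not taken from this thread):

```yaml
# prometheus.yml on the central server (illustrative values)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 120s
    honor_labels: true        # keep the job/instance labels set by the source server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'       # which series to pull; narrow this in practice
    static_configs:
      - targets:
          - 'prometheus-a.example.internal:9090'
```

Because this is just another scrape job, the central server's own retention and remote_write settings apply to the federated samples like any other scraped data.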

2. When we use Prometheus federation, all the metrics scraped by a Prometheus server can be pulled by the centralized Prometheus server. However, as a best practice, it is recommended to expose only aggregated metrics to the centralized Prometheus server. Is this understanding correct?

Federation is different to "sending" metrics around. In particular, when a server scrapes the federation endpoint it returns the latest value for all metrics that have been selected at that point in time. For example, if the scrape period of a target was 30s but the period for the federation was 120s, then the local server would hold 4 values for every 2-minute period, but the central server would only contain 1.

While you could try to use federation to fetch all metrics (remembering that you wouldn't necessarily get all values scraped by the local server), you may quickly hit resource limitations. The text exposition format used by federation is not as efficient as the protocol used for remote write, for example, so you might see high network or CPU usage on both servers. Equally, depending on the quantity of metrics and the scrape interval chosen for the central server, the volume could be so great that the scrape fails to complete within the timeout period (which is at most the scrape period).

This would be in addition to the multiplication effect of trying to store all metrics in a central server (a server could handle 1 million time series, but trying to federate all metrics from 100 such servers centrally would need the central server to handle 100 million time series, which would likely require a lot more resources than would reasonably be available).
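To follow the aggregated-only best practice from question 2, a common pattern is to aggregate locally with recording rules and federate only those series. A sketch, assuming a hypothetical metric name and the standard level:metric:operations rule-naming convention:

```yaml
# rules.yml on each local Prometheus (metric name is illustrative)
groups:
  - name: aggregate_for_federation
    rules:
      # Collapse per-instance series into one per-job series locally,
      # so only the small aggregate needs to be federated.
      - record: job:http_requests_total:sum_rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

The central server's 'match[]' parameter could then select something like '{__name__=~"job:.*"}' so that only these aggregated series are pulled, keeping the central time-series count small.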

--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bf446ede-e735-da31-5731-07f994a4dd3e%40Jahingo.com.
