On 25/02/2021 20:29, Saurabh Vartak wrote:
Hi Stuart,
Thanks again for your help and continued guidance. To summarise your
suggestions in a nutshell:
1. If there is a requirement to have aggregated metrics in place,
Prometheus Federation would be the way to go.
2. If there is a requirement for long-term retention (either for a
single Prometheus server or a group of Prometheus servers), an external
storage solution like Cortex or Thanos can be used.
I hope I am correct with the above two points.
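(For point 2, the usual mechanism is a remote_write stanza on each
Prometheus server pointing at the external store. A minimal sketch,
assuming a Cortex or Thanos Receive style push endpoint — the URL below
is a placeholder, not a real endpoint:)

```yaml
# Hypothetical remote_write configuration for long-term storage.
# Cortex and Thanos Receive both accept the Prometheus remote write
# protocol; the URL here is purely illustrative.
remote_write:
  - url: 'http://cortex.example.internal/api/v1/push'
```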
Also, I need your help with the below two questions to wrap up this thread:
1. When we use Prometheus Federation, the metrics sent from a
Prometheus server to a centralized Prometheus server do get stored in
the TSDB of the centralized Prometheus server. Is the understanding
correct?
That is correct. The central server sees the federation with the other
server in exactly the same way as any other scrape target.
So the configured storage retention and any remote write configuration
would apply (in the same way as for any other targets the central
server scrapes).
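(Concretely, on the central server federation is just a scrape job
against the /federate endpoint. A sketch, with illustrative names,
intervals, and match[] selector — none of these come from the thread:)

```yaml
# Hypothetical federation scrape job on the central Prometheus server.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 2m          # illustrative; see the discussion below
    honor_labels: true           # keep the labels set by the local server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'         # which series the local server exposes
    static_configs:
      - targets:
          - 'prometheus-local:9090'   # placeholder local server address
```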
2. When we use Prometheus Federation, all the metrics scraped by a
Prometheus server can be sent to the Centralized Prometheus server.
However, as a best practice, it is always recommended to send only
aggregated metrics to the centralized Prometheus server. Is the
understanding correct?
Federation is different from "sending" metrics around. In particular,
when a server scrapes the federation endpoint, it returns only the
latest value for each of the selected metrics at that point in time.
For example, if the scrape interval of a target were 30s but the
interval for the federation scrape were 120s, the local server would
hold 4 values for every 2-minute period, but the central server would
only contain 1.
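(The sample-count arithmetic above can be spelled out; the intervals
are the hypothetical 30s/120s ones from the example:)

```python
# How federation thins out samples: the local server scrapes the
# target every 30s, the central server scrapes /federate every 120s.
window = 120             # seconds in the comparison window
target_interval = 30     # local scrape interval for the target
federation_interval = 120  # central scrape interval for /federate

local_samples = window // target_interval        # samples kept locally
central_samples = window // federation_interval  # samples kept centrally

print(local_samples, central_samples)  # 4 1
```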
While you could try to use federation to fetch all metrics (remembering
that you wouldn't necessarily get all the values scraped by the local
server), you may quickly hit resource limits. The text exposition
format used for federation is not as efficient as the protocol used for
remote write, for example, so you might see high network or CPU usage
on both servers. Equally, depending on the quantity of metrics and the
scrape interval chosen for the central server, you could find the
volume so great that the scrape fails to complete within the timeout
period (which is at most the scrape interval).
This is in addition to the multiplication effect of trying to store all
metrics in a central server: a single server might handle 1 million
time series, but federating all metrics from 100 such servers centrally
would require the central server to handle 100 million time series,
which would likely need far more resources than are reasonably
available.
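(The usual pattern for keeping the federated volume small is to
pre-aggregate on each local server with a recording rule and federate
only the aggregated series. A sketch — the rule name, metric, and
selector are all illustrative, following the common "level:metric:operation"
naming convention:)

```yaml
# Hypothetical recording rule on each local Prometheus server:
# collapse per-instance series into one job-level series.
groups:
  - name: aggregate-for-federation
    rules:
      - record: job:http_requests_total:sum
        expr: sum by (job) (http_requests_total)
```

The central server's federation job would then match only those
aggregated series, e.g. with 'match[]': '{__name__=~"job:.*"}', so each
local server contributes a handful of series rather than all of them.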
--
Stuart Clark
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/bf446ede-e735-da31-5731-07f994a4dd3e%40Jahingo.com.