On 23/04/2020 18:43, Lukas wrote:
Hi,
I'm fairly new to Prometheus so bear with me. What I'm trying to do is
graph the number of EC2 Instances that have been deleted by a tool of
mine. I'm not even sure Prometheus is the right tool for the job.
I have a tool that's persistently running. Every hour or so it
terminates a number of unused AWS EC2 Instances and exports the
numbers at a /metrics endpoint. It currently generates a
GaugeMetricFamily called cleaned_instances_total with the
labels [cloud, region, type] containing e.g. ['aws', 'us-west-2',
'm5.xlarge'] and, as its value, the number of instances it just removed.
Prometheus scrapes the target about twice an hour. The first
problem I ran into is that I'd like to graph the number of instances
that were removed in a day, per region and type. However, because
Prometheus scrapes the target more often than the instances are being
cleaned, I get repeating values. Very visible on this output:
The maximum reasonable scrape interval is around 2 minutes, because a
series is treated as stale once it has had no sample for 5 minutes, so
trying to scrape every 30 minutes will likely cause issues.
Like with my human eye I can tell that those exact repeating values
are likely because the metrics were scraped before the next cleanup
run occurred. But I don't know how to express that in a PromQL query.
I suppose I could make the Gauge a Counter, but even then: let's say I
have two metrics where 10 instances each have been terminated, how
would I know if the 10 instances in my second timestamp are the same
10 from the first one, or if the tool terminated 10 instances, was
restarted and the Counter reset to 0 and then terminated another 10
instances?
Prometheus isn't designed to give "exact" billing level answers, but is
more for "good enough" information for system monitoring purposes. You
are right that a counter reset will result in some potential loss of
data between scrapes.
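
To make the reset behaviour concrete, here is a rough Python sketch of
how an increase can be computed from raw counter samples: any drop in
value is treated as a restart from zero, which is the same idea rate()
and increase() use. The function name and the sample values are made up
for illustration, and the real functions also extrapolate to the edges
of the window, which this ignores.

```python
def counter_increase(samples):
    """Total increase across a list of counter samples, tolerating resets.

    A sample lower than the previous one is taken to mean the process
    restarted and the counter began again from zero.
    """
    total = 0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # Reset detected: everything counted since the restart is new.
                total += value
        previous = value
    return total


# Hypothetical scrapes: repeated values add nothing, so no deduplication is needed.
print(counter_increase([10, 10, 10, 20, 20]))  # 10
# The tool restarted after reaching 20, then terminated 5 more instances.
print(counter_increase([10, 20, 5]))  # 15
# The ambiguous case: a restart followed by 10 new terminations between
# two scrapes is indistinguishable from "nothing happened".
print(counter_increase([10, 10]))  # 0
```

The last call is the data loss mentioned above: the shorter the scrape
interval, the less likely a restart and a full cleanup run both fit
between two scrapes.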
I would suggest using a counter of the number of instances terminated
(BTW, per the naming recommendations, the _total suffix in your metric
name already suggests it is a counter) and setting the scrape interval
to be something more like 1-2 minutes.
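
As a minimal sketch of what "make it a counter" means on the exporter
side: keep a running total per label set for the lifetime of the
process and expose that, instead of the size of the last cleanup run.
The helper names and the numbers below are hypothetical, and the output
is hand-rendered in the text exposition format only to keep the example
self-contained (no label escaping); in practice a client library would
normally do this for you.

```python
def render_counter(name, help_text, counts):
    """Render cumulative counts in the Prometheus text exposition format.

    counts maps (cloud, region, type) tuples to running totals.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for (cloud, region, instance_type), total in sorted(counts.items()):
        labels = f'cloud="{cloud}",region="{region}",type="{instance_type}"'
        lines.append(f"{name}{{{labels}}} {total}")
    return "\n".join(lines) + "\n"


# Running totals, kept in memory for as long as the tool runs.
totals = {}


def record_cleanup(cloud, region, instance_type, removed):
    """Add the instances removed in one cleanup run to the running total."""
    key = (cloud, region, instance_type)
    totals[key] = totals.get(key, 0) + removed


# Two hypothetical cleanup runs; the exposed value only ever goes up.
record_cleanup("aws", "us-west-2", "m5.xlarge", 10)
record_cleanup("aws", "us-west-2", "m5.xlarge", 3)
print(render_counter(
    "cleaned_instances_total",
    "Instances terminated by the cleanup tool.",
    totals,
))
```

Scraping this several times between cleanup runs just returns the same
total again, which is harmless for a counter.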
I guess one question right now would be, is there a way to deduplicate
those identical values? Like, if all labels and values at a point in
time are the same as the ones from previous points in time consider it
as a single timestamp... or something along those lines.
Is Prometheus even the right tool for what I'm trying to do? Basically
I'm not trying to graph something that happens over time, I'm trying
to graph some number of events that happen at a point in time.
It really depends what you are hoping for. If you want a graph over time
with spikes when things roughly happen, then it can do that - graph the
rate() of your counter. If, however, you want precise details of when
things happened, you want an event system rather than a metrics system,
so something like Elasticsearch, Splunk or a more generalised database.
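
For the "removed in a day, per region and type" graph, a query along
these lines may be a reasonable starting point. Note that it uses
increase() rather than rate(); increase() is the per-second rate
multiplied by the window length, so it reads directly as a count. The
Python wrapper only builds the URL for the HTTP API's instant-query
endpoint; the helper name is made up, and the address assumes a
Prometheus server on its default port on localhost. Because of
extrapolation the results can be non-integer.

```python
from urllib.parse import urlencode

# PromQL: instances terminated over the last day, per region and type.
DAILY_QUERY = "sum by (region, type) (increase(cleaned_instances_total[1d]))"


def build_query_url(base_url, query):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    params = urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


# Assumed server address; adjust to wherever Prometheus is running.
print(build_query_url("http://localhost:9090", DAILY_QUERY))
```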
--
Stuart Clark