Re: [prometheus-users] Storing and graphing the number of terminated EC2 Instances

Stuart Clark Thu, 23 Apr 2020 13:29:20 -0700

On 23/04/2020 18:43, Lukas wrote:

Hi,
I'm fairly new to Prometheus so bear with me. What I'm trying to do is graph the number of EC2 Instances that have been deleted by a tool of mine. I'm not even sure Prometheus is the right tool for the job.
I have a tool that's persistently running. Every hour or so it terminates a number of unused AWS EC2 Instances and exports the numbers at a /metrics endpoint. It currently generates a GaugeMetricFamily called cleaned_instances_total with the labels [cloud, region, type] containing e.g. ['aws', 'us-west-2', 'm5.xlarge'] and as value the number of Instances it just removed.
Prometheus scrapes the target like twice an hour. Now the first problem I ran into is I'd like to graph the number of instances that were removed in a day, per region and type. However because Prometheus scrapes the target more often than the instances are being cleaned, I get repeating values. Very visible on this output:

The maximum reasonable scrape interval is around 2 minutes due to staleness, so trying to scrape every 30 minutes will likely cause issues.

Like with my human eye I can tell that those exact repeating values are likely because the metrics were scraped before the next cleanup run occurred. But I don't know how to express that in a PromQL query.
I suppose I could make the Gauge a Counter, but even then; let's say I have two metrics where 10 instances each have been terminated, how would I know if the 10 instances in my second timestamp are the same 10 from the first one, or if the tool terminated 10 instances, was restarted and the Counter reset to 0 and then terminated another 10 instances?

Prometheus isn't designed to give "exact" billing level answers, but is more for "good enough" information for system monitoring purposes. You are right that a counter reset will result in some potential loss of data between scrapes.
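
To make the reset case concrete, here is a minimal Python sketch of how a drop in a counter's value can be read as a restart. It is only an illustration of the idea, not Prometheus's actual implementation (the real increase() also extrapolates to the edges of the query window), and the function name reset_aware_increase is made up for this example.

```python
from typing import Sequence


def reset_aware_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across scraped samples.

    A sample lower than its predecessor is treated as a restart:
    the counter is assumed to have gone back to 0 and grown to the
    new value. Repeated identical samples contribute nothing.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev  # normal growth, 0 for a repeated value
        else:
            total += curr  # reset detected: count growth since the restart
    return total


# Same scrape result seen several times: still only 10 terminations.
repeated = reset_aware_increase([0, 10, 10, 10])

# 10 terminated, tool restarted (counter back to 0), 10 more terminated.
restarted = reset_aware_increase([0, 10, 10, 0, 10])

# Anything counted after the last scrape but before the restart is lost:
# the true total could be higher than what the samples show.
lossy = reset_aware_increase([0, 10, 4])
```

With a counter, repeated scrapes of an unchanged value add nothing, which takes care of the duplicate readings. A restart is only ambiguous if the counter climbs back to or past its previous value between two scrapes, or if terminations happen after the last scrape before the restart. On the query side, something like sum by (region, type) (increase(cleaned_instances_total[1d])) would then give an approximate per-day total.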

I would suggest using a counter of the number of instances terminated (BTW your metric naming already suggests it is a counter from the naming recommendations) and setting the scrape interval to be something more like 1-2 minutes.
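
As a rough sketch of what that could look like with the Python client (prometheus_client), assuming the tool can keep a long-lived Counter object instead of building a metric family per collection. The helper record_cleanup and the dedicated registry are assumptions made for this example. If the tool keeps its custom collector, the client's CounterMetricFamily would be the counterpart of GaugeMetricFamily, fed with a running total rather than the size of the last batch.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A dedicated registry keeps the example self-contained.
registry = CollectorRegistry()

# The client exposes this counter as cleaned_instances_total.
CLEANED = Counter(
    "cleaned_instances",
    "Number of instances terminated by the cleanup tool",
    ["cloud", "region", "type"],
    registry=registry,
)


def record_cleanup(cloud: str, region: str, instance_type: str, count: int) -> None:
    """Add the instances terminated in one cleanup run to the running total."""
    CLEANED.labels(cloud=cloud, region=region, type=instance_type).inc(count)


# Two hourly cleanup runs, 10 instances each (hypothetical numbers).
record_cleanup("aws", "us-west-2", "m5.xlarge", 10)
record_cleanup("aws", "us-west-2", "m5.xlarge", 10)

# Text that a /metrics endpoint would serve for this registry.
exposition = generate_latest(registry).decode()
```

The value only ever goes up while the process is alive, so scraping it every minute or two just records the same total until the next cleanup run adds to it.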

I guess one question right now would be, is there a way to deduplicate those identical values? Like, if all labels and values at a point in time are the same as the ones from previous points in time consider it as a single timestamp... or something along those lines.
Is Prometheus even the right tool for what I'm trying to do? Basically I'm not trying to graph something that happens over time, I'm trying to graph some number of events that happen at a point in time.

It really depends what you are hoping for. If you want a graph over time with spikes when things roughly happen, then it can do that - graph the rate() of your counter. If however you are wanting perfect details of when things happened you are wanting an event system rather than a metrics system, so something like Elasticsearch, Splunk or a more generalised database.



--
Stuart Clark

