Only you can determine that, by comparing the lists of alerts from both 
sides, seeing what differs, and looking into how they are generated and 
measured. Many things can affect the counts: alerts still in the pending 
state, keep_firing_for, Alertmanager's group_wait, and so on.
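For example, an alerting rule's pending period and Alertmanager's grouping 
delays both change what you see on each side. A minimal sketch (rule and 
metric names are made up for illustration):

```yaml
# Prometheus alerting rule: the alert sits in "pending" for 10m before it
# fires, and keeps firing for 5m after the condition clears -- both affect
# what Prometheus counts as an active alert at any moment.
groups:
  - name: example
    rules:
      - alert: HighErrorRate            # hypothetical alert name
        expr: job:errors:rate5m > 0.05  # hypothetical recorded metric
        for: 10m
        keep_firing_for: 5m

# Alertmanager route (separate file): notifications for a new group are
# delayed by group_wait, so briefly-firing alerts may never show up there.
# route:
#   group_wait: 30s
#   group_interval: 5m
#   repeat_interval: 4h
```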

But you might also want to read this:

If you're generating more than a handful of alerts per day, then maybe you 
need to reconsider what constitutes an "alert".

On Saturday 30 March 2024 at 09:49:04 UTC Trio Official wrote:

> Thank you for your prompt response and guidance on addressing the metric 
> staleness issue.
> Regarding metric staleness: I confirm that I have already implemented the 
> approach of using range selectors in the recording and alerting rules 
> (e.g. max_over_time(metric[1h])). However, the main challenge persists: 
> the number of alerts generated by Prometheus does not match the number 
> displayed in Alertmanager. 
> To illustrate, when observing Prometheus, I may observe approximately 
> 25,000 alerts triggered within a given period. However, when reviewing the 
> corresponding alerts in Alertmanager, the count often deviates 
> significantly, displaying figures such as 10,000 or 18,000, rather than the 
> expected 25,000.
> This inconsistency poses a significant challenge in our alert management 
> process, leading to confusion and potentially overlooking critical alerts.
> I would greatly appreciate any further insights or recommendations you may 
> have to address this issue and ensure alignment between Prometheus and 
> Alertmanager in terms of the number of alerts generated and displayed.
> On Saturday, March 30, 2024 at 2:29:42 PM UTC+5:30 Brian Candler wrote:
>> On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
>> I believe that recording rules and alerting rules similarly may have 
>> their evaluation time happen at different offsets within their 
>> evaluation interval. This is done for the similar reason of spreading 
>> out the internal load of rule evaluations across time.
>> I think it's more accurate to say that *rule groups* are spread over 
>> their evaluation interval, and rules within the same rule group are 
>> evaluated sequentially.
>> This is how you can build rules that depend on each other, e.g. a recording 
>> rule followed by other rules that depend on its output; put them in the 
>> same rule group.
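A sketch of that pattern (metric and alert names are illustrative): the 
recording rule is evaluated first, so the alert in the same group can 
reference its output within the same evaluation cycle:

```yaml
groups:
  - name: request_errors
    interval: 1m
    rules:
      # 1. Recording rule, evaluated first in each cycle.
      - record: job:request_errors:rate5m
        expr: sum by (job) (rate(http_requests_errors_total[5m]))
      # 2. Evaluated next, sequentially, using the freshly recorded series.
      - alert: HighRequestErrors
        expr: job:request_errors:rate5m > 10
        for: 5m
```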
>> As for scraping: you *can* change this staleness interval using 
>> --query.lookback-delta, but it's strongly discouraged. With the 
>> default of 5 mins, you should use a maximum scrape interval of 2 mins so 
>> that even if you miss one scrape for a random reason, you still have two 
>> points within the lookback-delta so that the timeseries does not go stale.
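In config terms, that advice looks like this (a fragment, not a complete 
prometheus.yml):

```yaml
# prometheus.yml (fragment): a 2-minute scrape interval leaves headroom
# within the default 5-minute lookback window, so a single missed scrape
# does not make a series go stale.
global:
  scrape_interval: 2m
  scrape_timeout: 30s

# The lookback window itself is a command-line flag, not part of the config
# file (and changing it is discouraged, as noted above):
#   prometheus --query.lookback-delta=5m
```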
>> There's no good reason to scrape at one hour intervals:
>> * Prometheus is extremely efficient with its storage compression, 
>> especially when adjacent data points are equal, so scraping the same value 
>> every 2 minutes is going to use hardly any more storage than scraping it 
>> every hour.
>> * If you're worried about load on the exporter because responding to a 
>> scrape is slow or expensive, then you should run the exporter every hour 
>> from a local cronjob, and write its output to a persistent location (e.g. 
>> to PushGateway or statsd_exporter, or simply write it to a file which can 
>> be picked up by node_exporter textfile-collector or even a vanilla HTTP 
>> server).  You can then scrape this as often as you like.
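A minimal sketch of the cron-plus-textfile approach (paths and the metric 
are hypothetical; the atomic rename is the important part):

```shell
# Hypothetical hourly collector, run from cron instead of answering a
# slow/expensive scrape directly, e.g.:
#   0 * * * * TEXTFILE_DIR=/var/lib/node_exporter/textfile /usr/local/bin/collect_metrics.sh
TEXTFILE_DIR="${TEXTFILE_DIR:-$(mktemp -d)}"  # node_exporter --collector.textfile.directory
OUT="$TEXTFILE_DIR/myapp.prom"
TMP="$(mktemp "$OUT.XXXXXX")"
{
  printf '# HELP myapp_jobs_total Jobs processed (hypothetical metric).\n'
  printf '# TYPE myapp_jobs_total counter\n'
  printf 'myapp_jobs_total 42\n'
} > "$TMP"
mv "$TMP" "$OUT"  # atomic rename: a scrape never sees a half-written file
```

Prometheus can then scrape node_exporter as often as it likes; the expensive 
work still happens only once an hour.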
>> node_exporter's textfile collector exposes an extra metric, 
>> node_textfile_mtime_seconds, with the modification timestamp of each 
>> file, so you can alert if the file isn't being updated.
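An alert on that timestamp metric might look like this (the threshold is 
illustrative, chosen for an hourly cron job):

```yaml
groups:
  - name: textfile_freshness
    rules:
      - alert: TextfileMetricsStale
        # Fires if the .prom file has not been rewritten for over 2 hours.
        expr: time() - node_textfile_mtime_seconds > 7200
        for: 15m
        annotations:
          summary: "Textfile {{ $labels.file }} on {{ $labels.instance }} is stale"
```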

You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.