Hello everyone!

We are running Prometheus on a larger scale - 35 million metrics. We have 
been at this scale for some time now (~1 year). For the last two months we 
started to experience problems after a large compaction. This happens from 
time to time, not periodically, without a time pattern. What happens is 
that Prometheus stops responding to API requests, /metrics endpoints 
doesn't work and it stops doing internal processes (further compactions 
don't happen, alerting rules evaluation throws errors), but seems that it 
continues to scrape all the targets (network usage on the machine does not 
drop and the WAL increases). The only way to get it out of this state is to 
restart the Prometheus docker container.
*Image that shows it happens during (or maybe right before it finishes) the 
large compaction:*
[image: Untitled.png]

*Log for alert execution error:*
level=warn ts=2022-03-28T10:59:43.076Z caller=manager.go:603 
component="rule manager" group=push_requests_submitted.alert 
msg="Evaluating rule failed" rule="alert: Push_Requests_Submitted expr: 
"expression" for: 1m " err="query timed out in query queue"

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fe553413-a52d-4d9b-9a6f-c5fe82f95609n%40googlegroups.com.

Reply via email to