sum by returns hundreds of results (one for each mountpoint), and each of them appears to generate an alert event.

When we try to consolidate that to a single event by using topk, the event changes rapidly between instances.
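Aggregating with max by instead of topk might be one way to get a single stable series per device that still carries identifying labels (a sketch only, using the SIZE_QUERY placeholder defined in the quoted post below):

# Sketch: one series per (instance, device), with both labels kept for
# the notification. Assumes consolidating per device is acceptable.
max by (instance, device) (SIZE_QUERY) > 80

That still yields one alert per device, though, so the duplicated shared mounts (which sit on different devices) would keep firing separately.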
This is the only thing in the logs showing a problem:

level=warn ts=2020-10-22T14:10:32.912Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4103 limit=4096

On Thursday, October 22, 2020 at 9:34:45 AM UTC-6 [email protected] wrote:

> >> But when the query returns many results, it causes alertmanager
> >> problems.
> >>
> >> sum by returns too many results and brings alertmanager to its knees,
> >> which breaks our alerting in general.
>
> This is a little too vague. What problems are you referring to? Are you
> seeing performance issues with alertmanager when there are too many
> alerts, or is it a usability problem when you get multiple alerts for
> the same underlying problem?
>
> On Thursday, 22 October 2020 at 16:24:56 UTC+1 [email protected] wrote:
>
>> I have a query to find out if space is running out:
>>
>> (100 - (100 *
>> node_filesystem_avail_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}
>> /
>> node_filesystem_size_bytes{job="special_host",mountpoint=~"/my_data/[a-zA-Z]*/.*"}))
>>
>> For simplicity, let's substitute this with SIZE_QUERY.
>>
>> This VM is special because there are multiple metrics that are
>> equivalent. I have two categories of mounts on the host:
>>
>> This group of mounts shares the underlying storage and has duplicated
>> values (note: for brevity, only 2 of many are shown):
>>
>> {device="$DEVICE1",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/first"} 86.6186759625663
>> {device="$DEVICE2",fstype="$FS1",instance="$INSTANCE1",job="special_host",mountpoint="/my_data/second"} 86.6186759625663
>>
>> This group of mounts does not share underlying storage:
>>
>> {device="$DEVICE3",fstype="$FS2",instance="$INSTANCE1",job="special_host",mountpoint="/var/log"} 85.1214545444532
>>
>> I want to alert when any single host is above the threshold. When the
>> instance is not in the "shared" group, this is trivial. But when the
>> query returns many results, it causes alertmanager problems.
>>
>> My promql knowledge is lacking on how to get around this limitation,
>> but these are the things I've tried. Each has a problem:
>>
>> topk: flaps between each of the alerting instances as the labels change.
>> topk(1, sum by (instance, mountpoint, device) (SIZE_QUERY) > 80)
>>
>> sum by: returns too many results and brings alertmanager to its knees,
>> which breaks our alerting in general.
>> sum by (device, instance) (SIZE_QUERY) > 80
>> sum by (device, instance, mountpoint) (SIZE_QUERY) > 80
>>
>> max: doesn't show the labels, which makes the notification hard to
>> debug (what instance? what device?).
>> max(SIZE_QUERY > 80)
>>
>> Is there a possible solution to this that I haven't considered?
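A possible direction for the shared-storage duplicates (a sketch only; the storage_group label is invented for this example, and the regex would need adjusting to the real mountpoint layout): rewrite the mountpoint into a single group label with label_replace, then aggregate per group so the duplicated series collapse into one stable series that still carries labels.

# Sketch: every series whose mountpoint sits under /my_data/ gets
# storage_group="/my_data", so the duplicated mounts collapse into a
# single series per instance while the instance and group labels survive.
max by (instance, storage_group) (
  label_replace(
    SIZE_QUERY,
    "storage_group", "$1",
    "mountpoint", "(/my_data)/.*"
  )
) > 80

Since SIZE_QUERY's own matcher already restricts it to /my_data mounts, this should produce exactly one alert per instance; mounts outside the shared group would need their own per-mountpoint rule.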

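Independent of the query shape, Alertmanager's route-level group_by setting can also fold many simultaneous alerts into a single notification, which may take some pressure off the notification pipeline even when the query itself still returns many series.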
