Please start by reading the documentation:
https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule
and the explanation I gave before:
https://groups.google.com/g/prometheus-users/c/yGW7JrO2aPQ/m/kcNHJWUwAAAJ

Then write and test your inhibit rule.

Then if it doesn't work, show the inhibit rules you've written, and 
examples of the alerts in question:
- the target alert (i.e. the one you want to suppress)
- the source alert (i.e. the one which should be suppressing it)

Make sure you include the full, unexpurgated set of labels on both.  You 
can get this from the alerts views in either Prometheus or Alertmanager.
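
For orientation only, here is a rough, untested sketch of the shape such 
rules could take. It assumes that every alert involved carries the "SNumber" 
label, that the router and hypervisor alerts are named "R-InstanceDown" and 
"H-InstanceDown" as earlier in this thread, and that the instance alert is 
called "I-InstanceDown" (a hypothetical name; substitute your own). The 
source_matchers/target_matchers syntax is, as far as I know, available from 
Alertmanager 0.22 onwards; older versions use source_match/target_match 
instead.

```yaml
inhibit_rules:
  # Router down at a site: suppress hypervisor and instance alerts
  # that carry the same SNumber value.
  - source_matchers:
      - 'alertname = "R-InstanceDown"'
    target_matchers:
      # "I-InstanceDown" is an assumed name for the instance alert.
      - 'alertname =~ "H-InstanceDown|I-InstanceDown"'
    equal: ['SNumber']

  # Hypervisor down at a site: suppress instance alerts
  # that carry the same SNumber value.
  - source_matchers:
      - 'alertname = "H-InstanceDown"'
    target_matchers:
      - 'alertname = "I-InstanceDown"'
    equal: ['SNumber']
```

Bear in mind that a label listed under "equal" which is missing from both 
alerts is treated as matching, so check that SNumber really is present on 
the alerts before relying on rules like these.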

On Tuesday, 30 August 2022 at 19:51:50 UTC+1 [email protected] wrote:

> Thanks Brian. "label_replace" did the magic and I am able to separate the 
> site numbers from the target using the below conf, and in Alertmanager I 
> grouped them by "SNumber". But I am not sure how to use the inhibit rules to 
> suppress the alerts. Can you help? 
>
> Also, I want to use the "SNumber" grouping to suppress the alerts only when 
> the Router/Hypervisor is down; in all other cases the alerts need to be 
> consolidated across all "SNumber" values. Is that something that can be 
> achieved? Please let me know.
>
>       - source_labels: [__param_target]
>         target_label: SNumber
>         regex: 'hyper.(.*)com'
>         replacement: '${1}'
>       - source_labels: [__param_target]
>         target_label: SNumber
>         regex: 'linux.(.*)com'
>         replacement: '${1}'
>       - source_labels: [__param_target]
>         target_label: SNumber
>         regex: 'win.(.*)com'
>         replacement: '${1}'
>       - source_labels: [__param_target]
>         target_label: SNumber
>         regex: 'Router.(.*)com'
>         replacement: '${1}'
>
>
> Thanks
> Sandosh
>
> On Monday, August 29, 2022 at 4:46:33 PM UTC-5 Sandosh Kumar P wrote:
>
>> Hi Brian,
>>
>> This is what the target files look like.
>>
>> *Hypervisor.yml:*
>> hyper.111.com
>> hyper.211.com
>> hyper.311.com
>>
>> *Instance.yml:*
>> linux.111.com
>> win.111.com
>> linux.211.com
>> win.211.com
>> linux.311.com
>> win.311.com
>>
>> *Router.yml:*
>> Router.111.com
>> Router.211.com
>> Router.311.com
>>
>>
>> Based on the above targets, it needs to group them by site number, 
>> like below:
>> *Group 111:*
>> hyper.111.com
>> linux.111.com
>> win.111.com
>> Router.111.com
>>
>> *Group 211:*
>> hyper.211.com
>> linux.211.com
>> win.211.com
>> Router.211.com
>>
>> *Group 311:*
>> hyper.311.com
>> linux.311.com
>> win.311.com
>> Router.311.com
>>
>>
>> And then it needs to alert based on their category 
>> (hypervisor/router/instances).
>>
>>    - If the router is down in Group 111 then it needs to suppress the 
>>    hypervisor and instance alerts.
>>    - If the hypervisor is down in Group 111 then it needs to suppress the 
>>    instance alerts.
>>    - If the routers of more than one group are down then it needs to 
>>    consolidate them and send one alert for those groups.
>>
>>
>> I am going through the documentation to understand label_replace and other 
>> features, but I am not finding many examples or use cases that fit my scenario.
>>
>>
>> Thanks
>> Sandosh
>> On Monday, August 29, 2022 at 3:53:36 PM UTC-5 Brian Candler wrote:
>>
>>> What do you mean by "added to the targets"?  Can you give some examples?
>>>
>>> If the instance label contains both the instance name and the site name 
>>> and the structure is clearly demarcated, then you can use the function 
>>> label_replace(), as I said before, to extract the part of interest.
>>>
>>> e.g. if the hypervisor's instance label is "hyper3-site1" then you can 
>>> use label_replace to match the pattern "-site<N>" and return just the 
>>> "site<N>" part.  But the exact details of how to do this depend on exactly 
>>> what you're doing.
>>>
>>> See the documentation here: 
>>> https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace
>>> The example given matches a label like *service="xxx:yyy"* and adds a 
>>> new label *foo="xxx"*.  That's pretty much exactly what you're trying 
>>> to do, if I understand you correctly.
>>>
>>> Use the PromQL browser in the Prometheus web interface to test your 
>>> expressions as you write them.
>>>
>>> On Monday, 29 August 2022 at 17:41:33 UTC+1 [email protected] wrote:
>>>
>>>> In my case there are multiple sites in different locations, and each 
>>>> site has a unique number which is added to the targets of its 
>>>> hypervisor, router and instances. When I create an additional label in the 
>>>> rules files, as in the previous configuration I shared, it groups 
>>>> all the sites' routers together, all hypervisors together and all 
>>>> instances together. 
>>>>
>>>> What I am trying to achieve is to group all the targets with the same 
>>>> site number together, and then on top of that separate the targets 
>>>> by type (hypervisor, router and instances). Since I am new to Prometheus I 
>>>> am getting stuck on how to separate them by the unique number first and 
>>>> then by type. 
>>>>
>>>> As for the inhibit rules, I will definitely make the changes you 
>>>> recommended. Let me know how I can achieve the above. 
>>>>
>>>>
>>>> On Thursday, August 25, 2022 at 10:25:52 AM UTC-5 Brian Candler wrote:
>>>>
>>>>> On Thursday, 25 August 2022 at 14:39:57 UTC+1 [email protected] 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Since our targets have unique naming per cluster (e.g. router111, 
>>>>>> router112, hypervisor111, hypervisor112, instance111, instance112), is 
>>>>>> there a way to group them based on their naming? For example, all nodes 
>>>>>> with 111 grouped together, all nodes with 112 grouped together, and so 
>>>>>> on. Please let me know. 
>>>>>>
>>>>>>
>>>>> You can use the label_replace 
>>>>> <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>
>>>>> function to extract the substring of interest into a new label.
>>>>>
>>>>> However I don't really understand what you're trying to do, because 
>>>>> presumably these are N:1 relationships (i.e. N VMs sharing one 
>>>>> hypervisor, and N hypervisors sharing one gateway router). If you have 
>>>>> router111, it won't be serving just a single hypervisor111 running a 
>>>>> single instance111.
>>>>>
>>>>>  
>>>>>
>>>>>> As per the below configuration, we are seeing only Router Down alerts 
>>>>>> if anything is added to the Router group, and it is suppressing even the 
>>>>>> valid alerts. Not sure what we are missing. 
>>>>>>
>>>>>> *...*
>>>>>>
>>>>>  
>>>>>
>>>>>> inhibit_rules:
>>>>>>   - source_match:
>>>>>>       severity: 'critical'
>>>>>>     target_match:
>>>>>>       severity: 'warning'
>>>>>>     equal: ['alertname', 'dev', 'instance']
>>>>>>
>>>>>>
>>>>> The problem is that you haven't thought about your inhibit rules.
>>>>>
>>>>> All that you've written says: suppress any alert with label 
>>>>> severity="warning", if there is any active alert with label 
>>>>> severity="critical" and matching values of alertname, dev and instance 
>>>>> labels.
>>>>>
>>>>> What you want is something different: e.g. suppress any alert with 
>>>>> label alertname="H-InstanceDown", if there is any active alert with label 
>>>>> alertname="R-InstanceDown" and matching values of whatever label you have 
>>>>> set to identify the "site" that both the router and the hypervisor are in. 
>>>>> It's up to you to write that in the form of an inhibit rule.
>>>>>
>>>>> Note that you can set additional labels on an alert, in the alerting 
>>>>> rule itself, if you need extra labels to be available to alertmanager.
>>>>>
>>>>
