I am trying to perform a scaling operation with Prometheus and Alertmanager. I have
configured a webhook receiver in Alertmanager to execute my ScaleUp and
ScaleDown actions. I have the following settings in my configuration:
evaluation_interval: 1m
scrape_interval: 1s
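For reference, both of these settings live under the global block of prometheus.yml; a minimal sketch, where the rule-file path is an assumption:

```yaml
# prometheus.yml (sketch; the rule-file path is hypothetical)
global:
  scrape_interval: 1s      # how often targets are scraped
  evaluation_interval: 1m  # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/rules/scaling.rules.yml  # hypothetical path
```

With this combination, metrics are sampled every second but the alert expressions are only checked once per minute.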
There is no grouping configured in Alertmanager. Below are the metrics used:
current_replica is a metric that gives the instantaneous number of pods
running.
min_replica and max_replica are metrics that cap scaling, so replicas are
never scaled down below min_replica or up above max_replica. The alert rules
are as follows:
1. Scaledown
expr:
((sum(rate(workmanager_completed_requests{weblogic_domainName="xface-domain",weblogic_clusterName="EPICluster",name="EPIClusterWorkManager",applicationName="WDTTestEAR"}[30s]))
< bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"}))
+ (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) >
bool sum(replica_exporter_min_count_cr{cluster_name="EPICluster"}))) == 2
[ shorthand: ((tps < bool 5) + (current_replicas > bool min_replicas)) == 2 ]
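For context, the ScaleDown expression above would normally live in a rule file shaped roughly like this; the alert name, group name, and the reduced set of label matchers are assumptions for illustration, not the actual file:

```yaml
# scaling.rules.yml (sketch; names are hypothetical)
groups:
  - name: scaling
    rules:
      - alert: ScaleDown
        expr: >
          ((sum(rate(workmanager_completed_requests{weblogic_clusterName="EPICluster"}[30s]))
              < bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"}))
           + (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"})
              > bool sum(replica_exporter_min_count_cr{cluster_name="EPICluster"})))
          == 2
```

Note that without a `for:` clause, a rule like this fires as soon as a single evaluation returns a result.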
2. Scaleup
expr:
((sum(rate(workmanager_completed_requests{applicationName="WDTTestEAR",name="EPIClusterWorkManager",weblogic_clusterName="EPICluster",weblogic_domainName="xface-domain"}[30s]))
> bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"}))
+ (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) <
bool sum(replica_exporter_max_count_cr{cluster_name="EPICluster"}))) == 2
[ shorthand: ((tps > bool 5) + (current_replicas < bool max_replicas)) == 2 ]
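A note on the pattern used in both expressions: the `bool` modifier makes each comparison return 0 or 1 instead of filtering, so adding two bool comparisons and requiring the sum to equal 2 encodes a logical AND. A minimal sketch with hypothetical metric names:

```yaml
# Each bool comparison yields 0 or 1; the outer `== 2` keeps the result
# only when BOTH conditions are true.
expr: >
  ((sum(tps_metric) > bool 5)
   + (sum(current_replica_metric) < bool sum(max_replica_metric)))
  == 2
```

An equivalent and arguably clearer formulation uses the `and` set operator: `sum(tps_metric) > 5 and sum(current_replica_metric) < sum(max_replica_metric)`.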
Consider the following scenario:
min_replica = 2
max_replica = 4
There is no load on the applications in the pod (tps = 0)
Step 1. With 2 pods running, the scaledown rule evaluates to false and no
alert is fired - expected behaviour
current_replica = 2
min_replica = 2
Step 2. Now I bring up one more replica, making current_replica 3. The
scaledown rule evaluates to true because tps < 5 and
current_replica > min_replica.
The alert goes into the "Firing" state and one replica is brought down.
The alert stays in the "Firing" state for that evaluation cycle - expected
behaviour
Step 3. After 1m (at the end of the previous evaluation cycle), the scaledown
rule still evaluates to true even though current_replica is now 2, and one
more replica is brought down; however, the alert state changes from
"Firing" to "Inactive" - unexpected behaviour
I don't understand why the alert rule evaluates to true in Step 3, when
the current number of running replicas is 2.
To confirm, after the scaledown in Step 2 I even checked the alert rule
expression in the Prometheus dashboard; its value does not evaluate to 2, so
the expression should return false. Below are the webhook logs, where I print
the rule evaluation value for every cycle (accessed from "annotations" under
rules); you can see that the metric value is 2 in both cycles.
Webhook logs:
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:47:26 finished handling scaledown
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:48:26 finished handling scaledown
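For completeness, the value the webhook prints is typically surfaced by templating `$value` into an annotation on the rule; a sketch, where the annotation key is an assumption:

```yaml
annotations:
  # `{{ $value }}` expands to the rule's evaluated value at firing time
  current_value: "ScaleDown Action Metric Value is {{ $value }}"
```

One thing to keep in mind when reading such logs: annotations are templated at the rule evaluation that produced (or last produced) the firing alert, so the value the webhook receives reflects that evaluation's result rather than the metrics at the moment the notification is delivered.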
Can someone help me understand this behaviour?
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/c3b33a7f-a561-48c4-be92-4f985920c3e1%40googlegroups.com.