I am trying to perform a scaling operation with Prometheus and Alertmanager. I have
configured a webhook receiver in Alertmanager to execute my ScaleUp and
ScaleDown actions. I have the following settings in my configuration:
evaluation_interval: 1m
scrape_interval: 1s
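For reference, both of these settings live under the global block of prometheus.yml; a minimal sketch, where the rule-file path is an assumption:

```yaml
# prometheus.yml (sketch; the rule-file path is hypothetical)
global:
  scrape_interval: 1s      # how often targets are scraped
  evaluation_interval: 1m  # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/rules/scaling.rules.yml  # hypothetical path
```

With this combination, metrics are sampled every second but the alert expressions are only checked once per minute.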
There is no grouping configured in Alertmanager. Below are the metrics used:
current_replica is a metric that gives the instantaneous number of pods
running.
min_replica and max_replica are metrics that cap scaling, so replicas are
never scaled down below min_replica or up above max_replica. The alert rules
are as follows:
1. Scaledown
expr:
((sum(rate(workmanager_completed_requests{weblogic_domainName="xface-domain",weblogic_clusterName="EPICluster",name="EPIClusterWorkManager",applicationName="WDTTestEAR"}[30s]))
< bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"}))
+ (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) >
bool sum(replica_exporter_min_count_cr{cluster_name="EPICluster"}))) == 2
[ shorthand: ((tps < bool 5) + (current_replicas > bool min_replicas)) == 2 ]
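For context, the ScaleDown expression above would normally live in a rule file shaped roughly like this; the alert name, group name, and the reduced set of label matchers are assumptions for illustration, not the actual file:

```yaml
# scaling.rules.yml (sketch; names are hypothetical)
groups:
  - name: scaling
    rules:
      - alert: ScaleDown
        expr: >
          ((sum(rate(workmanager_completed_requests{weblogic_clusterName="EPICluster"}[30s]))
              < bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"}))
           + (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"})
              > bool sum(replica_exporter_min_count_cr{cluster_name="EPICluster"})))
          == 2
```

Note that without a `for:` clause, a rule like this fires as soon as a single evaluation returns a result.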
2. Scaleup
expr:
((sum(rate(workmanager_completed_requests{applicationName="WDTTestEAR",name="EPIClusterWorkManager",weblogic_clusterName="EPICluster",weblogic_domainName="xface-domain"}[30s]))
> bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"}))
+ (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) <
bool sum(replica_exporter_max_count_cr{cluster_name="EPICluster"}))) == 2
[ shorthand: ((tps > bool 5) + (current_replicas < bool max_replicas)) == 2 ]
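A note on the pattern used in both expressions: the `bool` modifier makes each comparison return 0 or 1 instead of filtering, so adding two bool comparisons and requiring the sum to equal 2 encodes a logical AND. A minimal sketch with hypothetical metric names:

```yaml
# Each bool comparison yields 0 or 1; the outer `== 2` keeps the result
# only when BOTH conditions are true.
expr: >
  ((sum(tps_metric) > bool 5)
   + (sum(current_replica_metric) < bool sum(max_replica_metric)))
  == 2
```

An equivalent and arguably clearer formulation uses the `and` set operator: `sum(tps_metric) > 5 and sum(current_replica_metric) < sum(max_replica_metric)`.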
Consider the following scenario:
min_replica = 2
max_replica = 4
There is no load on the applications in the pod (tps = 0)
Step 1. With 2 pods running, the scaledown rule evaluates to false and no
alert is fired - expected behaviour
current_replica = 2
min_replica = 2
Step 2. Now I bring up one more replica, making current_replica 3. The
scaledown rule evaluates to true because tps < 5 and
current_replica > min_replica.
The alert goes into the "Firing" state and one replica is brought down.
The alert stays in the "Firing" state for that evaluation cycle - expected
behaviour
Step 3. After 1m (at the end of the previous evaluation cycle), the scaledown
rule still evaluates to true even though current_replica is now 2, and one
more replica is brought down; however, the alert state changes from
"Firing" to "Inactive" - unexpected behaviour
I don't understand why the alert rule evaluates to true in Step 3, when
the current number of running replicas is 2.
To confirm, after the scaledown in Step 2 I even checked the alert rule
expression in the Prometheus dashboard; its value does not evaluate to 2, so
the expression should return false. Below are the webhook logs, where I print
the rule evaluation value for every cycle (accessed from "annotations" under
rules); you can see that the metric value is 2 in both cycles.
Webhook logs:
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:47:26 finished handling scaledown
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:48:26 finished handling scaledown
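For completeness, the value the webhook prints is typically surfaced by templating `$value` into an annotation on the rule; a sketch, where the annotation key is an assumption:

```yaml
annotations:
  # `{{ $value }}` expands to the rule's evaluated value at firing time
  current_value: "ScaleDown Action Metric Value is {{ $value }}"
```

One thing to keep in mind when reading such logs: annotations are templated at the rule evaluation that produced (or last produced) the firing alert, so the value the webhook receives reflects that evaluation's result rather than the metrics at the moment the notification is delivered.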
Can someone help me understand this behaviour?
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/c3b33a7f-a561-48c4-be92-4f985920c3e1%40googlegroups.com.