Gehel has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/346710 )
Change subject: maps - increase number of retries before alert for postgresql lag check
......................................................................
maps - increase number of retries before alert for postgresql lag check
This should reduce the number of false positives, pending further
investigation into the root cause.
Bug: T162345
Change-Id: I13eef098a586ac770b76ecfa707ff2c4a4aaa045
---
M modules/role/manifests/maps/slave.pp
1 file changed, 7 insertions(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/10/346710/1
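
Note on the effect of the change: the sketch below illustrates why a higher
retry count suppresses short-lived lag spikes. It is a simplified model, not
the monitoring system's actual logic. It assumes that `retries` corresponds to
the number of consecutive non-OK results required before a notification is
sent (as with max_check_attempts in Nagios-style monitoring), and it ignores
retry intervals. The function name `alerts_fired` and the sample sequences are
hypothetical.

```python
def alerts_fired(results, max_attempts):
    """Count notifications for a sequence of check results.

    results: iterable of booleans, True meaning the check returned OK.
    max_attempts: consecutive non-OK results needed before alerting
                  (assumed to be what `retries` controls).
    """
    alerts = 0
    consecutive_failures = 0
    for ok in results:
        if ok:
            # Any OK result resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        # Notify once, when the streak first reaches the threshold.
        if consecutive_failures == max_attempts:
            alerts += 1
    return alerts


# A lag spike lasting four check intervals, then recovering on its own.
spike = [True, True, False, False, False, False, True, True]
# A sustained problem lasting twelve check intervals.
outage = [True] + [False] * 12 + [True]

print(alerts_fired(spike, 3))    # 1: the short spike alerts with a low threshold
print(alerts_fired(spike, 10))   # 0: the same spike is absorbed with 10 retries
print(alerts_fired(outage, 10))  # 1: a sustained problem still alerts
```

The trade-off in this model is that a genuine replication problem is also
reported later, since it must persist for the full number of attempts.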
diff --git a/modules/role/manifests/maps/slave.pp b/modules/role/manifests/maps/slave.pp
index b0ae6d7..1a9787f 100644
--- a/modules/role/manifests/maps/slave.pp
+++ b/modules/role/manifests/maps/slave.pp
@@ -17,8 +17,15 @@
$warning = 300
$command = "/usr/lib/nagios/plugins/check_postgres_replication_lag.py \
-U replication -P ${replication_pass} -m ${master} -D template1 -C ${critical} -W ${warning}"
+
+ # This check generates a number of alerts that recover quickly. Lag appears
+ # to jump suddenly from 0 to a high value (multiple hours) and then drop
+ # back to zero. Increasing the number of retries will reduce the number of
+ # false positives while we investigate a better solution. See T162345 for
+ # details.
nrpe::monitor_service { 'postgres-rep-lag':
description => 'Postgres Replication Lag',
nrpe_command => $command,
+ retries => 10,
}
}
--
To view, visit https://gerrit.wikimedia.org/r/346710
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I13eef098a586ac770b76ecfa707ff2c4a4aaa045
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Gehel <[email protected]>