Ema has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/337808 )
Change subject: varnish: icinga check for expiry mailbox lag
......................................................................
varnish: icinga check for expiry mailbox lag

We have found a correlation between the 503 errors described in T145661
and the Varnish expiry thread not being able to catch up with its
mailbox.

Add an Icinga check that alerts when the lag (exp_mailed minus
exp_received, as reported by varnishstat) exceeds 1000 (WARNING) or
10000 (CRITICAL).

Bug: T145661
Change-Id: I5e76b594d8c57fa9a679088c794b04c7879be715
---
A modules/varnish/files/check_varnish_expiry_mailbox_lag.sh
M modules/varnish/manifests/monitoring/instance.pp
2 files changed, 40 insertions(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/08/337808/1
diff --git a/modules/varnish/files/check_varnish_expiry_mailbox_lag.sh b/modules/varnish/files/check_varnish_expiry_mailbox_lag.sh
new file mode 100755
index 0000000..7b59455
--- /dev/null
+++ b/modules/varnish/files/check_varnish_expiry_mailbox_lag.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+/usr/bin/varnishstat -1 | awk '
+/exp_mailed/ { m = $2 }
+/exp_received/ { r = $2 }
+
+END {
+    msg = "expiry mailbox lag is "
+    lag = m - r
+
+    if (lag > 10000) {
+        print "CRITICAL: " msg lag
+        exit 2
+    }
+    else if (lag > 1000) {
+        print "WARNING: " msg lag
+        exit 1
+    } else {
+        print "OK: " msg lag
+        exit 0
+    }
+}'
diff --git a/modules/varnish/manifests/monitoring/instance.pp b/modules/varnish/manifests/monitoring/instance.pp
index f20feb3..e1d2e1f 100644
--- a/modules/varnish/manifests/monitoring/instance.pp
+++ b/modules/varnish/manifests/monitoring/instance.pp
@@ -4,4 +4,22 @@
         description   => "Varnish HTTP ${instance} - port ${port}",
         check_command => "check_http_varnish!varnishcheck!${port}",
     }
+
+    # We have found a correlation between the 503 errors described in T145661
+    # and the expiry thread not being able to catch up with its mailbox
+    file { '/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag':
+        ensure => present,
+        source => 'puppet:///modules/varnish/check_varnish_expiry_mailbox_lag.sh',
+        mode   => '0555',
+        owner  => 'root',
+        group  => 'root',
+    }
+
+    nrpe::monitor_service { 'check_varnish_expiry_mailbox_lag':
+        description    => "Check Varnish ${instance} expiry mailbox lag",
+        nrpe_command   => '/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag',
+        retry_interval => 30,
+        retries        => 3,
+        require        => File['/usr/local/lib/nagios/plugins/check_varnish_expiry_mailbox_lag'],
+    }
 }
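
The threshold logic of the check script above can be exercised without a
running Varnish instance. What follows is only a minimal sketch under stated
assumptions: classify_lag and sample are hypothetical helper names that are not
part of this change, and the sample input merely imitates the
"name value rate description" layout of "varnishstat -1" output. The awk body
and the thresholds (1000 and 10000) are taken from the patch.

```shell
#!/bin/sh

# Hypothetical helper (not part of the patch): reads varnishstat -1 style
# text on stdin and applies the same thresholds as the check script.
classify_lag() {
    awk '
    /exp_mailed/   { m = $2 }
    /exp_received/ { r = $2 }

    END {
        msg = "expiry mailbox lag is "
        lag = m - r

        if (lag > 10000) {
            print "CRITICAL: " msg lag
            exit 2
        } else if (lag > 1000) {
            print "WARNING: " msg lag
            exit 1
        } else {
            print "OK: " msg lag
            exit 0
        }
    }'
}

# Hypothetical helper: emits two counter lines in the assumed
# "name value rate description" layout. $1 = mailed, $2 = received.
sample() {
    printf 'MAIN.exp_mailed %s 0.00 Number of objects mailed to expiry thread\n' "$1"
    printf 'MAIN.exp_received %s 0.00 Number of objects received by expiry thread\n' "$2"
}

# A lag of 1500 is above the warning threshold but below the critical one.
sample 52000 50500 | classify_lag
echo "exit status: $?"
```

With the sample values above, the lag is 52000 - 50500 = 1500, so the sketch
prints "WARNING: expiry mailbox lag is 1500" followed by "exit status: 1",
matching the WARNING return code that Nagios-style plugins use.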
--
To view, visit https://gerrit.wikimedia.org/r/337808
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I5e76b594d8c57fa9a679088c794b04c7879be715
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Ema