Ori.livneh has uploaded a new change for review.
https://gerrit.wikimedia.org/r/88009
Change subject: Add Icinga check for l10nupdate & drop !log-based alerts
......................................................................
Add Icinga check for l10nupdate & drop !log-based alerts
Localisation update is the only job that uses !log to issue alerts, to my
knowledge, and it does so indiscriminately, spamming the server admin log with
both successes and failures. It's not clear to me that anyone takes its
failures very seriously, either. (The last run failed, for example.)
This patch adds an Icinga plug-in that checks the status of the localisation
cache. It issues a WARNING if the caches are more than 26 hours old, and a
CRITICAL if they are more than 50 hours old. The numbers were chosen because
l10nupdate runs once a day and takes some arbitrary fraction of an hour to
complete, and because failures are (in my anecdotal experience) typically given
another shot to self-correct before being debugged.
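As a rough illustration of the thresholds described above, the mapping from cache age to plug-in status could be sketched as follows. The function names `classify_cache_age` and `status_exit_code` are hypothetical and do not appear in the patch; the plug-in itself tests file ages with `find -mtime` (using fractional days) rather than whole hours.

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): map a cache age in whole
# hours to an Icinga status name, using the 26h / 50h thresholds from the
# commit message.
classify_cache_age() {
    local age_hours=$1
    if [ "$age_hours" -gt 50 ]; then
        echo "CRITICAL"
    elif [ "$age_hours" -gt 26 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Conventional plug-in exit codes: OK=0, WARNING=1, CRITICAL=2, UNKNOWN=3.
status_exit_code() {
    case "$1" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        *)        echo 3 ;;
    esac
}

# One missed daily run lands in WARNING; two missed runs land in CRITICAL.
for age in 20 30 60; do
    status=$(classify_cache_age "$age")
    echo "${age}h -> ${status} (exit $(status_exit_code "$status"))"
done
```

With a daily run, a cache that is 30 hours old has missed one update and a cache that is 60 hours old has missed two, which matches the "one more shot to self-correct" reasoning.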
With the Icinga check in effect, it would not be necessary (or appropriate) for
l10nupdate to use the SAL, so this patch also updates l10nupdate-1 to echo
verbose log messages to standard out instead. (The l10nupdate cron job
redirects stdout to a log file.)
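A minimal sketch of the logging approach described above, assuming a hypothetical `log_msg` helper and a temporary file standing in for the cron job's real log file (neither is part of this change):

```shell
#!/bin/bash
# Hypothetical stand-in for the messages l10nupdate-1 now echoes.
log_msg() {
    echo "LocalisationUpdate $*"
}

# Assumption: a temporary file stands in for the cron job's real log file,
# whose path is not shown in this change.
logfile=$(mktemp)

# Redirecting stdout, as the cron job does, captures successes and
# failures alike without touching the server admin log.
{
    log_msg "failed: git pull of /some/path failed"
    log_msg "completed at $(date)"
} >> "$logfile"

cat "$logfile"
rm -f "$logfile"
```

Because the messages only reach the log file, alerting is left entirely to the Icinga check.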
Change-Id: I6eca6c063319a2663bb25d76a92709103c9dd88a
---
A files/icinga/check_l10n_cache
M files/misc/l10nupdate/l10nupdate-1
M manifests/misc/deployment.pp
M templates/icinga/checkcommands.cfg.erb
4 files changed, 76 insertions(+), 12 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/09/88009/1
diff --git a/files/icinga/check_l10n_cache b/files/icinga/check_l10n_cache
new file mode 100755
index 0000000..28dc285
--- /dev/null
+++ b/files/icinga/check_l10n_cache
@@ -0,0 +1,53 @@
+#!/bin/bash
+#
+# Icinga plug-in for Wikimedia's MediaWiki localisation cache.
+#
+# This check should run on the host that runs l10nupdate. It ensures
+# that an up-to-date localisation cache directory exists for each
+# deployed version of MediaWiki. The check will report
+#
+# OK - If localisation caches are up-to-date.
+# WARNING - If the l10n cache files are more than 26 hours old.
+# CRITICAL - If the l10n cache files are more than 50 hours old.
+# UNKNOWN - If no MediaWiki versions are in use.
+# UNKNOWN - If no l10n cache directory exists for a version.
+#
+. /usr/local/lib/mw-deployment-vars.sh
+. $MW_COMMON_SOURCE/multiversion/MWRealm.sh
+
+versions=($(/usr/local/bin/mwversionsinuse))
+
+if [ -z "$versions" ]; then
+ echo "UNKNOWN: mwversionsinuse returned an empty list"
+ exit 3
+fi
+
+missing=()
+critical=()
+warning=()
+
+for version in "${versions[@]}"; do
+ l10n_dir="${MW_COMMON_SOURCE}/php-${version}/cache/l10n"
+ if [ ! -d "$l10n_dir" ]; then
+ missing+=("$version")
+ elif [ ! `find "${l10n_dir}" -mtime -2.1 -print -quit` ]; then
+ critical+=("$version")
+ elif [ ! `find "${l10n_dir}" -mtime -1.1 -print -quit` ]; then
+ warning+=("$version")
+ fi
+done
+
+IFS=,
+if [ -n "$critical" ]; then
+ echo "CRITICAL: localisation cache is more than 50 hours old -- ${critical[*]}"
+ exit 2
+elif [ -n "$warning" ]; then
+ echo "WARNING: localisation cache is more than 26 hours old -- ${warning[*]}"
+ exit 1
+elif [ -n "$missing" ]; then
+ echo "UNKNOWN: localisation cache directory is missing -- ${missing[*]}"
+ exit 3
+fi
+
+echo "OK: localisation caches are up-to-date."
+exit 0
diff --git a/files/misc/l10nupdate/l10nupdate-1 b/files/misc/l10nupdate/l10nupdate-1
index 95c939b..2328b8f 100755
--- a/files/misc/l10nupdate/l10nupdate-1
+++ b/files/misc/l10nupdate/l10nupdate-1
@@ -23,8 +23,7 @@
then
echo "Updated $path"
else
- $BINDIR/dologmsg "!log LocalisationUpdate failed: git pull of $path failed"
- echo "Updating $path FAILED."
+ echo "LocalisationUpdate failed: git pull of $path failed"
exit 1
fi
else
@@ -36,8 +35,7 @@
then
echo "Cloned $path"
else
- $BINDIR/dologmsg "!log LocalisationUpdate failed: git clone of $path failed"
- echo "Cloning $path FAILED."
+ echo "LocalisationUpdate failed: git clone of $path failed"
exit 1
fi
fi
@@ -47,8 +45,7 @@
# Get all MW message cache versions (and a wiki DB name for each)
mwVerDbSets=$($BINDIR/mwversionsinuse --extended --withdb)
if [ -z "$mwVerDbSets" ]; then
- $BINDIR/dologmsg "!log LocalisationUpdate failed: mwversionsinuse returned empty list"
- echo "Obtaining MediaWiki version list FAILED"
+ echo "LocalisationUpdate failed: mwversionsinuse returned empty list"
exit 1
fi
@@ -79,11 +76,9 @@
cp --preserve=timestamps --force /var/lib/l10nupdate/cache-"$mwVerNum"/l10n_cache-* $MW_COMMON_SOURCE/php-"$mwVerNum"/cache/l10n
echo "Syncing to Apaches"
$BINDIR/sync-l10nupdate-1 "$mwVerNum"
- $BINDIR/dologmsg "!log LocalisationUpdate completed ($mwVerNum) at `date`"
- echo "All done"
+ echo "LocalisationUpdate completed ($mwVerNum) at `date`"
else
- $BINDIR/dologmsg "!log LocalisationUpdate failed ($mwVerNum) at `date`"
- echo "FAILED"
+ echo "LocalisationUpdate failed ($mwVerNum) at `date`"
fi
done
@@ -93,5 +88,4 @@
for wiki in `<"$ALLDB"`; do
/usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php --wiki="$wiki"
done
-echo "All done"
-$BINDIR/dologmsg "!log LocalisationUpdate ResourceLoader cache refresh completed at `date`"
+echo "LocalisationUpdate ResourceLoader cache refresh completed at `date`"
diff --git a/manifests/misc/deployment.pp b/manifests/misc/deployment.pp
index 04f71ae..7742062 100644
--- a/manifests/misc/deployment.pp
+++ b/manifests/misc/deployment.pp
@@ -301,6 +301,19 @@
ensure => present;
}
+ file { '/usr/lib/nagios/plugins/check_l10n_cache':
+ source => 'puppet:///files/icinga/check_l10n_cache',
+ mode => '0755',
+ }
+
+ nrpe::monitor_service { 'l10nupdate':
+ ensure => 'present',
+ description => 'Ensure localisation caches are up-to-date',
+ nrpe_command => '/usr/lib/nagios/plugins/check_l10n_cache',
+ require => File['/usr/lib/nagios/plugins/check_l10n_cache'],
+ contact_group => 'admins',
+ }
+
file {
"${scriptpath}/l10nupdate":
owner => root,
diff --git a/templates/icinga/checkcommands.cfg.erb b/templates/icinga/checkcommands.cfg.erb
index 259022c..47c3c84 100644
--- a/templates/icinga/checkcommands.cfg.erb
+++ b/templates/icinga/checkcommands.cfg.erb
@@ -418,6 +418,10 @@
command_name check_eventlogging_jobs
command_line /usr/lib/nagios/plugins/check_eventlogging_jobs
}
+define command{
+ command_name check_l10n_cache
+ command_line /usr/lib/nagios/plugins/check_l10n_cache
+}
#Generic NRPE check
--
To view, visit https://gerrit.wikimedia.org/r/88009
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I6eca6c063319a2663bb25d76a92709103c9dd88a
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Ori.livneh <[email protected]>