Gehel has uploaded a new change for review.
https://gerrit.wikimedia.org/r/290487
Change subject: Increase time before alert for elasticsearch disk space issues
......................................................................
Increase time before alert for elasticsearch disk space issues
We are getting too many alerts about disk space on Elasticsearch. In most
cases, Elasticsearch will rebalance the shards correctly on its own. This is
an attempt to increase `max_check_attempts` to see if we can reduce the
number of alerts, while still getting them for cases where we need to react.
Change-Id: I13c0974af9b7a61f3a0dd61a0e290521c1748fa7
---
M hieradata/role/common/elasticsearch/server.yaml
M modules/base/manifests/monitoring/host.pp
2 files changed, 8 insertions(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/87/290487/1
diff --git a/hieradata/role/common/elasticsearch/server.yaml b/hieradata/role/common/elasticsearch/server.yaml
index 8593838..511d2f0 100644
--- a/hieradata/role/common/elasticsearch/server.yaml
+++ b/hieradata/role/common/elasticsearch/server.yaml
@@ -37,3 +37,9 @@
# T130329
base::monitoring::host::nrpe_check_disk_options: -w 18% -c 15% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs
+# We do expect Elasticsearch to get close to its high watermark. In most cases
+# it will rebalance shards fairly quickly and we do not need to react to this
+# alert. To reduce alert spam we increase the max_check_attempts.
+# retry_interval is configured by default at 1 minute, so we should get an
+# alert after max_check_attempts minutes.
+base::monitoring::host::nrpe_check_disk_max_check_attempts: 30
\ No newline at end of file
diff --git a/modules/base/manifests/monitoring/host.pp b/modules/base/manifests/monitoring/host.pp
index 752235c..7ad2283 100644
--- a/modules/base/manifests/monitoring/host.pp
+++ b/modules/base/manifests/monitoring/host.pp
@@ -27,6 +27,7 @@
# that are purposefully at 99%. Better ideas are welcome.
$nrpe_check_disk_options = '-w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs',
$nrpe_check_disk_critical = false,
+ $nrpe_check_disk_max_check_attempts = 3,
) {
include base::puppet::params # In order to be able to use some variables
@@ -113,6 +114,7 @@
description => 'Disk space',
critical => $nrpe_check_disk_critical,
nrpe_command => "/usr/lib/nagios/plugins/check_disk ${nrpe_check_disk_options}",
+ retries => $nrpe_check_disk_max_check_attempts,
}
nrpe::monitor_service { 'dpkg':
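
As a rough check of the timing described in the server.yaml comment, the sketch below estimates how long a persistently failing disk check takes before it alerts. It assumes Nagios/Icinga-style semantics, where the first failing check counts as attempt 1 and the remaining attempts are spaced retry_interval apart. The function name and the 1-minute default are illustrative and not part of this change.

```python
def minutes_until_hard_alert(max_check_attempts, retry_interval_minutes=1.0):
    """Estimate the delay between the first failing check and the alert.

    Assumption (not taken from the change itself): the first failure is
    attempt 1, and each further attempt happens retry_interval_minutes later,
    so only (max_check_attempts - 1) retry intervals elapse before the alert.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval_minutes


# Default from host.pp (3 attempts) versus the Elasticsearch override (30).
default_delay = minutes_until_hard_alert(3)
elasticsearch_delay = minutes_until_hard_alert(30)
print(default_delay, elasticsearch_delay)
```

Under that assumption the override delays the alert to roughly 29 minutes after the first failure, compared with about 2 minutes for the default of 3 attempts, which is close to the "max_check_attempts minutes" stated in the comment.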
--
Gerrit-MessageType: newchange
Gerrit-Change-Id: I13c0974af9b7a61f3a0dd61a0e290521c1748fa7
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Gehel <[email protected]>