Gehel has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/349168 )

Change subject: logstash - raise elasticsearch shard alert threshold to 34
......................................................................

logstash - raise elasticsearch shard alert threshold to 34

After some discussion, this seems to be a good compromise. It allows us to lose
one node (1/3 of the shards) without raising an alert.

Change-Id: I352e7230c6bd42f5d1a2fa84f33675e0a6ce8225
---
M modules/elasticsearch/manifests/nagios/check.pp
M modules/nagios_common/files/checkcommands.cfg
M modules/role/manifests/logstash/elasticsearch.pp
3 files changed, 21 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/68/349168/1

diff --git a/modules/elasticsearch/manifests/nagios/check.pp b/modules/elasticsearch/manifests/nagios/check.pp
index ce651c7..5723ed7 100644
--- a/modules/elasticsearch/manifests/nagios/check.pp
+++ b/modules/elasticsearch/manifests/nagios/check.pp
@@ -3,9 +3,15 @@
 # Make sure your Nagios/Icinga node has included
 # the elasticsearch::nagios::plugin class.
 #
-class elasticsearch::nagios::check {
+# [*threshold*]
+#   The percentage of inactive shards to check (initializing / relocating /
+#   unassigned).
+#   Default: 0.1
+class elasticsearch::nagios::check(
+    $threshold = '0.1',
+) {
     monitoring::service { 'elasticsearch shards':
-        check_command => 'check_elasticsearch_shards',
+        check_command => "check_elasticsearch_shards_threshold!${threshold}",
         description   => 'ElasticSearch health check for shards',
     }
 }
diff --git a/modules/nagios_common/files/checkcommands.cfg b/modules/nagios_common/files/checkcommands.cfg
index da9af15..fca3c15 100644
--- a/modules/nagios_common/files/checkcommands.cfg
+++ b/modules/nagios_common/files/checkcommands.cfg
@@ -477,6 +477,11 @@
     command_line    $USER1$/check_elasticsearch.py --ignore-status --url http://$HOSTADDRESS$:9200
 }
 
+define command {
+    command_name    check_elasticsearch_shards_threshold
+    command_line    $USER1$/check_elasticsearch.py --ignore-status --url http://$HOSTADDRESS$:9200 --shards-inactive $ARG1$
+}
+
 # Analytics Cluster Checks
 
 define command {
diff --git a/modules/role/manifests/logstash/elasticsearch.pp b/modules/role/manifests/logstash/elasticsearch.pp
index 26b97a9..3dadbe1 100644
--- a/modules/role/manifests/logstash/elasticsearch.pp
+++ b/modules/role/manifests/logstash/elasticsearch.pp
@@ -5,10 +5,17 @@
 #
 class role::logstash::elasticsearch {
     include ::standard
-    include ::elasticsearch::nagios::check
     include ::elasticsearch::monitor::diamond
     include ::base::firewall
 
+    # The logstash cluster has 3 data nodes, and each shard has 3 replicas (each
+    # shard is present on each node). If one node is lost, 1/3 of the shards
+    # will be unassigned, with no way to reallocate them on another node, which
+    # is fine and should not raise an alert. So the threshold needs to be > 1/3.
+    class { '::elasticsearch::nagios::check':
+        threshold => '34',
+    }
+
     if $::standard::has_ganglia {
         include ::elasticsearch::ganglia
     }
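
The threshold reasoning in the comment above can be checked with a little
arithmetic. The following is a minimal sketch, not the logic of
check_elasticsearch.py itself: the helper names inactive_percent and
would_alert are hypothetical, and the rule "alert when the inactive
percentage reaches the threshold" is an assumption made for illustration.

```python
def inactive_percent(total_nodes: int, lost_nodes: int) -> float:
    """Percentage of shard copies left unassigned when nodes are lost.

    Assumes every node holds one copy of every shard, as described for
    the 3-node logstash cluster.
    """
    return 100.0 * lost_nodes / total_nodes


def would_alert(inactive_pct: float, threshold_pct: float) -> bool:
    # Assumption: the check alerts once the inactive percentage reaches
    # the threshold. The real plugin's comparison may differ.
    return inactive_pct >= threshold_pct


one_node_lost = inactive_percent(3, 1)   # roughly 33.3
two_nodes_lost = inactive_percent(3, 2)  # roughly 66.7

print(would_alert(one_node_lost, 0.1))   # default threshold: alerts
print(would_alert(one_node_lost, 34))    # raised threshold: stays quiet
print(would_alert(two_nodes_lost, 34))   # losing two nodes still alerts
```

Under these assumptions, 34 is the smallest whole-number percentage above
1/3, so losing a single node stays quiet while losing a second node still
raises the alert.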

-- 
To view, visit https://gerrit.wikimedia.org/r/349168
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I352e7230c6bd42f5d1a2fa84f33675e0a6ce8225
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Gehel <guillaume.leder...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
