Elukey has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/391798 )

Change subject: profile::redis::jobqueue: stagger redis slave restarts
......................................................................


profile::redis::jobqueue: stagger redis slave restarts

In T163337 we spent a long time investigating why the Redis
Jobqueue shards get out of sync with their masters after
some hours of work. We didn't find a permanent fix for the
issue, since it would have involved a major Redis upgrade in
production and probably a review of all the Lua scripts that
we are currently running, so we added daily restarts of all
the Redis slaves at 1 AM.

Configuration example for a Redis shard:

rdb1001:shard1 --> master running in eqiad
rdb1002:shard1 --> slave of rdb1001:shard1 running in eqiad
rdb2001:shard1 --> slave of rdb1001:shard1 running in codfw
rdb2002:shard1 --> slave of rdb2001:shard1 running in codfw

At 1 AM all three slaves try to issue a SYNC to their masters,
and it seems that this puts pressure on the eqiad masters.
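
One way to observe whether a slave is still resyncing after a restart is to read the replication section of INFO, the same command the restart script uses for its role check. The helper below is a minimal sketch, not part of this change: `replication_field` is a hypothetical name and the sample values (including the host name) are made up for illustration.

```shell
#!/bin/bash
# replication_field FIELD
# Reads `redis-cli INFO replication` output on stdin and prints the value
# of FIELD (hypothetical helper, not part of the patch).
replication_field() {
    local field=$1
    # INFO output uses CRLF line endings, so strip the carriage returns first.
    tr -d '\r' | awk -v f="$field" \
        'index($0, f ":") == 1 { print substr($0, length(f) + 2) }'
}

# Made-up sample of what a slave might report.
sample=$'# Replication\r\nrole:slave\r\nmaster_host:rdb1001.example\r\nmaster_link_status:up\r\nmaster_sync_in_progress:0\r\n'

printf '%s' "$sample" | replication_field role
printf '%s' "$sample" | replication_field master_link_status
printf '%s' "$sample" | replication_field master_sync_in_progress
```

Against a live instance the input would come from `redis-cli ... INFO replication` instead of the sample string.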

This patch restarts the eqiad slaves at 1 AM and the codfw ones
at 2 AM. It also adds some sleep time between each Redis shard
restart, since restarting all the shards on a slave host in one
go triggers multiple SYNC requests to the master host's shards
(which might hurt its disk performance).
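
The timing this produces can be worked out with a little arithmetic. The sketch below assumes every instance passed to the script is a slave (so each one is restarted and followed by the pause) and ignores the time taken by the restart itself; the instance count is hypothetical, since the number of shards per host is not stated here.

```shell
#!/bin/bash
# Worst-case duration of one run of the restart script, in seconds.
max_initial_sleep=599   # $RANDOM % 600 yields a value between 0 and 599
pause_per_restart=180   # sleep after each restarted instance

# worst_case_seconds N: longest possible run for N slave instances
worst_case_seconds() {
    local instances=$1
    echo $(( max_initial_sleep + instances * pause_per_restart ))
}

# Hypothetical host with 4 Redis instances:
worst_case_seconds 4    # 599 + 4*180 = 1319 seconds, about 22 minutes
```

Under these assumptions a host with up to 16 instances (3479 seconds) finishes within the one-hour gap between the eqiad and codfw runs, while 17 instances (3659 seconds) could overlap with it.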

Bug: T179684
Change-Id: I58f1fb4b16f5947eecd0f89b075471e335e45de6
---
M modules/profile/files/redis/restart-redis-if-slave.sh
M modules/profile/manifests/redis/jobqueue.pp
M modules/profile/manifests/redis/jobqueue_slave.pp
3 files changed, 20 insertions(+), 2 deletions(-)

Approvals:
  Mobrovac: Looks good to me, but someone else must approve
  Elukey: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/modules/profile/files/redis/restart-redis-if-slave.sh b/modules/profile/files/redis/restart-redis-if-slave.sh
index 9627146..dc8c64a 100755
--- a/modules/profile/files/redis/restart-redis-if-slave.sh
+++ b/modules/profile/files/redis/restart-redis-if-slave.sh
@@ -1,6 +1,9 @@
 #!/bin/bash
 set -e
 
+# Random sleep to stagger execution of this script
+sleep $(($RANDOM % 600))
+
 # Check if currently a slave
 for instance in "$@";
 do
@@ -8,5 +11,8 @@
     authpass=$(awk '{if ($1 == "requirepass") print $2}' "$_config")
     if redis-cli -h 127.0.0.1 -p "$instance" -a "$authpass" INFO replication | grep -q role:slave; then
         systemctl restart "redis-instance-tcp_${instance}.service"
+        # Avoid multiple SYNC requests to the master shards at the same time
+        # (that might hit disk performances and slow down the master host).
+        sleep 180
     fi
 done
diff --git a/modules/profile/manifests/redis/jobqueue.pp b/modules/profile/manifests/redis/jobqueue.pp
index da37315..6d3fa67 100644
--- a/modules/profile/manifests/redis/jobqueue.pp
+++ b/modules/profile/manifests/redis/jobqueue.pp
@@ -15,9 +15,15 @@
     }
 
     $instance_str = join($::profile::redis::multidc::instances, ' ')
+
+    $restart_hour  = $::site ? {
+        'codfw'   => 2,
+        default   => 1,
+    }
+
     cron { 'jobqueue-redis-conditional-restart':
         command => "/usr/local/bin/restart-redis-if-slave ${instance_str}",
-        hour    => 1,
+        hour    => $restart_hour,
         minute  => 0,
     }
 }
diff --git a/modules/profile/manifests/redis/jobqueue_slave.pp b/modules/profile/manifests/redis/jobqueue_slave.pp
index 39b3f59..fbec65b 100644
--- a/modules/profile/manifests/redis/jobqueue_slave.pp
+++ b/modules/profile/manifests/redis/jobqueue_slave.pp
@@ -14,9 +14,15 @@
         group  => 'root',
     }
     $instance_str = join($::profile::redis::slave::instances, ' ')
+
+    $restart_hour  = $::site ? {
+        'codfw'   => 2,
+        default   => 1,
+    }
+
     cron { 'jobqueue-redis-conditional-restart':
         command => "/usr/local/bin/restart-redis-if-slave ${instance_str}",
-        hour    => 1,
+        hour    => $restart_hour,
         minute  => 0,
     }
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/391798

Gerrit-MessageType: merged
Gerrit-Change-Id: I58f1fb4b16f5947eecd0f89b075471e335e45de6
Gerrit-PatchSet: 12
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Elukey <ltosc...@wikimedia.org>
Gerrit-Reviewer: Elukey <ltosc...@wikimedia.org>
Gerrit-Reviewer: Giuseppe Lavagetto <glavage...@wikimedia.org>
Gerrit-Reviewer: Mobrovac <mobro...@wikimedia.org>
Gerrit-Reviewer: jenkins-bot <>