Elukey has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/391798 )
Change subject: profile::redis::jobqueue: stagger redis slave restarts
......................................................................

profile::redis::jobqueue: stagger redis slave restarts

In T163337 a long investigation was made to figure out why the Redis
Jobqueue shards get out of sync with their masters after some hours of
work. We didn't find a permanent fix for the issue, since it would have
involved a major Redis upgrade in production and probably a review of
all the Lua scripts that we are actually running, so we added daily
restarts of all the Redis slaves at 1 AM.

Configuration example for a Redis shard:

rdb1001:shard1 --> master running in eqiad
rdb1002:shard1 --> slave of rdb1001:shard1 running in eqiad
rdb2001:shard1 --> slave of rdb1001:shard1 running in codfw
rdb2002:shard1 --> slave of rdb2001:shard1 running in codfw

At 1 AM all three slaves try to issue a SYNC to their masters, and it
seems that this puts pressure on the eqiad masters.

This patch forces the eqiad slaves to be restarted at 1 AM and the codfw
ones at 2 AM. It also adds some sleep time between each Redis shard
restart, since restarting all the shards on a slave host in one go
triggers multiple SYNC requests to the master host shards (which might
hurt disk performance).

Bug: T179684
Change-Id: I58f1fb4b16f5947eecd0f89b075471e335e45de6
---
M modules/profile/files/redis/restart-redis-if-slave.sh
M modules/profile/manifests/redis/jobqueue.pp
M modules/profile/manifests/redis/jobqueue_slave.pp
3 files changed, 20 insertions(+), 2 deletions(-)

Approvals:
  Mobrovac: Looks good to me, but someone else must approve
  Elukey: Looks good to me, approved
  jenkins-bot: Verified


diff --git a/modules/profile/files/redis/restart-redis-if-slave.sh b/modules/profile/files/redis/restart-redis-if-slave.sh
index 9627146..dc8c64a 100755
--- a/modules/profile/files/redis/restart-redis-if-slave.sh
+++ b/modules/profile/files/redis/restart-redis-if-slave.sh
@@ -1,6 +1,9 @@
 #!/bin/bash
 set -e
 
+# Random sleep to stagger execution of this script
+sleep $(($RANDOM % 600))
+
 # Check if currently a slave
 
 for instance in "$@"; do
@@ -8,5 +11,8 @@
     authpass=$(awk '{if ($1 == "requirepass") print $2}' "$_config")
     if redis-cli -h 127.0.0.1 -p "$instance" -a "$authpass" INFO replication | grep -q role:slave; then
         systemctl restart "redis-instance-tcp_${instance}.service"
+        # Avoid multiple SYNC requests to the master shards at the same time
+        # (that might hit disk performances and slow down the master host).
+        sleep 180
     fi
 done
diff --git a/modules/profile/manifests/redis/jobqueue.pp b/modules/profile/manifests/redis/jobqueue.pp
index da37315..6d3fa67 100644
--- a/modules/profile/manifests/redis/jobqueue.pp
+++ b/modules/profile/manifests/redis/jobqueue.pp
@@ -15,9 +15,15 @@
     }
 
     $instance_str = join($::profile::redis::multidc::instances, ' ')
+
+    $restart_hour = $::site ? {
+        'codfw' => 2,
+        'default' => 1,
+    }
+
     cron { 'jobqueue-redis-conditional-restart':
         command => "/usr/local/bin/restart-redis-if-slave ${instance_str}",
-        hour => 1,
+        hour => $restart_hour,
         minute => 0,
     }
 }
diff --git a/modules/profile/manifests/redis/jobqueue_slave.pp b/modules/profile/manifests/redis/jobqueue_slave.pp
index 39b3f59..fbec65b 100644
--- a/modules/profile/manifests/redis/jobqueue_slave.pp
+++ b/modules/profile/manifests/redis/jobqueue_slave.pp
@@ -14,9 +14,15 @@
         group => 'root',
     }
     $instance_str = join($::profile::redis::slave::instances, ' ')
+
+    $restart_hour = $::site ? {
+        'codfw' => 2,
+        'default' => 1,
+    }
+
     cron { 'jobqueue-redis-conditional-restart':
         command => "/usr/local/bin/restart-redis-if-slave ${instance_str}",
-        hour => 1,
+        hour => $restart_hour,
         minute => 0,
     }
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/391798
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I58f1fb4b16f5947eecd0f89b075471e335e45de6
Gerrit-PatchSet: 12
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Elukey <ltosc...@wikimedia.org>
Gerrit-Reviewer: Elukey <ltosc...@wikimedia.org>
Gerrit-Reviewer: Giuseppe Lavagetto <glavage...@wikimedia.org>
Gerrit-Reviewer: Mobrovac <mobro...@wikimedia.org>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
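
Illustrative note, not part of the merged change: the timing that the two sleeps in restart-redis-if-slave.sh produce on a single host can be sketched with a small helper. The function name restart_offsets and the port numbers are hypothetical, and the sketch assumes every listed instance is a slave and that the restart itself takes no time, so the offsets are a lower bound for each later shard.

```shell
#!/bin/bash
# Sketch only: print "<port> <seconds after the cron start>" for each
# instance, given the initial random delay and the per-shard sleep.
# Usage: restart_offsets INITIAL_DELAY PER_SHARD_SLEEP PORT...
restart_offsets() {
    local initial_delay=$1 per_shard_sleep=$2
    shift 2
    local offset=$initial_delay
    local port
    for port in "$@"; do
        echo "${port} ${offset}"
        # The script sleeps after each restart, so the next shard
        # starts at least per_shard_sleep seconds later.
        offset=$((offset + per_shard_sleep))
    done
}

# Worst case for the random delay ($RANDOM % 600 is at most 599),
# with three hypothetical instance ports.
restart_offsets 599 180 6378 6379 6380
```

With these assumed inputs the last of the three shards is restarted no earlier than 959 seconds after the cron job fires, so a host's SYNC requests are spread over several minutes instead of arriving together.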