Jcrespo has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/270926

Change subject: Fix waiting for a binlog position when the binlog name has changed
......................................................................

Fix waiting for a binlog position when the binlog name has changed

When the binlog name has changed (for example, if the binlog format
changes or, more likely, because we are now using a different
master after a master failover), binlog comparison will always
return false. If we have a new binlog format we will never be able
to compare lag. In that case it is better to assume that the slave
has caught up with the older binlog format (as any failover not
doing that will break many things otherwise).

Not checking this is causing certain jobs to be retried over and
over on Wikimedia servers, as they are waiting for a master binlog
position that will never arrive.

A more general fix would require a complete refactoring of slave
lag checks, as the current checks will fail equally for multi-tier
slaves or active-active datacenters. For those, we would have to
move all slave lag checks to pt-heartbeat or base them on GTIDs.

However, this will fix the thousands of errors per minute that we
are currently suffering on Wikimedia servers, while not breaking
backwards compatibility (unlike newer methods of checking slave
lag).

Bug: T126436
Change-Id: I4e1529e6295bad0907fdc4e9817986ca6b4ddfb3
---
M includes/db/DatabaseMysqlBase.php
1 file changed, 21 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/core refs/changes/26/270926/1

diff --git a/includes/db/DatabaseMysqlBase.php b/includes/db/DatabaseMysqlBase.php
index c5aafea..0cb95f0 100644
--- a/includes/db/DatabaseMysqlBase.php
+++ b/includes/db/DatabaseMysqlBase.php
@@ -1440,6 +1440,14 @@
                        throw new InvalidArgumentException( "Position not an instance of " . __CLASS__ );
                }
 
+               // If the master has changed its binlog format, or we have performed a failover,
+               // assume the slave is up to date: T126436
+               $thisBinlogName = $this->getBinLogCommonName();
+               $thatBinlogName = $pos->getBinLogCommonName();
+               if ( $thisBinlogName && $thatBinlogName && $thisBinlogName !== $thatBinlogName ) {
+                       return true;
+               }
+
                $thisPos = $this->getCoordinates();
                $thatPos = $pos->getCoordinates();
 
@@ -1462,4 +1470,17 @@
 
                return false;
        }
+
+       /**
+        * @return string|bool
+        */
+       protected function getBinLogCommonName() {
+               // TODO: implement better replication monitoring with
+               // heartbeat or GTIDs
+               if ( preg_match( '!^(.+)\.(\d+)/(\d+)$!', (string)$this, $m ) ) {
+                       return $m[1];
+               }
+
+               return false;
+       }
 }
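
For reference, the comparison described in the commit message can be
sketched as a stand-alone script outside MediaWiki. This is only an
illustration under assumptions: the function names
splitBinlogPosition() and hasReachedPosition() and the sample host
names are hypothetical, and position strings are assumed to look like
"<name>.<index>/<offset>".

```php
<?php
// Hypothetical stand-alone helpers; these names are not part of MediaWiki.

/**
 * Split a position string such as "db1-bin.000123/456" into its parts.
 *
 * @param string $pos
 * @return array|bool [ common name, file index, offset ], or false if malformed
 */
function splitBinlogPosition( $pos ) {
	if ( preg_match( '!^(.+)\.(\d+)/(\d+)$!', $pos, $m ) ) {
		return [ $m[1], (int)$m[2], (int)$m[3] ];
	}

	return false;
}

/**
 * Decide whether a slave at $slavePos has reached $targetPos.
 *
 * @param string $slavePos
 * @param string $targetPos
 * @return bool
 */
function hasReachedPosition( $slavePos, $targetPos ) {
	$slave = splitBinlogPosition( $slavePos );
	$target = splitBinlogPosition( $targetPos );
	if ( !$slave || !$target ) {
		// Malformed input: nothing can be concluded
		return false;
	}

	// Different binlog names cannot be compared by coordinates, so
	// assume the slave has caught up (the behaviour described above)
	if ( $slave[0] !== $target[0] ) {
		return true;
	}

	// Same binlog name: compare the file index first, then the offset
	if ( $slave[1] !== $target[1] ) {
		return $slave[1] > $target[1];
	}

	return $slave[2] >= $target[2];
}

// Sample values are made up for illustration
var_dump( hasReachedPosition( 'db1-bin.000010/500', 'db1-bin.000010/400' ) );
var_dump( hasReachedPosition( 'db1-bin.000009/900', 'db1-bin.000010/400' ) );
var_dump( hasReachedPosition( 'db2-bin.000001/4', 'db1-bin.000010/400' ) );
```

The third call uses a different binlog name on each side, which is the
failover case: the coordinates are not comparable, so the sketch
reports the position as reached instead of waiting forever.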

-- 
To view, visit https://gerrit.wikimedia.org/r/270926
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I4e1529e6295bad0907fdc4e9817986ca6b4ddfb3
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/core
Gerrit-Branch: master
Gerrit-Owner: Jcrespo <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
