[Or: an outage report in three acts] So, what happened over the last couple of days that caused so many small issues with the replica databases? To make that clear, I'll explain a bit about how the replicas are structured.
At the dawn of time, the production replicas were set up as a small-scale copy of how production itself is set up, with the various project DBs split into seven "slices" to spread the load. Those seven slices ran on three (physical) servers, and each held a replica of its production equivalent. (This is what everyone saw as "s1" - "s7".) Now, in order to let tools that don't understand that more than one database can live on the same physical server work without needing adaptation (to ease the transition from the toolserver), I set up a set of ugly networking rules[1] that made those three servers appear to be seven different ones - allowing code to pretend that just changing the address gets you to a different server.

Enter MariaDB. Now, MariaDB is a very nice improvement for everyone: not only does it allow us to mirror /every/ slice on all three servers (allowing easy joins between databases), but it does so faster and more reliably than vanilla MySQL could, thanks to a new database engine (TokuDB). What this meant was that we no longer needed to run seven MySQL instances but just one per server, each holding a copy of every production database. So Sean (our DBA) set about converting our setup to MariaDB and merging the databases that used to live on each server into a single instance. This worked well, with only minor problems caused by some slight behaviour differences between MySQL and MariaDB, or between InnoDB (the previous database engine) and TokuDB. Two of the servers were converted that way, with the third soon to be done once the kinks were worked out[2].

Fast forward several weeks, and a second, unrelated issue was on the plate to fix. You see, of the three database servers, one had been set up in the wrong place in the datacenter[3]; it worked, but because of where it sat it kept needing special exceptions in the firewall rules, which was not only a maintenance burden but also error-prone and less secure. Fixing /that/ would be a simple thing: it only needed a short downtime while someone physically hauled the hardware from one place in the datacenter to another and changed its IP address. That went well, and in less than an hour the database was sitting happily in its new rack with its new IP address.

Now, at that point, the networking configuration needed to be changed anyway, and since the databases had been merged[4], it was obvious that this was the right time to rip out the ugly networking rules that had become no-ops and by now just added a layer of needless complexity. That also went well, except for one niggling detail[5]: the databases on the third server never /did/ get merged like the other two. Removing the networking rules had no effect on the first two (as expected), but now only the first of the three databases on the third server was accessible. Worse: it *looked* like the other two databases were still happily working, since you could apparently still connect to them (but ended up connected to the wrong one).

So the change was made, accompanied by some tests, and all seemed fine - because out of the dozen or so project databases I tested, I didn't happen to try connecting to one that used to live on the two (out of seven) slices that were no longer visible. Monday comes, panic ensues. In the end, we decided the fix was to merge the databases on the third server as well (that took around a day), and we're back to working status with everything done.

Like all good tales, this one has a moral[6]: no change is so obvious that it doesn't require careful planning.
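As an aside, the kind of double-check that might have caught this is easy enough to script. Here is a minimal sketch of such a check - pymysql, the slice aliases, the database names and the credentials below are all made up for illustration, not our actual setup - which connects to each of the seven slice addresses, asks the server which host and port actually answered, and verifies that a database expected to live on that slice is really visible there.

  # Rough sanity check for the slice aliases (illustrative only: alias names,
  # expected databases and credentials are invented, not the real configuration).
  import pymysql

  # Hypothetical mapping of slice alias -> one database expected to live there.
  EXPECTED = {
      "s1.example.labs": "enwiki_p",
      "s2.example.labs": "bgwiki_p",
      "s3.example.labs": "frwiki_p",
      "s4.example.labs": "commonswiki_p",
      "s5.example.labs": "dewiki_p",
      "s6.example.labs": "ruwiki_p",
      "s7.example.labs": "eswiki_p",
  }

  def check(alias, dbname):
      conn = pymysql.connect(host=alias, port=3306,
                             user="replica_user", password="replica_pass")
      try:
          with conn.cursor() as cur:
              # Which server (and which instance on it) did we actually reach?
              cur.execute("SELECT @@hostname, @@port")
              real_host, real_port = cur.fetchone()
              # Is the database we expect on this slice actually visible there?
              cur.execute("SHOW DATABASES LIKE %s", (dbname,))
              present = cur.fetchone() is not None
          return real_host, real_port, present
      finally:
          conn.close()

  if __name__ == "__main__":
      for alias, dbname in sorted(EXPECTED.items()):
          try:
              real_host, real_port, present = check(alias, dbname)
              status = "ok" if present else "MISSING %s" % dbname
              print("%-20s -> %s:%s %s" % (alias, real_host, real_port, status))
          except pymysql.MySQLError as exc:
              print("%-20s -> connection failed: %s" % (alias, exc))

Run right after a change like this one, a loop of that sort would presumably have flagged the two invisible slices on Friday afternoon rather than Monday morning.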
The disruption over the weekend was due only to the fact that I didn't take the time to double-check my assumptions because the change was "trivial". Or, as I learned while wiping the egg from my face, it would have *been* trivial if my assumptions had matched reality. Exit sysadmin, stage left, head hung low in shame at his hubris exposed.

-- Marc

[1] The "iptables rules" you may have heard mentioned on occasion. Basically, just a set of NAT rules to redirect faux IPs standing in for the servers to the right IP and port.
[2] Pay attention here, that's some skillful foreshadowing right there.
[3] Moved from one row of eqiad to another, for those keeping score.
[4] If you've been following along at home, you already see where this is heading.
[5] Also, the change was done on a Friday. "But it's just a trivial change!"
[6] Well, two morals if you count the "Don't do a change before you leave for the weekend!" beating I also gave myself.
