Thank you for the postmortem! Write-ups like this often contain very valuable lessons, and I am glad you chose to write it down and share it so openly. This deserves kudos!
On Tue, Sep 23, 2014 at 6:46 AM, Marc A. Pelletier <[email protected]> wrote:
> [Or, an outage report in three acts]
>
> So, what happened over the last couple of days that has caused so many
> small issues with the replica databases? In order to make that clear,
> I'll explain a bit about how the replicas are structured.
>
> At the dawn of time, the production replicas were set up as a
> small-scale copy of how production itself is set up, with the various
> project DBs split into seven "slices" to spread load. Those seven
> slices ran on three (physical) servers, and each held a replica of its
> production equivalent. (This is what everyone saw as "s1" - "s7".)
>
> Now, in order to allow tools that don't understand that more than one
> database can live on the same physical server to work without needing
> adaptation (to ease the transition from the Toolserver), I set up a set
> of ugly networking rules[1] that made those three servers appear to be
> seven different ones - allowing code to pretend that just changing the
> address gets you to a different server.
>
> Enter MariaDB.
>
> Now, MariaDB is a very nice improvement for everyone: not only does it
> allow us to mirror /every/ slice on all three servers (allowing easy
> joins between databases), but it does so faster and more reliably than
> vanilla MySQL could, thanks to a new database engine (TokuDB). What
> this meant is that we no longer needed to run seven MySQL instances,
> but just one per server, each having a copy of every production
> database.
>
> So, Sean (our DBA) set about converting our setup to MariaDB and
> merging the databases that used to live on each server into a single
> one. This worked well, with only minor problems caused by some slight
> behaviour differences between MySQL and MariaDB, or between InnoDB
> (the previous database engine) and TokuDB. Two of the servers were
> completed that way, with the third soon to be done once the kinks were
> worked out[2].
>
> Fast forward several weeks, and a second, unrelated issue was on the
> plate to fix. You see, of the three database servers, one had been set
> up in the wrong place in the datacenter[3]; it worked, but because it
> was there it kept needing special exceptions in the firewall rules,
> which was not only a maintenance burden but also error-prone and less
> secure.
>
> Fixing /that/ would be a simple thing; it only needs a short downtime
> while someone actually physically hauls the hardware from one place
> in the datacenter to another and changes its IP address.
>
> That went well, and in less than an hour the database was sitting
> happily in its new rack with its new IP address.
>
> Now, at that point, the networking configuration needed to be changed
> anyway, and since the databases had been merged[4], it was obvious
> that this was the right time to rip out the ugly networking rules that
> had become no-ops and by now just added a layer of needless
> complexity.
>
> That also went well, except for one niggling detail[5]: the databases
> on the third server never /did/ get merged like those on the other
> two. Removing the networking rules had no effect on the first two (as
> expected), but now only the first of the three databases on the third
> was accessible.
>
> Worse: it *looks* like the other two databases are still happily
> working, since you apparently can still connect to them (but end up
> connected to the wrong one).
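A small aside for tool authors, since this bit is the scary one: a connection that succeeds but lands on the wrong backend is cheap to detect if the tool asks the server who it is instead of trusting the address it dialled. A minimal sketch, assuming a Python tool using pymysql; the host alias, database name and ~/replica.my.cnf credentials path are just the usual conventions as far as I know, so adjust to whatever your tool actually uses:

    import os
    import pymysql

    # Placeholder host/database; use whatever your tool normally connects to.
    conn = pymysql.connect(
        host='enwiki.labsdb',
        db='enwiki_p',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
        charset='utf8',
    )
    with conn.cursor() as cur:
        # Ask the server which machine and database we actually reached,
        # rather than trusting the address we connected to.
        cur.execute("SELECT @@hostname, @@port, DATABASE()")
        hostname, port, db = cur.fetchone()
        print("connected to %s:%s, current database %s" % (hostname, port, db))

A one-line check like that at startup, logged somewhere, would at least make this kind of surprise show up on the tool's side right away.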
> So, the change is made, accompanied by some tests, and all seems fine,
> because, out of the dozen or so project databases I tested, I didn't
> happen to test connecting to a database that used to be on either of
> the two out of seven slices that were no longer visible.
>
> Monday comes, panic ensues. In the end, we decided to merge the
> databases on the third server as the fix (that took around a day), and
> we're back to working status with everything done.
>
> Like all good tales, this one has a moral[6]. No change is so obvious
> that it doesn't require careful planning. The disruption over the
> weekend was due only to the fact that I didn't take the time to
> double-check my assumptions because the change was "trivial".
>
> Or, as I learned while wiping the egg from my face, it would have
> *been* trivial if my assumptions had matched reality.
>
> Exit sysadmin, stage left, head hung low in shame at his exposed
> hubris.
>
> -- Marc
>
> [1] The "iptables rules" you may have heard mentioned on occasion.
> Basically, just a set of NAT rules to redirect faux IPs standing in
> for the servers to the right IP and port.
>
> [2] Pay attention here, that's some skillful foreshadowing right there.
>
> [3] Moved from one row of eqiad to another, for those keeping score.
>
> [4] If you've been following along at home, you already see where this
> is heading.
>
> [5] Also, the change was done on a Friday. "But it's just a trivial
> change!"
>
> [6] Well, two morals if you count the "Don't do a change before you
> leave for the weekend!" beating I also gave myself.
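On the brighter side of the same story: now that every slice is mirrored on each server, the "easy joins between databases" really are a single-connection affair for tools. Another rough sketch with pymysql; the host alias and the enwiki/Commons example are placeholders I picked for illustration, nothing official:

    import os
    import pymysql

    conn = pymysql.connect(
        host='enwiki.labsdb',  # placeholder alias; any of the merged servers
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
        charset='utf8',
    )
    with conn.cursor() as cur:
        # Both replica databases live in the same server instance now, so a
        # plain cross-database join works over one connection: for example,
        # local enwiki file pages whose titles also exist on Commons.
        cur.execute("""
            SELECT en.page_title
            FROM enwiki_p.page AS en
            JOIN commonswiki_p.page AS co
              ON co.page_title = en.page_title
             AND co.page_namespace = en.page_namespace
            WHERE en.page_namespace = 6
            LIMIT 5
        """)
        for (title,) in cur.fetchall():
            print(title)

Before the merge, a tool would have needed two connections and had to stitch the results together itself, so the extra day spent merging the third server was well worth it.

Thanks again for the write-up!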
