Second that; getting more visibility into how Labs is set up is very educational.
On Tue, Sep 23, 2014 at 9:14 AM, Denny Vrandečić <[email protected]> wrote:

> Thank you for the postmortem! Such reports often contain very valuable lessons, and I am glad you chose to write it down and share it so openly. This deserves kudos!
>
> On Tue, Sep 23, 2014 at 6:46 AM, Marc A. Pelletier <[email protected]> wrote:
>
>> [Or: an outage report in three acts]
>>
>> So, what happened over the last couple of days that caused so many small issues with the replica databases? In order to make that clear, I'll explain a bit about how the replicas are structured.
>>
>> At the dawn of time, the production replicas were set up as a small-scale copy of how production itself is set up, with the various project DBs split into seven "slices" to spread load. Those seven slices ran on three (physical) servers, and each held a replica of its production equivalent. (This is what everyone saw as "s1" - "s7".)
>>
>> Now, in order to allow tools that don't understand that more than one database can live on the same physical server to work without needing adaptation (to ease the transition from the Toolserver), I set up a set of ugly networking rules[1] that made those three servers appear to be seven different ones - allowing code to pretend that just changing the address gets you to a different server.
>>
>> Enter MariaDB.
>>
>> MariaDB is a very nice improvement for everyone: not only does it allow us to mirror /every/ slice on all three servers (allowing easy joins between databases), but it does so faster and more reliably than vanilla MySQL could, thanks to a new database engine (TokuDB). What this meant is that we no longer needed to run seven MySQL instances but just one per server, each having a copy of every production database.
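To make the "easy joins" point concrete, here is a minimal sketch of the kind of cross-database query a merged replica server allows; the hostname, credentials file, and table/column names below are illustrative assumptions, not a description of the actual Labs configuration.

    # Minimal sketch (illustrative only): with every slice mirrored on the
    # same server, one connection can join tables from different project
    # databases in a single query. Hostname, credentials file and schema
    # details are assumptions for the example.
    import pymysql

    conn = pymysql.connect(
        host="enwiki.labsdb",                   # hypothetical replica hostname
        read_default_file="~/replica.my.cnf",   # hypothetical credentials file
        database="enwiki_p",
    )
    with conn.cursor() as cur:
        # Look up the Wikidata item for a handful of enwiki articles by
        # joining enwiki_p.page against wikidatawiki_p.wb_items_per_site;
        # previously the two databases lived on separate instances.
        cur.execute("""
            SELECT p.page_title, w.ips_item_id
            FROM enwiki_p.page AS p
            JOIN wikidatawiki_p.wb_items_per_site AS w
              ON w.ips_site_id = 'enwiki'
             AND w.ips_site_page = REPLACE(p.page_title, '_', ' ')
            WHERE p.page_namespace = 0
            LIMIT 5
        """)
        for title, item_id in cur.fetchall():
            print(title, item_id)
    conn.close()

The same idea applies to any pair of project databases once they live on the same instance.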
>> So Sean (our DBA) set about converting our setup to MariaDB and merging the databases that used to live on every server into a single instance. This worked well, with only minor problems caused by some slight behaviour differences between MySQL and MariaDB, or between InnoDB (the previous database engine) and TokuDB. Two of the servers were completed that way, with the third soon to be done once the kinks were worked out[2].
>>
>> Fast forward several weeks, and a second, unrelated issue was on the plate to fix. You see, of the three database servers, one had been set up in the wrong place in the datacenter[3]; it worked, but because it was there it kept needing special exceptions in the firewall rules, which was not only a maintenance issue but was also error-prone and less secure.
>>
>> Fixing /that/ would be a simple thing; it only needs a short downtime while someone physically hauls the hardware from one place in the datacenter to another and changes its IP address.
>>
>> That went well, and in less than an hour the database was sitting happily in its new rack with its new IP address.
>>
>> Now, at that point, the networking configuration needed to be changed anyway, and since the databases had been merged[4], it was obvious that this was the right time to rip out the ugly networking rules that had become no-ops and by now just added a layer of needless complexity.
>>
>> That also went well, except for one niggling detail[5]: the databases on the third server never /did/ get merged like the other two. Removing the networking rules had no effect on the first two (as expected), but now only the first of the three databases on the third server was accessible.
>>
>> Worse: it *looks* like the other two databases are still happily working, since you apparently can still connect to them (but end up connected to the wrong one).
>>
>> So the change is made, accompanied by some tests, and all seems fine because, out of the dozen or so project databases I tested, I didn't happen to test connecting to a database that used to be on the two out of seven slices that were no longer visible.
>>
>> Monday comes, panic ensues. In the end, we decided to merge the databases on the third server as the fix (that took around a day), and we're back to working status with everything done.
>>
>> Like all good tales, this one has a moral[6]. No change is so obvious that it doesn't require careful planning. The disruption over the weekend was due only to the fact that I didn't take the time to double-check my assumptions because the change was "trivial".
>>
>> Or, as I learned while wiping the egg from my face, it would have *been* trivial if my assumptions had matched reality.
>>
>> Exit sysadmin, stage left, head hung low in shame at his hubris exposed.
>>
>> -- Marc
>>
>> [1] The "iptables rules" you may have heard mentioned on occasion. Basically, just a set of NAT rules to redirect faux IPs standing in for the servers to the right IP and port.
>>
>> [2] Pay attention here, that's some skillful foreshadowing right there.
>>
>> [3] Moved from one row of eqiad to another, for those keeping score.
>>
>> [4] If you've been following along at home, you already see where this is heading.
>>
>> [5] Also, the change was done on a Friday. "But it's just a trivial change!"
>>
>> [6] Well, two morals if you count the "Don't do a change before you leave for the weekend!" beating I also gave myself.
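As an aside on the "you can still connect, but to the wrong instance" failure described above: because the faux addresses kept accepting connections, a client only notices the problem if it checks what it actually reached. Below is a small illustrative sketch of one such check; the hostname, credentials file, and database name are assumptions for the example, not part of the setup described in the report.

    # Illustrative only: connect() succeeds even if the faux address now
    # routes to an instance that does not hold the database the tool wants,
    # so a cheap sanity check is to ask the server whether that database exists.
    import pymysql

    EXPECTED_DB = "dewiki_p"                     # hypothetical target database

    conn = pymysql.connect(
        host="dewiki.labsdb",                    # hypothetical faux address
        read_default_file="~/replica.my.cnf",    # hypothetical credentials file
    )
    with conn.cursor() as cur:
        cur.execute("SHOW DATABASES LIKE %s", (EXPECTED_DB,))
        if cur.fetchone() is None:
            raise RuntimeError(
                "Connected, but %s is not on this server; "
                "probably talking to the wrong instance." % EXPECTED_DB
            )
    conn.close()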
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
