Thanks! I know you're doing your best to deal with outages and
performance issues.
Out of curiosity, do you foresee the Foundation allocating some more
dedicated people/hardware for Labs?
On 22/02/2015 01:57, Andrew Bogott wrote:
On 2/20/15 8:07 AM, Ricordisamoa wrote:
Thank you.
I (and probably many others) would like someone from the Ops team to
elaborate on the uptime and general reliability Labs (especially
Tools) is supposed to provide, and on what kinds of services it is
suitable for, to prevent future misunderstandings regarding the loss
of important work, etc.
Hello!
I don't want to ignore your question, but I also don't exactly know
how to answer it. We're very unlikely to be able to project any kind
of future uptime percentage, because currently labs runs on few enough
servers that any attempt to predict uptime by multiplying failure
rates by server counts would produce such giant error bars as to be
useless.
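To make the error-bar point concrete (with made-up numbers, not actual
Labs figures): with only a handful of servers, the statistical
uncertainty on any observed failure rate is far larger than the rate
itself. A rough Python sketch of the idea:

import math

def wilson_interval(failures, trials, z=1.96):
    """95% Wilson score interval for an observed failure proportion."""
    p = failures / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z**2 / (4 * trials**2))
    return max(0.0, centre - margin), min(1.0, centre + margin)

# Illustrative only: suppose 1 of 10 servers failed in the past year.
low, high = wilson_interval(failures=1, trials=10)
print(f"Observed annual failure rate: 10%, "
      f"95% interval roughly {low:.0%} to {high:.0%}")
# -> roughly 2% to 40%; the error bars swamp the estimate.

With hundreds of servers that interval would tighten enough to be
useful; at the current scale, a projected uptime percentage would be
mostly noise.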
Nonetheless, I can recap our uptime and storage vulnerabilities so
that you know what to be wary of.
Bad news:
- Each labs instance is stored on a single server. If any one server
is destroyed in a catastrophe (a hard-drive crash, a blow from a
pickaxe, etc.), the state of all contained VMs will be suspended or, in
extreme cases, lost. [1]
- There are three full-time Operations staff members dedicated to
supporting labs. We don't cover all timezones perfectly, and
sometimes we take weekends and vacations. [2]
- Although the Tools grid engine is distributed among many instances
(and, consequently, many physical servers), actual tools usage relies
on several single points of failure, the most obvious of which is the
web proxy (see the sketch after this list). [3]
- All of labs currently lives in a single datacenter. It's a very
dependable datacenter, but nonetheless vulnerable to cable cuts,
fires, and other local disaster scenarios. [4]
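To make the single-point-of-failure arithmetic concrete (the component
names and availabilities below are assumed, purely for illustration):
when every request must pass through several components in series, the
overall availability is the product of the individual ones, so even a
few quite-reliable single points of failure noticeably drag the total
down.

# Assumed, made-up availabilities -- not measured Labs figures.
components = {
    "web proxy": 0.995,
    "grid master": 0.995,
    "NFS server": 0.99,
}
overall = 1.0
for name, availability in components.items():
    overall *= availability
print(f"Combined availability: {overall:.3f}")
# -> about 0.980, i.e. roughly 2% expected downtime even though each
#    component is individually at 99%+.

This is why the web proxy (and anything else in the critical path for
every request) matters far more than any one of the many instances the
grid is spread across.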
Good news:
- Problems like the Ghost vulnerability, which mandated a reboot of all
hardware in late January, are very rare.
- The cause of the outage on Tuesday was quite bad (and quite
unusual), but we were nevertheless able to recover from it without
data loss.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
- Yuvi has churned out a ton of great monitoring tools, which mean that
we're ever more aware of and responsive to incidents that might
precede outages.
- Use of Labs and Tools is growing like crazy! This means that the
Labs team is stretched a bit thin rushing to keep up, but I have a
hard time thinking of this as bad news.
I'm aware that this response is entirely qualitative, and that you
might prefer some actual quantities and statistics. I'm not reluctant
to provide those, but I simply don't know where to begin. If you have
any specific questions that would help address your particular
concerns, please don't hesitate to ask.
-Andrew
[1] This is consistent with a 'cattle, not pets' design pattern. For
example, all tools instances are fully puppetized and any lost
instance can be replaced with a few minutes' work. Labs users outside
of the Tools project should hew closely to this design model as well.
This vulnerability could be partially mitigated with something like
https://phabricator.wikimedia.org/T90364, but that has potential
downsides.
Note that data stored on shared NFS servers and in databases is highly
redundant and much less subject to destruction.
[2] Potential mitigation for this is obvious, but extremely expensive :(
[3] Theoretical mitigation for this is
https://phabricator.wikimedia.org/T89995, for which I would welcome a
Hackathon collaborator.
[4] I believe that there are plans in place for backup replication of
NFS and Database data to a second data center; I will let Coren and
Sean comment on the specifics.
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l