On 2/21/15 5:34 PM, Ricordisamoa wrote:
Thanks! I know you're doing your best to deal with outages and
performance issues.
Out of curiosity, do you foresee the Foundation allocating some more
dedicated people/hardware for Labs?
We have only just added a third full-time engineer, Yuvi. My preference
going forward is to distribute labs knowledge more widely through the
Ops team so that there are /many/ more people available to help in a pinch.
We've been documenting and scripting as much as we can to facilitate
that... if everything is still falling to just the three of us a few
months from now then we can start lobbying for a fourth dedicated engineer.
Labs isn't especially constrained by hardware limitations; it's much
more a question of human bandwidth to adequately manage what hardware we
have. The foundation has been quick to fund Labs hardware requests when
we make them -- the pain is generally in transition and management
rather than in limited financial resources. Case in point: a shiny
new pile of hard drives is the /cause/ of the outage in the subject line :)
-Andrew
On 2/22/15 1:57 AM, Andrew Bogott wrote:
On 2/20/15 8:07 AM, Ricordisamoa wrote:
Thank you.
I (and probably many others) would like someone from the Ops team to
elaborate on the uptime and general reliability Labs (especially
Tools) is supposed to have, and on what kinds of services it is
suitable for, to prevent future misunderstandings regarding loss of
important work, etc.
Hello!
I don't want to ignore your question, but I also don't exactly know
how to answer it. We're very unlikely to be able to project any kind
of future uptime percentage, because currently labs runs on few
enough servers that any attempt to predict uptime by multiplying
failure rates by server counts would produce such giant error bars as
to be useless.
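To make that concrete, here is a toy sketch (my own illustration, with
made-up numbers, not anything from our actual monitoring) of how wide
an exact Poisson confidence interval gets when you have only observed
a couple of failures:

    from scipy.stats import chi2

    # Hypothetical numbers, purely for illustration.
    failures = 2      # failures observed
    hours = 90 * 24   # over roughly 90 days

    # Exact (Garwood) 95% confidence interval on a Poisson count.
    low = chi2.ppf(0.025, 2 * failures) / 2
    high = chi2.ppf(0.975, 2 * (failures + 1)) / 2

    print("point estimate: %.5f failures/hour" % (failures / hours))
    print("95%% interval: %.5f to %.5f failures/hour"
          % (low / hours, high / hours))

The interval on the count runs from roughly 0.24 to 7.22 failures, a
~30x spread, so any uptime percentage projected from it would be more
error bar than estimate.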
Nonetheless, I can recap our uptime and storage vulnerabilities so
that you know what to be wary of.
Bad news:
- Each labs instance is stored on a single server. If any one server
is destroyed in a catastrophe (e.g. hard-drive crash, blow from a
pickaxe, etc.) the state of all contained VMs will be suspended or,
in extreme cases, lost. [1]
- There are three full-time Operations staff-members dedicated to
supporting labs. We don't cover all timezones perfectly, and
sometimes we take weekends and vacations. [2]
- Although the Tools grid engine is distributed among many instances
(and, consequently, many physical servers), actual tools usage relies
on several single points of failure, the most obvious of which is the
web proxy. [3]
- All of labs currently lives in a single datacenter. It's a very
dependable datacenter, but nonetheless vulnerable to cable cuts,
fires, and other local disaster scenarios. [4]
Good news:
- Problems like the Ghost vulnerability, which mandated a reboot of
all hardware in late January, are very rare.
- The cause of the outage on Tuesday was quite bad (and quite
unusual), but we were nevertheless able to recover from it without
data loss.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
- Yuvi has churned out a ton of great monitoring tools which mean
that we're ever more aware of and responsive to incidents that might
precede outages.
- Use of Labs and Tools is growing like crazy! This means that the
Labs team is stretched a bit thin rushing to keep up, but I have a
hard time thinking of this as bad news.
I'm aware that this response is entirely qualitative, and that you
might prefer some actual quantities and statistics. I'm not
reluctant to provide those, but I simply don't know where to begin.
If you have any specific questions that would help address your
particular concerns, please don't hesitate to ask.
-Andrew
[1] This is consistent with a 'cattle, not pets' design pattern. For
example, all tools instances are fully puppetized and any lost
instance can be replaced with a few minutes' work. Labs users
outside of the Tools project should hew closely to this design model
as well. This vulnerability could be partially mitigated with
something like https://phabricator.wikimedia.org/T90364 but that has
potential downsides.
Note that data stored on shared NFS servers and in databases is
highly redundant and much less subject to destruction.
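For a rough sense of what "a few minutes' work" looks like, here is an
illustrative sketch (hypothetical names, image, and flavor; assuming an
OpenStack-style API via python-novaclient) of booting a replacement
instance:

    from novaclient import client

    # Hypothetical credentials and endpoint, purely for illustration.
    nova = client.Client("2", "user", "secret", "someproject",
                         "https://example.org:5000/v2.0")

    # Pick a base image and size for the replacement.
    image = nova.images.find(name="ubuntu-trusty")
    flavor = nova.flavors.find(name="m1.medium")

    # Boot a fresh instance; because everything is puppetized, the new
    # VM configures itself on first boot and no per-instance state is
    # needed.
    nova.servers.create(name="tools-exec-99", image=image, flavor=flavor)

Everything that matters lives in puppet and on shared storage, so the
instance itself is disposable.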
[2] Potential mitigation for this is obvious, but extremely expensive :(
[3] Theoretical mitigation for this is
https://phabricator.wikimedia.org/T89995, for which I would welcome a
Hackathon collaborator.
[4] I believe that there are plans in place for backup replication of
NFS and Database data to a second data center; I will let Coren and
Sean comment on the specifics.
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l