Thanks! I know you're doing your best to deal with outages and performance issues. Out of curiosity, do you foresee the Foundation allocating some more dedicated people/hardware for Labs?

On 22/02/2015 01:57, Andrew Bogott wrote:
On 2/20/15 8:07 AM, Ricordisamoa wrote:
Thank you.
I (and probably many others) would like someone from the Ops team to elaborate on the uptime and general reliability Labs (especially Tools) is supposed to have, and on what kinds of services it is suitable for, to prevent future misunderstandings about loss of important work, etc.
Hello!

I don't want to ignore your question, but I also don't exactly know how to answer it. We're very unlikely to be able to project any kind of future uptime percentage, because Labs currently runs on so few servers that any attempt to predict uptime by multiplying failure rates by server counts would produce error bars too large to be useful.
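
To make the error-bar point concrete, here is a rough Python sketch with entirely hypothetical numbers (the fleet size and failure count are invented for illustration, not real Labs figures):

    # Hypothetical numbers: estimate a per-server annual failure rate
    # from a small fleet, then see how wide the resulting uncertainty is.
    import math

    servers = 10        # hypothetical fleet size
    failures = 1        # hypothetical failures observed in one year
    p_hat = failures / servers

    # Normal-approximation 95% confidence interval on the failure rate
    se = math.sqrt(p_hat * (1 - p_hat) / servers)
    low = max(0.0, p_hat - 1.96 * se)
    high = min(1.0, p_hat + 1.96 * se)

    # Probability of at least one server failing in a year, at each bound
    for p in (low, p_hat, high):
        print("p=%.2f -> P(at least one failure) = %.2f"
              % (p, 1 - (1 - p) ** servers))

With one observed failure among ten servers, the interval on the per-server rate runs from roughly 0% to 29%, so the implied chance of losing at least one server in a year spans from about 0% to about 97%: exactly the giant error bars described above.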

Nonetheless, I can recap our uptime and storage vulnerabilities so that you know what to be wary of.

Bad news:

- Each Labs instance is stored on a single server. If any one server is destroyed in a catastrophe (a hard-drive crash, a blow from a pickaxe, and so on), the state of all the VMs it hosts will be suspended or, in extreme cases, lost. [1]

- There are three full-time Operations staff members dedicated to supporting Labs. We don't cover all timezones perfectly, and sometimes we take weekends and vacations. [2]

- Although the Tools grid engine is distributed across many instances (and, consequently, many physical servers), actual Tools usage relies on several single points of failure, the most obvious of which is the web proxy; see the sketch after this list. [3]

- All of Labs currently lives in a single datacenter. It's a very dependable datacenter, but it is nonetheless vulnerable to cable cuts, fires, and other local disaster scenarios. [4]
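
Since the instance host, the web proxy, and the datacenter are all single points of failure in series, their availabilities multiply. A minimal sketch with hypothetical availability figures (invented for illustration, not measured Labs numbers):

    # Hypothetical availabilities for components in series: the overall
    # availability is the product, so each extra single point of failure
    # compounds the expected downtime.
    components = {
        "instance host": 0.999,   # hypothetical
        "web proxy": 0.999,       # hypothetical
        "datacenter": 0.9995,     # hypothetical
    }

    overall = 1.0
    for availability in components.values():
        overall *= availability

    hours_down = (1 - overall) * 365 * 24
    print("overall availability: %.4f (~%.0f hours/year of downtime)"
          % (overall, hours_down))

At these made-up figures, three nines on each of the two hosts plus three and a half nines on the datacenter still work out to roughly a day of downtime per year.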

Good news:

- Problems like the GHOST vulnerability, which mandated a reboot of all hardware in late January, are very rare.

- The cause of Tuesday's outage was quite bad (and quite unusual), but we were nevertheless able to recover from it without data loss: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage

- Yuvi has churned out a ton of great monitoring tools, which make us ever more aware of, and responsive to, incidents that might precede outages.

- Use of Labs and Tools is growing like crazy! This means that the Labs team is stretched a bit thin rushing to keep up, but I have a hard time thinking of this as bad news.

I'm aware that this response is entirely qualitative, and that you might prefer some actual quantities and statistics. I'm not reluctant to provide those, but I simply don't know where to begin. If you have any specific questions that would help address your particular concerns, please don't hesitate to ask.

-Andrew



[1] This is consistent with a 'cattle, not pets' design pattern. For example, all Tools instances are fully puppetized, and any lost instance can be replaced with a few minutes' work. Labs users outside the Tools project should hew closely to this design model as well. This vulnerability could be partially mitigated by something like https://phabricator.wikimedia.org/T90364, but that has potential downsides.

Note that data stored on the shared NFS servers and in databases is highly redundant and much less subject to destruction.

[2] Potential mitigation for this is obvious, but extremely expensive :(

[3] A theoretical mitigation for this is https://phabricator.wikimedia.org/T89995, for which I would welcome a Hackathon collaborator.

[4] I believe that there are plans in place for backup replication of NFS and database data to a second datacenter; I will let Coren and Sean comment on the specifics.

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l

